I am asking because you wrote that Metaculus' track record page showed an overall Brier score of 0.126 (community), but when I look now, filtering for questions resolved up until mid-March (my guess for when you looked at the numbers), the Metaculus track record page reports a much lower 0.092.
I think you forgot to select "all times" in the field "evaluated at". By default (when one opens the page), the questions are evaluated at resolution, which results in a lower Brier score. I get a Brier score of 0.126 when I set the latest resolution date to 13 March 2023 (the date on which I retrieved the data), and select "all times":
(A completely irrelevant fact I realized is that the worst-performing question on Metaculus seems to be whether FTX would default. I am not sure it means something, but it stood out sorely on that diagram and is a bit weird.)
Interesting! That was indeed the Metaculus community's worst prediction when assessed at resolution. For reference:
It probably means it is fair to say the FTX collapse was not entirely predictable! Incidentally, the worst prediction assessed at all times also involves crypto:
Hi Vasco, I hope you do not mind two follow-up questions: Why does Metaculus default to "resolve time" when in your analysis you think it is better to present "all times"? And given my goal of using Metaculus, which "evaluated at" setting should I pick?
The first vibe I get from this is that Metaculus is cherry-picking a method of evaluation that makes its predictions look better than they are. But then I think that it cannot be that bad; the crew behind Metaculus seems really scientifically minded and high integrity. So I guess the reason for the different methods is that they serve different purposes.
I then spent 10 minutes thinking about what the difference was, got a headache and thought I would ask you in case it takes you 2 minutes to respond or refer me to some online explanation.
My goal is to give "regular" people (university educated and well read, but who have not spent time thinking about risks or forecasting) confidence in Metaculus' ability to predict future catastrophes (>10% population decline in <5 years) as well as the source of these (these types of questions). I want to demonstrate to people that these are probably the best estimates available of what threats society and individuals are most likely to face in the coming decades, and therefore a good way to think about how to build resilience against these threats.
Thanks again for your excellent work and for your patience with my questions.
Why does Metaculus default to "resolve time" when in your analysis you think it is better to present "all times"? And given my goal of using Metaculus, which "evaluated at" setting should I pick?
The Brier score evaluated at "all times" applies to the whole period during which the question was open. It is the mean Brier score, i.e. the one I would expect to see if I selected a random time during which the question was open. I used it because it contains more information.
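To make the difference between the two settings concrete, here is a minimal sketch for a single binary question. It assumes a made-up forecast history and a simple time-weighted average; the function names and numbers are hypothetical, and Metaculus' exact calculation (which also averages across questions) may differ.

```python
from typing import List, Tuple

# (time at which the forecast became active, probability assigned to "yes")
Forecast = Tuple[float, float]


def brier(probability: float, outcome: int) -> float:
    """Brier score of one binary forecast: squared error against the 0/1 outcome."""
    return (probability - outcome) ** 2


def brier_at_last_forecast(history: List[Forecast], outcome: int) -> float:
    """Score only the final forecast (a stand-in for evaluating at resolve time)."""
    return brier(history[-1][1], outcome)


def brier_all_times(history: List[Forecast], close_time: float, outcome: int) -> float:
    """Time-weighted mean Brier score over the period the question was open.

    Assumption: each forecast stays active until the next one (or until close).
    """
    open_time = history[0][0]
    total = 0.0
    for i, (start, probability) in enumerate(history):
        end = history[i + 1][0] if i + 1 < len(history) else close_time
        total += brier(probability, outcome) * (end - start)
    return total / (close_time - open_time)


# Hypothetical question: open for 100 days, resolved "yes" (outcome = 1).
history = [(0.0, 0.50), (50.0, 0.80), (90.0, 0.95)]
close_time = 100.0
outcome = 1

print(brier_at_last_forecast(history, outcome))
print(brier_all_times(history, close_time, outcome))
```

In this made-up example the final forecast scores about 0.0025, while the time-weighted mean is about 0.141, because the early, less informed forecasts count for most of the period. This is the same direction as the 0.092 versus 0.126 gap discussed above.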
I think the setting one should pick depends on the context. If you are looking into:
A question which has already closed, but not yet resolved, I would pick "close time".
A question which is still open, I would check "all times", and "other time" matching your current conditions (for example, 1 year "prior to resolve time"). The less data I had for the "other time" option, the more weight I would give to "all times" (everything else equal).
I want to demonstrate to people these are probably the best estimates available of what threats society and individuals are most likely to face in the coming decades and therefore a good way to think about how to build resilience against these threats.
I think it is hard to know how reliable Metaculus' predictions will be with respect to these questions, as Metaculus' track record does not yet contain data about long-range questions. There are only 8 questions whose Brier score can be evaluated 5 years prior to resolve time. For communicating risk to your audience, one could try to make a case for the possibility of the next few decades being wild (if Metaculus' near-term predictions about AI are to be trusted), and the possibility of this being the most important century.
Thanks again for your excellent work and for your patience with my questions.
Thanks, Ulrik, I am glad you found it useful!
Thanks for the follow-up questions!
No worries; you are welcome!