Thank you so much for doing this, and especially for the recommendations, such as looking at scores for similar questions (I am using your excellent sheet as a base for calculating these).
Has something changed radically since you did your work? I am asking because you wrote that Metaculus’ track record page showed an overall community Brier score of 0.126, but when I look now, filtering for questions resolved up until mid-March (my guess for when you looked at the numbers), the Metaculus track record page reports a much lower 0.092. I have to set the date of resolution to ~mid-March 2021 to get a score of 0.124, which is closer to what you reported.
(A completely irrelevant fact I noticed: the worst-performing question on Metaculus seems to be whether FTX would default. I am not sure it means anything, but it stuck out like a sore thumb on that diagram and is a bit odd.)
Thanks, Ulrik, I am glad you found it useful!

> I am asking because you wrote that Metaculus’ track record page showed an overall community Brier score of 0.126, but when I look now, filtering for questions resolved up until mid-March (my guess for when you looked at the numbers), the Metaculus track record page reports a much lower 0.092.
I think you forgot to select “all times” in the field “evaluated at”. By default (when one opens the page), the questions are evaluated at resolution, which results in a lower Brier score. I get a Brier score of 0.126 when I set the latest resolution date to 13 March 2023 (the date on which I retrieved the data) and select “all times”.
> (A completely irrelevant fact I noticed: the worst-performing question on Metaculus seems to be whether FTX would default. I am not sure it means anything, but it stuck out like a sore thumb on that diagram and is a bit odd.)
Interesting! That was indeed the Metaculus community’s worst prediction when assessed at resolution. For reference:
It probably is fair to say the FTX collapse was not entirely predictable! Incidentally, the worst prediction assessed at all times also involves crypto:
Hi Vasco, I hope you do not mind two follow-up questions: Why does Metaculus default to “resolve time” when your analysis suggests it is better to present “all times”? And given my goal of using Metaculus, which “evaluated at” setting should I pick?
The first vibe I get from this is that Metaculus is cherry-picking a method of evaluation that makes their predictions look better than they are. But then I think it cannot be that bad: the crew behind Metaculus seem really scientifically minded and high-integrity. So I guess the reason for the different methods is that they serve different purposes.
I then spent 10 minutes thinking about what the difference was, got a headache, and thought I would ask you in case it takes you only 2 minutes to respond or to point me to an explanation online.
My goal is to give “regular” people (university-educated and well-read, but who have not spent time thinking about risks or forecasting) confidence in Metaculus’ ability to predict future catastrophes (>10% population decline in <5 years) as well as their sources (these types of questions). I want to demonstrate to people that these are probably the best available estimates of the threats society and individuals are most likely to face in the coming decades, and therefore a good way to think about how to build resilience against those threats.
Thanks again for your excellent work and for your patience with my questions.
Thanks for the follow-up questions!

> Why does Metaculus default to “resolve time” when your analysis suggests it is better to present “all times”? And given my goal of using Metaculus, which “evaluated at” setting should I pick?
The Brier score evaluated at “all times” applies to the whole period during which the question was open. It is the mean Brier score over that period, i.e. the one I would expect if I scored the community forecast at a randomly selected time while the question was open. I used it because it contains more information.
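To make the difference concrete, here is a minimal sketch of the two evaluation modes for a binary question (my own illustration, assuming a piecewise-constant forecast history and a simple time-weighted average; this is not Metaculus’ exact scoring code):

```python
# Minimal sketch of the two evaluation modes for a binary question.
# Assumption: each forecast holds until the next one (or until close);
# not Metaculus' exact implementation.

def brier(p: float, outcome: int) -> float:
    """Brier score of a single probability against a 0/1 outcome."""
    return (p - outcome) ** 2

def brier_at_resolution(history: list[tuple[float, float]], outcome: int) -> float:
    """Score only the final forecast before resolution."""
    _, last_p = history[-1]
    return brier(last_p, outcome)

def brier_all_times(history: list[tuple[float, float]], outcome: int,
                    close_time: float) -> float:
    """Time-weighted mean Brier score over the whole open period.

    history: (time, probability) pairs, sorted by time.
    """
    times = [t for t, _ in history] + [close_time]
    weighted = sum(brier(p, outcome) * (times[i + 1] - times[i])
                   for i, (_, p) in enumerate(history))
    return weighted / (close_time - times[0])

# A question open from t=0 to t=10 that resolved Yes, where the
# community sat at 10% for most of the period and only moved to 80%
# near the end:
history = [(0.0, 0.10), (8.0, 0.80)]
print(brier_at_resolution(history, outcome=1))               # 0.04
print(brier_all_times(history, outcome=1, close_time=10.0))  # 0.656
```

Since community forecasts tend to converge on the eventual answer as resolution approaches, evaluating only at resolution (the page’s default) systematically produces the lower score, which fits the 0.092 vs 0.126 gap discussed above.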
I think the setting one should pick depends on the context. If you are looking into:
- A question which has already closed, but not yet resolved: I would pick “close time”.
- A question which is still open: I would check “all times”, and “other time” matching your current conditions (for example, 1 year “prior to resolve time”). The less data I had for the “other time” option, the more weight I would give to “all times” (everything else equal); a crude way to formalize this weighting is sketched below.
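As a purely illustrative sketch of that last heuristic (the n/(n + k) shrinkage form, the constant k, and the example inputs are my own assumptions, not part of the analysis), one could blend the two estimates in proportion to how many resolved questions back the sparser one:

```python
# Hypothetical way to down-weight a sparse "other time" Brier estimate
# toward the better-sampled "all times" one. The shrinkage constant k
# is arbitrary; this heuristic is an illustration, not Vasco's method.

def blended_brier(other_time: float, n_other: int,
                  all_times: float, k: float = 20.0) -> float:
    """Put weight n/(n + k) on the sparse estimate, the rest on 'all times'."""
    w = n_other / (n_other + k)
    return w * other_time + (1 - w) * all_times

# With only 8 questions evaluable 5 years prior to resolve time, the
# blend stays close to the all-times figure of 0.126:
print(blended_brier(other_time=0.20, n_other=8, all_times=0.126))  # ~0.147
```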
> I want to demonstrate to people that these are probably the best available estimates of the threats society and individuals are most likely to face in the coming decades, and therefore a good way to think about how to build resilience against those threats.
I think it is hard to know how reliable Metaculus’ predictions will be for these questions, as Metaculus’ track record does not yet contain data on long-range questions: there are only 8 questions whose Brier score can be evaluated 5 years prior to resolve time. For communicating risk to your audience, one could try to make a case for the possibility of the next few decades being wild (if Metaculus’ near-term predictions about AI are to be trusted), and for the possibility of this being the most important century.
> Thanks again for your excellent work and for your patience with my questions.

No worries; you are welcome!