It is interesting that Metaculus' community predictions are 7.29 % (= 0.103/0.096 − 1) more accurate than Metaculus' predictions according to CRPS (continuous questions). In contrast, based on the log score, Metaculus' community predictions are 13.1 % (= 1 − 0.842/0.969) worse.
Do you have any thoughts on whether CRPS is preferable to log score? Intuitively, CRPS seems better because it relies on more information about the forecast. The log score only uses the probability density at the resolved value, whereas CRPS takes into account the whole CDF. I think it would be nice to add the CRPS to Metaculus' track record page.
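To make the "density at the resolved value" versus "whole CDF" distinction concrete, here is a minimal sketch. It assumes Gaussian forecasts and uses the standard closed-form CRPS for a normal distribution; the function names and the two example forecasts are made up for illustration and are not taken from Metaculus' implementation. The two forecasts are chosen to have the same density at the outcome, so they get the same log score but different CRPS.

```python
import math

def normal_logpdf(y, mu, sigma):
    """Log score (higher = better): log of the forecast density at the outcome."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def normal_crps(y, mu, sigma):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2) (lower = better)."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

outcome = 0.0

# Hypothetical forecast A: centred on the outcome, wide.
forecast_a = (0.0, 1.0)
# Hypothetical forecast B: narrower and off-centre, with mu chosen so that
# its density at the outcome equals that of forecast A.
forecast_b = (math.sqrt(math.log(2.0) / 2.0), 0.5)

log_a = normal_logpdf(outcome, *forecast_a)
log_b = normal_logpdf(outcome, *forecast_b)
crps_a = normal_crps(outcome, *forecast_a)
crps_b = normal_crps(outcome, *forecast_b)

print(f"log score A = {log_a:.4f}, B = {log_b:.4f}")
print(f"CRPS      A = {crps_a:.4f}, B = {crps_b:.4f}")
```

The log score cannot tell A and B apart because it is local; the CRPS penalises B for placing most of its mass away from the outcome.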
> This shows that even a perfectly calibrated forecaster will achieve a Brier score worse than 0.207 when the true probability of a question is between 30% and 70%.
This is probably a nitpick, but I am not sure I agree with your framing of "true probability". Even for a coin flip, where one usually says there is a 50 % chance of heads/tails, the true probability of heads/tails will be 0 or 1, in the sense that in theory one could predict the outcome with near certainty given all the relevant information. So I think I would prefer the term resilient probability, i.e. one that is unlikely to be updated further in response to new information (in the same way that one is unlikely to update away from 50-50 in a coin flip).
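The Brier claim quoted above can be sanity-checked with a short sketch. It assumes the single-outcome Brier convention (forecast minus outcome, squared, so 0 is perfect and 0.25 is what a constant 50 % forecast earns); under that assumption a forecaster who states the true probability p has an expected score of p(1 − p). The thread does not show how 0.207 was derived, so the code only checks that the expected score on a 30 % to 70 % grid is consistent with that bound.

```python
def expected_brier(forecast, true_prob):
    """Expected Brier score (lower = better) when the event occurs with
    probability `true_prob` and the forecaster states `forecast`."""
    score_if_yes = (1.0 - forecast) ** 2
    score_if_no = forecast ** 2
    return true_prob * score_if_yes + (1.0 - true_prob) * score_if_no

# A perfectly calibrated forecaster states the true probability itself,
# in which case the expectation reduces to p * (1 - p).
grid = [p / 100.0 for p in range(30, 71)]
calibrated_scores = [expected_brier(p, p) for p in grid]

best = min(calibrated_scores)   # attained at the ends of the range
worst = max(calibrated_scores)  # attained at p = 0.5

print(f"best expected Brier on [0.30, 0.70]:  {best:.3f}")
print(f"worst expected Brier on [0.30, 0.70]: {worst:.3f}")
```

Whatever one calls the underlying probability ("true" or "resilient"), the arithmetic is the same: the closer it is to 50 %, the worse the best achievable expected Brier score.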
For continuous questions we have that the Community Prediction (CP) is roughly on par with the Metaculus Prediction (MP) in terms of CRPS, but the CP fares better than the MP in terms of (continuous) log-score. Unfortunately the "higher = better" convention is used for the log score on the track record page; note that this is not the case for Brier or CRPS, where lower = better.
This difference is primarily due to the log-score punishing overconfidence more harshly than the CRPS: the CRPS (as computed here) is bounded between 0 and 1, while the (continuous) log-score is bounded "in the good direction" but can be arbitrarily bad. And, indeed, looking at the worst performing continuous AI questions shows that the CP was overconfident, which is further exacerbated by extremising. This hardly matters if the CRPS is already pretty bad, but it can matter a lot for the log-score.
This is not just anecdotal evidence; you can check this yourself on our track record page by filtering for (continuous) AI questions and checking the "surprisal function" in the "continuous calibration" tab.
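The boundedness argument can be illustrated with a rough sketch. The assumptions are loud ones: the question range is normalised to [0, 1], the forecast is a Gaussian, and the CRPS is taken as the integral of the squared difference between the forecast CDF and the outcome step function over that unit interval only. This is not necessarily how Metaculus computes it, and all names and numbers below are hypothetical. The forecast is confidently wrong (centred at 0.2, outcome 0.8) and gets progressively narrower.

```python
import math

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def normal_logpdf(x, mu, sigma):
    """Continuous log score (higher = better), computed without underflow."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def crps_on_unit_interval(mu, sigma, outcome, steps=20_000):
    """Midpoint-rule integral over [0, 1] of (F(x) - 1{x >= outcome})^2.
    The integrand never exceeds 1, so the result lies between 0 and 1."""
    dx = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * dx
        step = 1.0 if x >= outcome else 0.0
        total += (normal_cdf(x, mu, sigma) - step) ** 2 * dx
    return total

outcome = 0.8  # hypothetical resolution value
mu = 0.2       # hypothetical forecast centre, far from the outcome

results = []
for sigma in (0.2, 0.1, 0.05, 0.02):  # increasingly overconfident
    log_score = normal_logpdf(outcome, mu, sigma)
    crps = crps_on_unit_interval(mu, sigma, outcome)
    results.append((sigma, log_score, crps))
    print(f"sigma={sigma:<5} log score={log_score:9.2f}  CRPS={crps:.3f}")
```

As the forecast narrows, the CRPS saturates near the distance between the forecast centre and the outcome, while the log score keeps falling without bound. That is the sense in which an already bad CRPS barely moves under extremising but the log score can change a lot.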
> Do you have any thoughts on whether CRPS is preferable to log score?
Too many! Most are really about the tradeoffs between local and non-local scoring rules for continuous questions. (Very brief summary: There are so many tradeoffs! Local scoring rules like the log-score have more appealing theoretical properties, but in practice they likely add noise. Personally, I also find local scoring rules more intuitive, but most forecasters seem to disagree.)
I see where you're coming from with the "true probability" issue. To be honest I don't think there is a significant disagreement here. I agree it's a somewhat silly term (that's why I kept wrapping it in scare quotes), but I think (/hoped) it should be clear from context what is meant by it. (I'm pretty sure you got it, so yay!)
Overall, I still prefer to use "true probability" over "resilient probability" because a) I had to look it up, so I assume a few other readers would have to do the same and b) this just opens another can of worms: now we have to specify what information can be reasonably obtained ("what about the exact initial conditions of the coin flip?", etc.) in order to avoid a vacuous definition that renders everything bar 0 and 1 "not resilient".
I'm open to changing my mind though, especially if lots of people interpret this the wrong way.
> Unfortunately the "higher = better" convention is used for the log score on the track record page; note that this is not the case for Brier or CRPS, where lower = better.
I have corrected my initial comment (using strikethrough like this).
I am glad you did this. Thanks!
Thanks for the clarifications!