For continuous questions, the Community Prediction (CP) is roughly on par with the Metaculus Prediction (MP) in terms of CRPS, but the CP fares better than the MP in terms of the (continuous) log-score. Unfortunately, the track record page uses the “higher = better” convention for the log-score; note that this is not the case for Brier or CRPS, where lower = better.
This difference is primarily due to the log-score punishing overconfidence more harshly than the CRPS: the CRPS (as computed here) is bounded between 0 and 1, while the (continuous) log-score is bounded “in the good direction” but can be arbitrarily bad. And, indeed, the worst-performing continuous AI questions show that the CP was overconfident, and extremising exacerbates this further. This hardly matters if the CRPS is already pretty bad, but it can matter a lot for the log-score.
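To make the bounded-versus-unbounded point concrete, here is a minimal sketch, not the track record page's actual computation. It assumes the question range is normalised to [0, 1], that the forecast is a normal density renormalised on that range, and that CRPS is the integrated squared difference between the forecast CDF and the outcome's step function. All function names and numbers (`mu=0.3`, outcome `0.7`, the three `sigma` values) are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def grid_forecast(mu, sigma, n=2000):
    """Normal density at n bin midpoints on [0, 1], renormalised to integrate to 1.

    Assumption for illustration: the question range is normalised to [0, 1].
    """
    dx = 1.0 / n
    xs = [(i + 0.5) * dx for i in range(n)]
    dens = [normal_pdf(x, mu, sigma) for x in xs]
    total = sum(dens) * dx
    return xs, [d / total for d in dens], dx

def crps(xs, dens, dx, outcome):
    """Integral over [0, 1] of (CDF(x) - 1[x >= outcome])^2. Lower is better."""
    cdf = 0.0
    score = 0.0
    for x, d in zip(xs, dens):
        cdf += d * dx
        step = 1.0 if x >= outcome else 0.0
        score += (cdf - step) ** 2 * dx
    return score

def log_score(dens, dx, outcome):
    """Log of the forecast density in the bin containing the outcome. Higher is better."""
    idx = min(int(outcome / dx), len(dens) - 1)
    return math.log(dens[idx])

OUTCOME = 0.7  # hypothetical resolution, well away from the forecast centre
for sigma in (0.2, 0.05, 0.02):  # smaller sigma = more confident forecast
    xs, dens, dx = grid_forecast(mu=0.3, sigma=sigma)
    print(f"sigma={sigma}: CRPS={crps(xs, dens, dx, OUTCOME):.3f}, "
          f"log-score={log_score(dens, dx, OUTCOME):.1f}")
```

As the forecast narrows around the wrong value, the CRPS stays within [0, 1] and changes only modestly, while the log-score falls far more steeply.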
This is not just anecdotal evidence: you can check this yourself on our track record page by filtering for (continuous) AI questions and looking at the “surprisal function” in the “continuous calibration” tab.
> Do you have any thoughts on whether CRPS is preferable to log score?
Too many! Most are really about the tradeoffs between local and non-local scoring rules for continuous questions. (Very brief summary: There are so many tradeoffs! Local scoring rules like the log-score have more appealing theoretical properties, but in practice they likely add noise. Personally, I also find local scoring rules more intuitive, but most forecasters seem to disagree.)
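For readers unfamiliar with the term, a scoring rule is "local" if it depends only on the density the forecast assigns to the outcome that actually occurred. The sketch below illustrates this with two made-up histogram forecasts on a normalised [0, 1] range. Both put the same density on the outcome's bin, but one places the rest of its mass next to the outcome and the other places it far away. The bin layout, the densities, and the function names are all assumptions for illustration.

```python
import math

BIN_WIDTH = 0.1  # ten equal bins covering a normalised [0, 1] range (assumption)

def log_score(densities, outcome):
    """Local rule: only the density in the outcome's bin matters. Higher is better."""
    idx = min(int(outcome / BIN_WIDTH), len(densities) - 1)
    return math.log(densities[idx])

def crps(densities, outcome, steps_per_bin=100):
    """Non-local rule: integrates (CDF(x) - 1[x >= outcome])^2 over [0, 1]. Lower is better."""
    dx = BIN_WIDTH / steps_per_bin
    cdf = 0.0
    score = 0.0
    for b, d in enumerate(densities):
        for k in range(steps_per_bin):
            x = b * BIN_WIDTH + (k + 0.5) * dx
            cdf += d * dx
            step = 1.0 if x >= outcome else 0.0
            score += (cdf - step) ** 2 * dx
    return score

OUTCOME = 0.55  # falls in the bin covering [0.5, 0.6)

# Both hypothetical forecasts put density 2.0 (mass 0.2) on the outcome's bin.
near = [0, 0, 0, 0, 4, 2, 4, 0, 0, 0]  # remaining mass right next to the outcome
far = [4, 0, 0, 0, 0, 2, 0, 0, 0, 4]   # remaining mass at the edges of the range

for name, forecast in (("near", near), ("far", far)):
    print(f"{name}: log-score={log_score(forecast, OUTCOME):.3f}, "
          f"CRPS={crps(forecast, OUTCOME):.3f}")
```

The log-score cannot tell the two forecasts apart, whereas the CRPS rewards the one whose misplaced mass sits closer to the outcome. Whether that distance-sensitivity is a feature or a source of distortion is part of the tradeoff.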
I see where you’re coming from with the “true probability” issue. To be honest I don’t think there is a significant disagreement here. I agree it’s a somewhat silly term—that’s why I kept wrapping it in scare quotes—but I think (/hoped) it should be clear from context what is meant by it. (I’m pretty sure you got it, so yay!)
Overall, I still prefer to use “true probability” over “resilient probability” because a) I had to look the latter up, so I assume a few other readers would have to do the same, and b) it just opens another can of worms: now we have to specify what information can be reasonably obtained (“what about the exact initial conditions of the coin flip?”, etc.) in order to avoid a vacuous definition that renders everything bar 0 and 1 “not resilient”.
I’m open to changing my mind though, especially if lots of people interpret this the wrong way.
Thanks for the clarifications!
I have corrected my initial comment (using strikethrough like this).