[Disclaimer: I’m working for FutureSearch]
On some readings of your post, “forecasting” becomes very broad and just encompasses all of research.
To add another perspective: reasoning helps with aggregating forecasts. Just consider one of the motivating examples for extremising, where, IIRC, some US president is handed several (well-calibrated, say) estimates of ≈70% for P(head of some terrorist organisation is in location X). If these estimates came from independent sources, the aggregate ought to be higher than 70%, whereas if they are all based on the same few sources, 70% may be one’s best guess.
This is also something that a lot of forecasters may just do subconsciously when considering different points of view (which may be something as simple as different base rates or something as complicated as different AGI arrival models).
So from an engineering perspective there is a lot of value in providing rationales, even if they don’t show up in the final forecasts.
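To make the extremising example above concrete, here is a minimal sketch (not anyone's production aggregation method). It assumes a 50% prior and pools the estimates in log-odds space; the function name `extremised_aggregate` and the extremising factor `d` are hypothetical names chosen for illustration. With `d = 1` the sources are treated as fully redundant, and with `d = n` they are treated as conditionally independent.

```python
import math


def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1 - p))


def sigmoid(x):
    """Convert log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))


def extremised_aggregate(probs, d=1.0):
    """Average the forecasts in log-odds space, then scale by the factor d.

    d = 1 leaves the pooled forecast as is (sources share their evidence);
    d = len(probs) corresponds to conditionally independent sources
    under an assumed 50% prior.
    """
    mean_log_odds = sum(logit(p) for p in probs) / len(probs)
    return sigmoid(d * mean_log_odds)


estimates = [0.7, 0.7, 0.7]

# Same underlying sources: the aggregate stays at about 70%.
shared = extremised_aggregate(estimates, d=1.0)

# Independent sources: the aggregate moves well above 70%.
independent = extremised_aggregate(estimates, d=len(estimates))

print(f"shared evidence:      {shared:.3f}")
print(f"independent evidence: {independent:.3f}")
```

Which value of `d` is appropriate depends on how much the sources overlap, and that is exactly the information a rationale can supply.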
For continuous questions, the Community Prediction (CP) is roughly on par with the Metaculus Prediction (MP) in terms of CRPS, but the CP fares better than the MP in terms of (continuous) log-score. Unfortunately, the “higher = better” convention is used for the log-score on the track record page; note that this is not the case for Brier or CRPS, where lower = better.
This difference is primarily due to the log-score punishing overconfidence more harshly than the CRPS: the CRPS (as computed here) is bounded between 0 and 1, while the (continuous) log-score is bounded “in the good direction” but can be arbitrarily bad. And, indeed, looking at the worst-performing continuous AI questions shows that the CP was overconfident, which is further exacerbated when extremising. This hardly matters if the CRPS is already pretty bad, but it can matter a lot for the log-score.
This is not just anecdotal evidence: you can check this yourself on our track record page, filtering for (continuous) AI questions, and checking the “surprisal function” in the “continuous calibration” tab.
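The asymmetry between the two scores can be illustrated with a rough sketch. This is not the computation used on the track record page: it assumes a question range rescaled to [0, 1], a Gaussian forecast truncated to that range and discretised on a grid, and made-up numbers for the forecasts and the outcome. All function names are hypothetical.

```python
import math


def gaussian_on_unit_interval(mu, sigma, n_bins=1000):
    """Density of a Gaussian truncated to [0, 1], evaluated at bin centres."""
    dx = 1.0 / n_bins
    xs = [(i + 0.5) * dx for i in range(n_bins)]
    raw = [math.exp(-0.5 * ((x - mu) / sigma) ** 2) for x in xs]
    norm = sum(raw) * dx  # normalise so the density integrates to 1
    return xs, [r / norm for r in raw], dx


def crps(xs, density, dx, outcome):
    """Integral over [0, 1] of (CDF(x) - 1[x >= outcome])^2; lower is better."""
    cdf, total = 0.0, 0.0
    for x, d in zip(xs, density):
        cdf += d * dx
        step = 1.0 if x >= outcome else 0.0
        total += (cdf - step) ** 2 * dx
    return total


def log_score(density, dx, outcome):
    """Log of the forecast density at the outcome; higher is better."""
    idx = min(int(outcome / dx), len(density) - 1)
    return math.log(density[idx])


outcome = 0.9  # assumed resolution, far from both forecasts' centre

# A wide forecast and an overconfident (narrow) one, both centred on 0.5.
xs, wide, dx = gaussian_on_unit_interval(mu=0.5, sigma=0.2)
_, narrow, _ = gaussian_on_unit_interval(mu=0.5, sigma=0.02)

crps_wide, crps_narrow = crps(xs, wide, dx, outcome), crps(xs, narrow, dx, outcome)
log_wide, log_narrow = log_score(wide, dx, outcome), log_score(narrow, dx, outcome)

print(f"CRPS:      wide={crps_wide:.3f}  narrow={crps_narrow:.3f}")
print(f"log-score: wide={log_wide:.1f}  narrow={log_narrow:.1f}")
```

In this toy setup both CRPS values stay within [0, 1] and differ only modestly, while the log-score of the overconfident forecast is worse by a couple of orders of magnitude.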
> Do you have any thoughts on whether CRPS is preferable to log score?
Too many! Most are really about the tradeoffs between local and non-local scoring rules for continuous questions. (Very brief summary: There are so many tradeoffs! Local scoring rules like the log-score have more appealing theoretical properties, but in practice they likely add noise. Personally, I also find local scoring rules more intuitive, but most forecasters seem to disagree.)
I see where you’re coming from with the “true probability” issue. To be honest I don’t think there is a significant disagreement here. I agree it’s a somewhat silly term—that’s why I kept wrapping it in scare quotes—but I think (/hoped) it should be clear from context what is meant by it. (I’m pretty sure you got it, so yay!)
Overall, I still prefer to use “true probability” over “resilient probability” because a) I had to look the latter up, so I assume a few other readers would have to do the same and b) this just opens another can of worms: now we have to specify what information can be reasonably obtained (“what about the exact initial conditions of the coin flip?”, etc.) in order to avoid a vacuous definition that renders everything bar 0 and 1 “not resilient”.
I’m open to changing my mind though, especially if lots of people interpret this the wrong way.