This is an interesting analysis!
I agree with MaxRa’s point. When I skim-read “Metaculus pro forecasters were better than the bot team, but not with statistical significance”, I immediately internalised the message as “bots are getting almost as good as pros” (a message I probably already got from the post title!). It was only when I forced myself to slow down and read more carefully that I realised this isn’t what the result means: for example, you could have run this study on a single question, the stated result could still have been true, and yet it would tell you little either way about their relative performance. Only then did I notice that both main results were null results. So I’m not sure whether this actually supports the ‘Bots are closing the gap’ claim or not?
The histogram plot is really useful, and the points of reference are helpful too. I’d be interested to know what the histogram would look like if you compared pro human forecasters to average human forecasters on a similar set of questions. How big an effect do we see there? Or, to get more directly at what I’m wondering: how do bots compare to average human forecasters? Are they better with statistical significance, or not? Has this study already been done?
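To make the sample-size point concrete, here’s a rough sketch of what I mean (Python, with made-up per-question score differences rather than the study’s actual data, and a bootstrap test I’m choosing for illustration, not necessarily the test used in the post). With only a handful of questions the confidence interval on the mean score difference is so wide that “no significant difference” is nearly guaranteed, whatever the true skill gap.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(score_diffs, n_boot=10_000, alpha=0.05):
    """95% bootstrap CI for the mean per-question score difference
    (forecaster A minus forecaster B). If it straddles 0, the
    difference is not statistically significant at that level."""
    diffs = np.asarray(score_diffs, dtype=float)
    means = np.array([
        rng.choice(diffs, size=len(diffs), replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), (lo, hi)

# Hypothetical per-question score differences (pros minus bots);
# positive means the pros did better on that question.
few_questions = rng.normal(loc=2.0, scale=15.0, size=5)     # tiny sample
many_questions = rng.normal(loc=2.0, scale=15.0, size=500)  # large sample

for label, diffs in [("5 questions", few_questions),
                     ("500 questions", many_questions)]:
    mean, (lo, hi) = bootstrap_ci(diffs)
    print(f"{label}: mean diff = {mean:+.1f}, 95% CI = ({lo:+.1f}, {hi:+.1f})")
```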
Sorry I didn’t see this sooner! You and @MaxRa are right: the title is a bit dramatic; indeed, we got null results in both Q3 and Q4. The −8.9 head-to-head score (I like this scoring mechanism a lot) is pretty impressive in my opinion, but again it’s not statistically significant, and in any case Max’s point about effect size is well taken (−11.3 to −8.9).
We’ll take your feedback into account when we have the Q1 results!
On how bots compare to average human forecasters: several of the bots are certainly better than the median forecaster on Metaculus. But relative to the community prediction (which is a bit more complicated than a simple average of the forecasts on a given question), the bot team is worse, though again not with statistical significance. I think we’ll include a bots-vs-CP analysis in the Q1 post, or publish it separately soon.
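To give a rough feel for why an aggregate like the CP can differ from a plain average, here’s a toy sketch (Python, with invented forecasts and an illustrative recency weighting; this is not the actual CP implementation, just one example of a weighted aggregate):

```python
import numpy as np

def weighted_median(values, weights):
    """Median of `values` where each value counts with its weight."""
    order = np.argsort(values)
    v, w = np.asarray(values)[order], np.asarray(weights)[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

# Hypothetical forecasts on one binary question, oldest to newest.
forecasts = np.array([0.20, 0.25, 0.30, 0.55, 0.60])

# Illustrative recency weighting: newer forecasts count more.
# (The real community prediction uses its own weighting scheme.)
weights = np.linspace(0.2, 1.0, len(forecasts))

print("simple mean:     ", forecasts.mean())                      # 0.38
print("weighted median: ", weighted_median(forecasts, weights))   # 0.55
```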