I was excited by ForecastBench and FutureEval both projecting that LLMs would reach superforecaster parity by June 2027. But I didn’t realise access to human crowd forecasts might be driving a lot of performance. If it is, that is massively disappointing.
The top LLM performers in ForecastBench have access to the crowd forecast (and it’s not clear to me whether FutureEval hides crowd forecasts; Metaculus did for the Quarterly Cup in 2025, but I couldn’t find info about FutureEval). Skimming the literature with Claude, it seems like most studies either deliberately provide crowd forecasts or don’t prevent the model from searching for them, and those that hide them tend to report significantly worse results (still interesting, but less exciting).
To me, the potential wonder of LLM superforecasting is being able to get excellent guesses at any question I might come up with. If I need to already have a human crowd or market forecast for the guess to be any good, then the kind of LLM superforecasting being projected is about 10% as useful to me. I still expect ‘true’ parity eventually, but it becomes a story of general timelines rather than an empirical projection.
I don’t know the field well, and I’m probably misunderstanding something; I’m posting this in the hope of finding out that I’m wrong. If I’m right, then it’s worth dampening the expectations of anyone else who was imagining having an instant team of supers at their beck and call in ~14 months’ time.
Great catch; this seems important and I hadn’t realized it. The ForecastBench paper has some comparisons between humans and LLMs which probably don’t have access to human forecasts: in the paper’s tables, they’re the rows where “information provided” is “news.” These models don’t have open access to the internet; they can only pull summaries of news articles through a custom API, so unless those news articles cite prediction markets, the LLM isn’t getting information about prediction market forecasts.
The Brier score difference between LLM forecasters with and without access to human crowd forecasts is roughly the same as the Brier score difference between the superforecaster median and the public median. (Though I’m not sure how to interpret that; the Brier score is a weird metric.)
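For anyone else unsure what the metric means: the Brier score is just the mean squared error between probability forecasts and binary outcomes, so equal-sized gaps can mean very different things depending on how hard the underlying questions are. A minimal sketch with made-up numbers (not from the paper):

```python
# Brier score: mean squared error between probability forecasts and
# binary outcomes. 0 is a perfect score; lower is better.
def brier_score(forecasts, outcomes):
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Made-up numbers for illustration: a sharper forecaster beats a
# hedging one on the same five resolved questions.
outcomes = [1, 0, 1, 1, 0]
sharp    = [0.9, 0.1, 0.8, 0.7, 0.2]
hedging  = [0.6, 0.4, 0.6, 0.6, 0.5]

print(brier_score(sharp, outcomes))    # 0.038
print(brier_score(hedging, outcomes))  # 0.178
```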
Agreed, this seems like an important shortcoming of the existing research. I’d love to see future work that measures the accuracy of LLM forecasters with access to the internet but no access to prediction markets or human crowd forecasts. This could be implemented by instructing the LLM not to look at crowd forecasts when surfing the internet, then asking another LLM to verify that the instruction was followed, and resampling if not.
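A minimal sketch of what that loop could look like, assuming a generic `ask_llm(prompt) -> str` wrapper around whatever model is being tested (all names and prompts here are hypothetical, not from ForecastBench or any existing API):

```python
from typing import Callable

# Hypothetical sketch of the propose-verify-resample loop described above.
# `ask_llm` stands in for whatever chat-completion wrapper you use.
MAX_RETRIES = 5

FORECAST_PROMPT = (
    "Forecast the following question. You may browse the web, but you must "
    "NOT consult prediction markets or human crowd forecasts (e.g. Metaculus, "
    "Polymarket, Manifold).\n\nQuestion: {question}"
)
VERIFY_PROMPT = (
    "Below is a forecasting transcript. Did the forecaster consult any "
    "prediction market or human crowd forecast? Answer YES or NO.\n\n{transcript}"
)

def forecast_without_crowd(question: str, ask_llm: Callable[[str], str]) -> str:
    """Resample the forecaster until a verifier LLM finds no crowd-forecast usage."""
    for _ in range(MAX_RETRIES):
        transcript = ask_llm(FORECAST_PROMPT.format(question=question))
        verdict = ask_llm(VERIFY_PROMPT.format(transcript=transcript))
        if verdict.strip().upper().startswith("NO"):
            return transcript  # verifier found no contamination; accept this sample
    raise RuntimeError("No uncontaminated forecast obtained within retry budget.")
```

One caveat: rejection sampling like this only filters what the verifier can see in the transcript, so indirect contamination (e.g. a news article that quotes market odds) would slip through, which is the same caveat as with the news API above.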
Having followed a lot of AI benchmarks over the years, I’ve come away with two main heuristics for expert-parity claims: “prepare to be disappointed once you dig in”, alongside “but they were still useful in advancing understanding and progress”; cf. SemiAnalysis’ “Benchmarks are bad but we need to keep using them anyways” section for an outside-of-EA perspective. I’m also less bullish on long-range, poor-feedback-loop superforecasting more generally, for reasons along the lines of superforecaster Eli Lifland’s takes (esp. #2 and #4), Dan Luu’s appendix notes and comparisons to the actually-accurate futurists his review found, nostalgebraist on Metaculus badness, etc., which collectively reduce my enthusiasm for automating this.