Great catch, this seems important and I didn’t realize it. The ForecastBench paper has some comparisons between humans and LLMs where the LLMs probably don’t have access to human forecasts. In the tables below, these are the rows where “information provided” is “news.” Those models don’t have open access to the internet; they can only pull summaries of news articles through a custom API, so unless the articles themselves cite prediction markets, the LLM isn’t getting information about prediction market forecasts.
The Brier score difference between LLM forecasters with and without access to human crowd forecasts is roughly the same as the Brier score difference between the superforecaster median and the public median. (Though I’m not sure how to interpret that; Brier scores are quadratic in the error, so a fixed difference doesn’t translate cleanly into a fixed gap in skill.)
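For reference, the Brier score on binary questions is just the mean squared error between forecast probabilities and 0/1 outcomes. A quick sketch with made-up numbers (not ForecastBench data):

```python
# Brier score: mean squared error between forecast probabilities and outcomes.
# Lower is better; always guessing 0.5 scores 0.25.
def brier_score(forecasts, outcomes):
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Illustrative only: because the penalty is quadratic, the same raw gap
# (e.g. 0.02) can mean different things depending on where on the scale
# the two forecasters sit.
print(brier_score([0.7, 0.2, 0.9], [1, 0, 1]))  # ≈ 0.047
```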
Agreed, this seems like an important shortcoming of the existing research. I’d love to see future work that measures the accuracy of LLM forecasters with internet access but no access to prediction markets or human crowd forecasts. One way to implement this (sketched below): instruct the LLM not to look at crowd forecasts while browsing, then ask another LLM to verify that the instruction was followed, and resample if not.
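Here’s a minimal sketch of that instruct-verify-resample loop. `query_llm` is a hypothetical stand-in for whatever chat-completion call you’d use; none of this comes from the ForecastBench codebase.

```python
# Hypothetical sketch: get a forecast, check it for crowd-forecast
# contamination with a second LLM call, and resample on failure.
def forecast_without_crowd_data(question: str, query_llm, max_retries: int = 3) -> str:
    forecast_prompt = (
        f"Forecast this question: {question}\n"
        "You may browse the internet, but do NOT consult prediction markets "
        "or aggregated human forecasts. Cite your sources and reasoning."
    )
    for _ in range(max_retries):
        answer = query_llm(forecast_prompt)
        verdict = query_llm(
            "Did the following forecast rely on prediction markets or "
            f"aggregated human crowd forecasts? Answer YES or NO.\n\n{answer}"
        )
        if verdict.strip().upper().startswith("NO"):
            return answer
    raise RuntimeError("Could not obtain a forecast free of crowd-forecast data")
```

The obvious weak point is that the verifier only sees the final answer, so in practice you’d probably want it to audit the browsing trace rather than the forecast text.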