This is an interesting analysis!
I agree with MaxRa's point. When I skim-read "Metaculus pro forecasters were better than the bot team, but not with statistical significance" I immediately internalised the message as "bots are getting almost as good as pros" (a message I probably already got from the post title!), and it was only when I forced myself to slow down and read it more carefully that I realised this is not what the result means (for example, you could have run this study on only a single question, and this stated result could still have been true, while telling you very little either way about their relative performance). Only then did I notice that both main results were null results. I'm then not sure whether this actually supports the "Bots are closing the gap" claim or not?
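To make the sample-size point concrete, here is a minimal sketch (Python, with made-up Brier scores rather than the study's data): when a real but modest skill gap exists, it is essentially undetectable over a handful of questions and obvious over many, so "better, but not with statistical significance" mostly tells you about the number of questions, not about relative skill.

```python
# Illustrative only: hypothetical per-question Brier scores, NOT the study's data.
# Assumes pros really are slightly better (lower Brier score) than bots.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def p_value_for(n_questions: int) -> float:
    # Means and spread below are made-up numbers for illustration.
    pros = rng.normal(loc=0.18, scale=0.10, size=n_questions)
    bots = rng.normal(loc=0.22, scale=0.10, size=n_questions)
    return stats.ttest_ind(pros, bots).pvalue

for n in (2, 10, 100, 1000):  # with a single question the test isn't even defined
    print(f"{n:5d} questions -> p = {p_value_for(n):.3f}")
```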
The histogram plot is really useful, and the points of reference are helpful too. I'd be interested to know what the histogram would look like if you compared pro human forecasters to average human forecasters on a similar set of questions? How big an effect do we see there? Or maybe, to get more directly at what I'm wondering: how do bots compare to average human forecasters? Are they better with statistical significance, or not? Has this study already been done?
This take seems to contradict Francois Chollet's own write-up of the o3 ARC results, where he describes the results as:
(taken from your reference 52, emphasis mine)
You could write this off as him wanting to talk up the significance of his own benchmark, but I'm not sure that would be right. He has been very publicly sceptical of the ability of LLMs to scale to general intelligence, so this is a kind of concession from him. And he had already laid the groundwork in his Dwarkesh Patel interview to explain away high ARC performance as cheating if it tackled the problem in the wrong way, cracking it through memorization via an alternative route (e.g. auto-generating millions of ARC-like problems and training on those). He could easily have dismissed the o3 results on those grounds, but chose not to, which made an impression on me (a non-expert trying to decide how to weigh up the opinions of different experts). Presumably he is aware that o3 trained on the public dataset, and doesn't view that as cheating. The public dataset is small, and the problems are explicitly designed to resist memorization and to require general intelligence. Being told the solution to earlier problems is not supposed to help you solve later problems.
What's your take on this? Do you disagree with the write-up in [52]? Or do you think I'm mischaracterizing his position (there are plenty of caveats outside the bit I selectively quoted as well, so maybe I am)?
The fact that human-level ARC performance could only be achieved at extremely high inference-time compute cost seems significant too. Why would we get inference-time scaling if chain-of-thought consisted of little more than post-hoc rationalization, rather than real reasoning?
For context, I used to be pretty sympathetic to the "LLMs do most of the impressive stuff by memorization and are pretty terrible at novel tasks" position, and I still think this is a good model for the non-reasoning LLMs, but my views have changed a lot since the reasoning models arrived, particularly because of the ARC results.