It’s only 8 months later, and the top score on ARC-AGI-2 is now 54%.

In footnote 2 on this post, I said I wouldn’t be surprised if, on January 1, 2026, the top score on ARC-AGI-2 was still below 60%. It did turn out to be below 60%, although only by 6 percentage points. (Elon Musk’s prediction of AGI in 2025 was wrong, obviously.)
The score the ARC Prize Foundation ascribes to human performance is 100%, rather than 60%. 60% is the average for individual humans, but 100% is the score for a “human panel”, i.e. a set of at least two humans. Note the large discrepancy between the average individual human and the human panel. The human testers were random people off the street who got paid $115-150 to show up and then an additional $5 per task they solved. I believe the ARC Prize Foundation’s explanation for the 40-point discrepancy is that many of the testers just didn’t feel that motivated to solve the tasks and gave up. (I vaguely remember this being mentioned in a talk or interview somewhere.)
ARC’s Grand Prize requires scoring 85% (and abiding by certain cost/compute efficiency limits). They say the 85% target score is “somewhat arbitrary”.
I decided to go with the 60% figure in this post to go easy on the LLMs.
If you haven’t already, I recommend looking at some examples of ARC-AGI-2 tasks. Notice how simple they are: these are just little puzzles, nothing especially complex. Anyone can do one in a few minutes, even a kid. It helps to see what we’re actually measuring here.
The computer scientist Melanie Mitchell has a great recent talk on this. The whole talk is worth watching, but the part about ARC-AGI-1 and ARC-AGI-2 starts at 21:50. She gives examples of the sort of mistakes LLMs (including o1-pro) make on ARC-AGI tasks and on her team’s variations on them. These are really, really simple mistakes. Looking at the example tasks and the example mistakes gives a real sense of how rudimentary LLMs’ capabilities are.
I’m interested to see what happens when ARC-AGI-3 launches. ARC-AGI-3 is interactive, and there is more variety in its tasks. Just as AI models themselves need to be iterated on, so do benchmarks; it’s difficult to make a perfect product or technology on the first try. So, hopefully François Chollet and his colleagues will make better and better benchmarks with each new version of ARC-AGI.
Unfortunately, the AI researcher Andrej Karpathy has been saying some pretty discouraging things about benchmarks lately. From a November tweet:
> I usually urge caution with public benchmarks because imo they can be quite possible to game. It comes down to discipline and self-restraint of the team (who is meanwhile strongly incentivized otherwise) to not overfit test sets via elaborate gymnastics over test-set adjacent data in the document embedding space. Realistically, because everyone else is doing it, the pressure to do so is high.
I guess the most egregious publicly known example of an LLM company juicing its numbers on benchmarks was when Meta gamed (cheated on?) some benchmarks with Llama 4. Meta’s former chief AI scientist, Yann LeCun, said in a recent interview that Mark Zuckerberg “basically lost confidence in everyone who was involved in this” (which didn’t include LeCun, who worked in a different division). Many of those involved have since departed the company.
However, I don’t know where LLM companies draw the line between acceptable and unacceptable gaming (or cheating). For instance, I don’t know whether LLM companies are creating their own versions of ARC-AGI-2 tasks and training on those. It may be that the more an LLM company pays attention to and cares about a benchmark, the less meaningful a measurement it is (and vice versa).
Karpathy again, this time in his December LLM year in review post:

> Related to all this is my general apathy and loss of trust in benchmarks in 2025. The core issue is that benchmarks are almost by construction verifiable environments and are therefore immediately susceptible to RLVR and weaker forms of it via synthetic data generation. In the typical benchmaxxing process, teams in LLM labs inevitably construct environments adjacent to little pockets of the embedding space occupied by benchmarks and grow jaggies to cover them. Training on the test set is a new art form.
I think one of the best measures of AI capabilities is AI’s ability to do economically useful or valuable tasks, in real world scenarios, in ways that increase productivity or generate profit. This is a more robust test: it isn’t automatically gradable, and it would be very difficult to game or cheat on. To misuse the roboticist Rodney Brooks’ famous phrase, “The world is its own best model.” Rather than test on some simplified, contrived proxy for real world tasks, why not just test on real world tasks?
Moreover, someone has to pay for people to create benchmarks, and to maintain, improve, and operate them. There isn’t a ton of money to do so, especially not for benchmarks like ARC-AGI-2. But there’s basically unlimited money incentivizing companies to measure productivity and profitability, and to try out allegedly labour-saving technologies. After the AI bubble pops (which it inevitably will, probably sometime within the next 5 years or so), this may become less true. But for now, companies are falling over themselves to try to implement and profit from LLMs and generative AI tools. So, funding to test AI performance in real world contexts is currently in abundant supply.
> The human testers were random people off the street who got paid $115-150 to show up and then an additional $5 per task they solved. I believe the ARC Prize Foundation’s explanation for the 40-point discrepancy is that many of the testers just didn’t feel that motivated to solve the tasks and gave up [my emphasis]. (I vaguely remember this being mentioned in a talk or interview somewhere.)
I’m sceptical of this given that they were able to earn $5 for every couple of minutes’ work (the typical time to solve a task), which works out to well over $100 an hour, far above the average hourly wage.
> 100% is the score for a “human panel”, i.e. a set of at least two humans.
Also seems very remarkable (suspect, in fact): this would mean almost no overlap between the questions the humans were getting wrong. If each human averages 60% right, then for 2 humans to cover 100% between them, their wrong answers (40% each) can’t overlap at all, which leaves only 20% of questions that both get right! I think in practice the panels that score 100% have to contain many more than 2 humans on average.
EDIT: looks like “at least 2 humans” means that every problem in the set was solved by at least 2 of the 400 humans who attempted them!
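That reading is easy to sanity-check with a toy simulation. Here’s a minimal sketch in Python, assuming (unrealistically) that all 400 testers attempt every task and each solves any given task independently with probability 0.6; the 400-tester figure is from the study, while the 120-task set size and the independence assumption are simplifications of mine:

```python
import random

random.seed(0)

N_TESTERS = 400   # testers in the human calibration study
N_TASKS = 120     # assumed size of the evaluation set
P_SOLVE = 0.6     # assumed per-task solve probability (the individual average)

# For each task, count how many simulated testers solve it.
solvers_per_task = [
    sum(random.random() < P_SOLVE for _ in range(N_TESTERS))
    for _ in range(N_TASKS)
]

# "Human panel" criterion: the task was solved by at least 2 testers.
covered = sum(count >= 2 for count in solvers_per_task)
print(f"{covered}/{N_TASKS} tasks solved by at least 2 of {N_TESTERS} testers")
# Prints 120/120: under these assumptions, the probability that fewer than
# 2 of 400 people solve a given task is astronomically small.
```

Real solves obviously aren’t independent (tasks vary in difficulty and testers had limited time), but the point stands: “solved by at least 2 of 400 people” is a far lower bar than “solved by the average individual”.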
Just thinking: surely to be fair, we should be aggregating all the AI results into an “AI panel”? I wonder how much overlap there is between wrong answers amongst the AIs, and what the aggregate score would be?
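Computing that aggregate would be trivial if per-task results were published. A toy sketch of the union calculation, using invented correctness vectors rather than real leaderboard data:

```python
# Hypothetical per-task results (True = solved); these vectors are
# invented for illustration, not taken from any real leaderboard.
model_results = {
    "model_a": [True, False, True, False, True],
    "model_b": [False, True, True, False, False],
    "model_c": [True, False, False, False, True],
}

n_tasks = len(next(iter(model_results.values())))

# "AI panel" score: a task counts if any model solved it (the union of
# correct answers), mirroring how the human-panel figure pools people.
panel_score = sum(
    any(results[i] for results in model_results.values())
    for i in range(n_tasks)
) / n_tasks

best_single = max(sum(r) / n_tasks for r in model_results.values())
print(f"Best single model: {best_single:.0%}; AI panel: {panel_score:.0%}")
```

How much the panel beats the best single model depends entirely on how correlated the models’ errors are: fully overlapping wrong answers would leave the panel score equal to the best individual score, while uncorrelated errors could push it much higher.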
Right now, as things stand with the scoring, “AGI” in ARC-AGI-2 means “equivalent to the combined performance of a team of 400 humans”, not “(average) human level”.
ARC-AGI-2 is not a test of whether a system is an AGI or not. Getting 100% on ARC-AGI-2 would not imply a system is AGI. I guess the name is potentially misleading in that regard. But Chollet et al. are very clear about this.
The arxiv.org pre-print explains how the human testing worked. See the section “Human-facing calibration testing” on page 5. The human testers only had a maximum of 90 minutes:
> Participants completed a short survey and interface tutorial prior to being assigned tasks. Participants received a base compensation of $115-150 for participation in a 90-minute test session, plus a $5 incentive reward per correctly completed task. Three testing sessions were held between November 2024 and May 2025.
The median time spent attempting or solving each task was around 2 minutes:
> The median time spent on attempted test pairs was 2.3 minutes, while successfully completed tasks required a median of 2.2 minutes (Figure 3).
I’m still not entirely sure how the human test process worked from the description in the pre-print, but maybe rather than giving up and walking away, testers gave up on individual tasks in order to solve as many as possible in their allotted time.
I think you’re probably right about how they’re defining “human panel”, but I wish this were more clearly explained in the pre-print, on the website, or in the presentations they’ve done.
I can’t respond to your comments in the other thread because of the downvoting, so I’ll reply here:
1) Metaculus and Manifold have a huge overlap with the EA community (I’m not familiar with Kalshi) and, outside the EA community, people who are interested in AGI often far too easily accept the same sort of extremely flawed stuff that presents itself as way more serious and scientific than it really is (e.g. AI 2027, Situational Awareness, Yudkowsky/MIRI’s stuff).
2) I think it’s very difficult to know if one is engaging in motivated reasoning, or what other psychological biases are in play. People engage in wishful thinking to avoid unpleasant realities or possibilities, but people also invent unpleasant realities/possibilities, including various scenarios around civilizational collapse or the end of the world (e.g. a lot of doomsday preppers seem to believe in profoundly implausible, pseudoscientific, or fringe religious doomsday scenarios). People seem to be biased toward believing both pleasant and unpleasant things. (There is also something psychologically grabbing about believing that one belongs to an elite few who possess esoteric knowledge about cosmic destiny and may play a special role in determining the fate of the world.)
My explicit, conscious reasoning is complex and can’t be summarized in one sentence (see the posts on my profile for the long version), but it’s less along the lines of ‘I don’t want to believe unpleasant things’ and more along the lines of: a lot of people preaching AGI doom lack expertise in AI, have a bad track record of beliefs/predictions on AI and/or other topics, say a lot of suspiciously unfalsifiable and millennialist things, and don’t have clear, compelling answers to objections that have been publicly raised, some for years now.
N=1, but I looked at an ARC puzzle (https://arcprize.org/play?task=e3721c99), and I couldn’t just do it in a few minutes, and I have a PhD from the University of Oxford. I don’t doubt that most of the puzzles are trivial for some humans, that some of the puzzles are trivial for most humans, or that I could probably outscore any AI across the whole ARC-AGI-2 data set. But at the same time, I am a general intelligence, so being able to solve all ARC puzzles doesn’t seem like a necessary criterion. Maybe this is the opposite of how doing well on benchmarks doesn’t always generalize to real world tasks: I’m just dumb at these but smart overall, and the same could be true for an LLM.
Ah, okay, that is tricky! I totally missed one of the rules that the examples are telling us about. Once you see it, it seems simple and obvious, but it’s easy to miss. If you want to see the solution, it’s here.
I believe all ARC-AGI-2 puzzles contain (at least?) two different rules that you have to combine. I forgot about that part! I was trying to solve the puzzle as if there was just one rule to figure out.
I tried the next puzzle and was able to solve it right away, on the first try, keeping in mind the ‘two rules’ thing. These puzzles are actually pretty fun; I might do more.