Just thinking: surely to be fair, we should be aggregating all the AI results into an “AI panel”? I wonder how much overlap there is between wrong answers amongst the AIs, and what the aggregate score would be?
As the scoring currently stands, “AGI” in ARC-AGI-2 means “equivalent to the combined performance of a team of 400 humans”, not “(average) human level”.
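To make the aggregation idea concrete, here’s a minimal sketch in Python, assuming each model’s results are available as a set of solved task IDs. The model names, task IDs, and benchmark size are made-up placeholders for illustration, not real leaderboard data:

```python
# Sketch: pooling per-model results into an "AI panel" score, plus the
# pairwise overlap of wrong answers between models. All data is hypothetical.

from itertools import combinations

solved_by_model = {
    "model_a": {"t01", "t02", "t05"},
    "model_b": {"t02", "t03"},
    "model_c": {"t01", "t03", "t04"},
}
all_tasks = {f"t{i:02d}" for i in range(1, 11)}  # pretend benchmark of 10 tasks

# Panel-style scoring: a task counts as solved if ANY panel member solved it.
panel_solved = set().union(*solved_by_model.values())
print(f"AI panel score: {len(panel_solved) / len(all_tasks):.0%}")

# Overlap of wrong answers: tasks that BOTH models in a pair failed.
for (m1, s1), (m2, s2) in combinations(solved_by_model.items(), 2):
    shared_failures = (all_tasks - s1) & (all_tasks - s2)
    print(f"{m1} & {m2} both failed: {sorted(shared_failures)}")
```

Under that any-member-counts rule, the panel score is just the coverage of the union of solved tasks, and the pairwise failure overlap gives a rough sense of how correlated the models’ mistakes are.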
ARC-AGI-2 is not a test of whether a system is an AGI or not. Getting 100% on ARC-AGI-2 would not imply a system is AGI. I guess the name is potentially misleading in that regard. But Chollet et al. are very clear about this.
The arxiv.org pre-print explains how the human testing worked. See the section “Human-facing calibration testing” on page 5. The human testers only had a maximum of 90 minutes:
Participants completed a short survey and interface tutorial prior to being assigned tasks. Participants received a base compensation of $115-150 for participation in a 90-minute test session, plus a $5 incentive reward per correctly completed task. Three testing sessions were held between November 2024 and May 2025.
The median time spent attempting or solving each task was around 2 minutes:
The median time spent on attempted test pairs was 2.3 minutes, while successfully completed tasks required a median of 2.2 minutes (Figure 3).
I’m still not entirely sure how the human testing process worked from the description in the pre-print, but perhaps, rather than giving up and walking away entirely, testers gave up on individual tasks in order to solve as many as possible within their allotted 90 minutes.
I think you’re probably right about how they’re defining “human panel”, but I wish this were more clearly explained in the pre-print, on the website, or in the presentations they’ve done.
I can’t respond to your comments in the other thread because of the downvoting, so I’ll reply here:
1) Metaculus and Manifold have a huge overlap with the EA community (I’m not familiar with Kalshi) and, outside the EA community, people who are interested in AGI often far too easily accept the same sort of extremely flawed material that presents itself as far more serious and scientific than it really is (e.g. AI 2027, Situational Awareness, Yudkowsky/MIRI’s stuff).
2) I think it’s very difficult to know if one is engaging in motivated reasoning, or what other psychological biases are in play. People engage in wishful thinking to avoid unpleasant realities or possibilities, but people also invent unpleasant realities/possibilities, including various scenarios around civilizational collapse or the end of the world (e.g. a lot of doomsday preppers seem to believe in profoundly implausible, pseudoscientific, or fringe religious doomsday scenarios). People seem to be biased toward believing both pleasant and unpleasant things. (There is also something psychologically gripping about believing that one belongs to an elite few who possess esoteric knowledge about cosmic destiny and may play a special role in determining the fate of the world.)
My explicit, conscious reasoning is complex and can’t be summarized in one sentence (see the posts on my profile for the long version), but it’s less along the lines of ‘I don’t want to believe unpleasant things’ and more along the lines of: a lot of people preaching AGI doom lack expertise in AI, have a bad track record of beliefs/predictions on AI and/or other topics, say a lot of suspiciously unfalsifiable and millennialist things, and don’t have clear, compelling answers to objections that have been publicly raised, some for years now.