This is somewhat disingenuous. o3-mini (high) is actually on 1.5%, and none of the other models are reasoning (CoT / RL / long inference time) models (oh, and GPT 4.5 is actually on 0.8%). The actual leaderboard looks like this:
Yes the scores are still very low, but it could just be a case of the models not yet “grokking” such puzzles. In a generation or two they might just grok them and then jump up to very high scores (many benchmarks have gone like this in the past few years).
I was not being disingenuous and I find your use of the word “disingenuous” here to be unnecessarily hostile.
I was going off of the numbers in the recent blog post from March 24, 2025. The numbers I stated were accurate as of the blog post.
So that we don’t miss the bigger point, I want to reiterate that ARC-AGI-2 is designed to be solved by near-term, sub-AGI AI models with some innovation on the status quo, not to stump them forever. This is François Chollet describing the previous version of the benchmark, ARC-AGI, in a post on Bluesky from January 6, 2025:
I don’t think people really appreciate how simple ARC-AGI-1 was, and what solving it really means.
It was designed as the simplest, most basic assessment of fluid intelligence possible. Failure to pass signifies a near-total inability to adapt or problem-solve in unfamiliar situations.
Passing it means your system exhibits non-zero fluid intelligence—you’re finally looking at something that isn’t pure memorized skill. But it says rather little about how intelligent your system is, or how close to human intelligence it is.
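To make that concrete, here is roughly what a task in this family looks like. Each task is a handful of input/output grid pairs; the solver has to infer the transformation from the training pairs and apply it to a held-out test input. The toy puzzle below (mirror each row) is my own invention purely to illustrate the format of the public task files, not an actual benchmark item:

```
# A toy task in the ARC format (grids of integers 0-9, train/test pairs).
# The puzzle itself ("mirror each row left-to-right") is made up for
# illustration; real tasks follow this structure but are more varied.
task = {
    "train": [
        {"input": [[1, 0, 0], [0, 2, 0]], "output": [[0, 0, 1], [0, 2, 0]]},
        {"input": [[3, 3, 0], [0, 0, 5]], "output": [[0, 3, 3], [5, 0, 0]]},
    ],
    "test": [
        {"input": [[7, 0, 0], [0, 0, 4]], "output": [[0, 0, 7], [4, 0, 0]]},
    ],
}

def solve(grid):
    # The rule a human infers from the train pairs: mirror each row.
    return [list(reversed(row)) for row in grid]

# Scoring is exact match on the held-out test output.
assert all(solve(p["input"]) == p["output"] for p in task["test"])
```

A child can spot the rule in seconds; the whole point is that the rule is not written down anywhere and has to be induced from two or three examples.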
To reiterate, ARC-AGI and ARC-AGI-2 are not tests of AGI. They are tests of whether a small, incremental amount of progress toward AGI has occurred. The idea is for ARC-AGI-2 to be solved, hopefully within the next few years and not, like, ten years from now, and then to move on to ARC-AGI-3 or whatever the next benchmark will be called.
Also, ARC-AGI was not a perfectly designed benchmark (for example, Chollet said about half the tasks turned out to be flawed in a way that made them susceptible to “brute-force program search”) and ARC-AGI-2 is not a perfectly designed benchmark, either.
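For anyone unfamiliar with the phrase, “brute-force program search” in this context means enumerating short compositions of hand-written grid operations until one reproduces all of a task’s training pairs, then applying it to the test input. The sketch below is my own simplification with made-up primitives, not the code from any actual competition entry, but it shows the idea:

```
from itertools import product

# Toy grid primitives; actual search-based entries used much larger
# hand-crafted sets of operations, but the idea is the same.
def identity(g): return [row[:] for row in g]
def flip_h(g): return [list(reversed(row)) for row in g]
def flip_v(g): return [row[:] for row in reversed(g)]
def transpose(g): return [list(col) for col in zip(*g)]

PRIMITIVES = [identity, flip_h, flip_v, transpose]

def brute_force_search(train_pairs, max_depth=3):
    """Return the first composition of primitives (up to max_depth ops)
    that maps every training input to its training output."""
    for depth in range(1, max_depth + 1):
        for ops in product(PRIMITIVES, repeat=depth):
            def program(grid, ops=ops):
                for op in ops:
                    grid = op(grid)
                return grid
            if all(program(p["input"]) == p["output"] for p in train_pairs):
                return program
    return None
```

As I understand it, the flawed ARC-AGI-1 tasks were ones where a search like this, over a large enough hand-crafted set of primitives, stumbles onto the answer without anything resembling on-the-fly reasoning.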
ARC-AGI-2 is worth talking about because most, if not all, of the commonly used AI benchmarks have very little usefulness for quantifying general intelligence or quantifying AGI progress. It’s the problem of bad operationalization leading to distorted conclusions, as I discussed in my previous comment.
I don’t know of other attempts to benchmark general intelligence (or “fluid intelligence”) or AGI progress with the same level of carefulness and thoughtfulness as ARC-AGI-2. I would love to hear if there are more benchmarks like this.
One suggestion I’ve read is that a benchmark should be created with a greater diversity of tasks, since all of ARC-AGI-2’s tasks are part of the same “puzzle game” (my words).
There’s a connection between frontier AI models’ failures on a relatively simple “puzzle game” like ARC-AGI-2 and why we don’t see AI models showing up in productivity statistics or real per capita GDP growth, or taking over jobs. When people try to use AI models for practical tasks in the real world, their usefulness is quite constrained.
I understand the theory that AI will have a super fast takeoff, so that even though it isn’t very capable now, it will match and surpass human capabilities within 5 years. But this kind of theory is consistent with pretty much any level of AI performance in the present. People can and did make this argument before ChatGPT, before AlphaGo, even before AlexNet. Ray Kurzweil has been saying this since at least the 1990s.
It’s important to have good, constrained, scientific benchmarks like ARC-AGI-2 and hopefully some people will develop another one, maybe with more task diversity. Other good “benchmarks” are economic and financial data around employment, productivity, and economic growth. Can AI actually do useful things that generate profit for users and that displace human labour?
This is a nuanced question, since there are models like AlphaFold (and AlphaFold 2 and 3) that can, at least in theory, improve scientific productivity, but which are narrow in scope and do not exhibit general intelligence or fluid intelligence. You have to frame the question carefully, in a way that actually tests what you want to test.
For example, using LLMs as online support chatbots, where humans are already usually following scripts and flow charts, and for which conventional “Software 1.0” was largely already adequate, is somewhat cool and impressive, but doesn’t feel like a good test of general intelligence. A much better sign of AGI progress would be if LLM-based models were able to replace human labour in multiple sorts of jobs where it is impossible to provide precise, step-by-step written instructions.
To frame the question properly would require thought, time, and research.
I think Chollet has shifted the goal posts a bit from when he first developed ARC [ARC-AGI 1]. In his original paper from 2019, Chollet says:
“We argue that ARC [ARC-AGI 1] can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans.”
And the original announcement (from June 2024) says:
“A solution to ARC-AGI [1], at a minimum, opens up a completely new programming paradigm where programs can perfectly and reliably generalize from an arbitrary set of priors. We also believe a solution is on the critical path towards AGI.”
(And ARC-AGI 1 has now basically been solved.) You say:
I understand the theory that AI will have a super fast takeoff, so that even though it isn’t very capable now, it will match and surpass human capabilities within 5 years. But this kind of theory is consistent with pretty much any level of AI performance in the present.
But we are seeing a continued rapid improvement in A(G)I capabilities, not least along the trajectory to automating AGI development, as per the METR report Ben West mentions.
In his interview with Dwarkesh Patel in June 2024 to talk about the launch of the ARC Prize, Chollet emphasized how easy the ARC-AGI tasks were for humans, saying that even children could do them. This is not something he is saying only now, in retrospect, after the ARC-AGI tasks have mostly been solved.
That first quote, from the 2019 paper, is consistent with Chollet’s January 2025 Bluesky post. That second quote is not from Chollet, but from Mike Knoop. I don’t know what the first sentence is supposed to mean, but the second sentence is also consistent with the Bluesky post.
In response to the graph… Just showing a graph going up does not amount to a “trajectory to automating AGI development”. The kinds of tasks AI systems can do today are very limited in their applicability to AGI research and development, and that has only changed modestly between ChatGPT’s release in November 2022 and today.
In 2018, you could have shown a graph of Go performance increasing from 2015 to 2017, and that would not have been evidence of a trajectory toward automating AGI development either. Nor would AlphaZero tripling the number of games a single AI system can master, from Go alone to Go, chess, and shogi. Measuring improved performance on tasks only provides evidence of AGI progress if the tasks you are measuring test for general intelligence.
I was not being disingenuous and I find your use of the word “disingenuous” here to be unnecessarily hostile.
I was going off of the numbers in the recent blog post from March 24, 2025. The numbers I stated were accurate as of the blog post.
GPT-2 is not mentioned in the blog post. Nor is GPT-3. Or GPT-3.5. Or GPT-4. Or even GPT-4o! You are writing 0.0% a lot for effect. In the actual blog post, there are only two 0.0% entries, for “gpt-4.5 (Pure LLM)” and “o3-mini-high (Single CoT)”; and note the limitations in parentheses, which you also neglect to include in your list (presumably for effect? Given their non-zero scores when not limited in such ways.)
It seems like you are really zeroing in on nitpicky details that make barely any difference to the substance of what I said, in order to accuse me of being intentionally deceptive. This is not cool behaviour.
I am curious to see what will happen in 5 years when there is no AGI. How will people react? Will they just kick their timelines 5 years down the road and repeat the cycle? Will some people attempt to resolve the discomfort by defining AGI as whatever exists in 5 years? Will some people be disillusioned and furious?
I hope that some people engage in soul-searching about why they believed AGI was imminent when it wasn’t. And near the top of the list of reasons, I believe, will be intolerance of disagreement about AGI and hostility to criticism of short AGI timelines.
I don’t think it’s nitpicky at all. A trend showing small, increasing numbers, just above 0, is very different (qualitatively) to a trend that is all flat 0s, as Ben West points out.
I am curious to see what will happen in 5 years when there is no AGI.
If this happens, we will at least know a lot more about how AGI works (or doesn’t). I’ll be happy to admit I’m wrong (I mean, I’ll be happy to still be around, for a start[1]).
I think the most likely reason we won’t have AGI in 5 years is that there will be a global moratorium on further development. This is what I’m pushing for.
A trend showing small, increasing numbers, just above 0, is very different (qualitatively) to a trend that is all flat 0s
Then it’s a good thing I didn’t claim there was “a trend that is all flat 0s” in the comment you called “disingenuous”. I said:
It’s only with the o3-low and o1-pro models we see scores above 0% — but still below 5%. Getting above 0% on ARC-AGI-2 is an interesting result and getting much higher scores on the previous version of the benchmark, ARC-AGI, is an interesting result. There’s a nuanced discussion to be had about that topic.
This feels like such a small detail to focus on. It feels ridiculous.