Why?

Because the ARC benchmark was specifically designed to be a test of general intelligence (do you disagree that it successfully achieves this?) and because each problem takes the form of requiring you to spot a pattern from only a couple of examples.
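To make that concrete, here is a toy sketch of the shape of an ARC-style task. The grids below are invented for illustration; the real tasks are distributed as JSON with the same train/test structure, typically a few demonstration input/output pairs plus one or more test inputs.

```python
# Toy illustration (not an actual ARC task): the solver must infer the
# transformation rule from a couple of demonstration pairs and apply it to
# the test input. Grids are small arrays of integers, one integer per colour.
toy_task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[1, 1], [1, 1]]},
        {"input": [[0, 3], [0, 0]], "output": [[3, 3], [3, 3]]},
    ],
    "test": [
        # Rule to infer: fill the grid with the single non-zero colour.
        {"input": [[0, 0], [7, 0]]}  # expected output (not shown): [[7, 7], [7, 7]]
    ],
}
```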
I was excited[1] about o3's performance on ARC-AGI-1 initially, but then I read these tweets from Toby Ord:
Finally, I want to note how preposterous the o3-high attempt was. It took 1,024 attempts at each task, writing about 137 pages of text for each attempt, or about 43 million words total. That's writing an Encyclopedia Britannica (44 million words) per task!
And costing about $30,000 for each task. For reference, these are simple puzzles that my 10-year-old child can solve in about 4 minutes. That's *something* but not how intelligence solves the puzzle.
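As a rough sanity check on those figures (my own back-of-the-envelope arithmetic, not Ord's; the words-per-page value is an assumption), the numbers are roughly self-consistent:

```python
# Back-of-the-envelope check of the quoted figures. words_per_page is an assumed
# typical prose density; the other two numbers come from the tweet above.
attempts_per_task = 1024
pages_per_attempt = 137
words_per_page = 300  # assumption

words_per_task = attempts_per_task * pages_per_attempt * words_per_page
print(f"{words_per_task / 1e6:.0f} million words per task")  # ~42 million, close to the quoted 43M
```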
This is how François Chollet, the creator of ARC-AGI-1, characterized o3's results:
I don't think people really appreciate how simple ARC-AGI-1 was, and what solving it really means.
It was designed as the simplest, most basic assessment of fluid intelligence possible. Failure to pass signifies a near-total inability to adapt or problem-solve in unfamiliar situations.
Passing it means your system exhibits non-zero fluid intelligence – you're finally looking at something that isn't pure memorized skill. But it says rather little about how intelligent your system is, or how close to human intelligence it is.
ARC-AGI-1 and ARC-AGI-2 are the most interesting LLM benchmarks (and I'm sure ARC-AGI-3 will be very interesting when it comes out). The results of o3 (and of other LLMs since) on these benchmarks are intriguing. However, now that I know more of the details, I have a hard time getting excited about LLMs' performance.
I'm not sure Chollet's "zero to one" framing makes sense unless he's just talking about LLMs – which I guess he probably is. It seems like there's been a mustard seed of generalization or fluid intelligence in deep learning-based and deep reinforcement learning-based systems for a long time. Maybe Go aficionados are just reading too much into stochastic behaviour, but a lot of people were impressed with some of AlphaGo's moves, and called them creative and surprising. That was back in 2016.
If you think AI had literally zero generalization or zero fluid intelligence before o3 and then o3 demonstrated a tiny amount, that's potentially very exciting. Chollet framing the results in this way is why I was initially excited about o3. But if you think AI has had a tiny amount of generalization or fluid intelligence for a long time and continues to have a tiny amount, the result is much less exciting – although it's still a fascinating case study to contemplate.

[1] I say excited and not scared because I think AI is a good thing and not risky.
I don't disagree with much of this comment (to the extent that it puts o3's achievement in its proper context), but I think this is still inconsistent with your original "no progress" claim (whether the progress happened pre or post o3's ARC performance isn't really relevant). I suppose your point is that the "seed of generalization" that LLMs contain is so insignificant that it can be rounded to zero for practical purposes? That was true pre o3 and is still true now? Is that a fair summary of your position? I still think "no progress" is too bold!
But in addition, I disagree with you that there is nothing exciting about o3's ARC performance.
It seems obvious that LLMs have always had some ability to generalize. Any time they produce a coherent response that has not appeared verbatim in their training data, they are doing some kind of generalization. And I think even Chollet has always acknowledged that too. I've heard him characterize LLMs (pre ARC success) as combining dense sampling of the problem space with an extremely weak ability to generalize, contrasting that with the ability of humans to learn from only a few examples. But there is still an acknowledgement here that some non-zero generalization is happening.
But if this is your model of how LLMs work, that their ability to generalize is extremely weak, then you don't expect them to be able to solve ARC problems. They shouldn't be able to solve ARC problems even if they had access to unlimited inference-time compute. OK, so o3 had 1,024 attempts at each task, but that doesn't mean it tried the task 1,024 times until it hit on the correct answer. That would be cheating. It means it tried the task 1,024 times and then did some statistics on all of its solutions before providing a single guess, which turned out to be right most of the time!
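A minimal sketch of what that "statistics on all of its solutions" step could look like, assuming the candidates are ARC output grids represented as nested lists and that the aggregation is a simple majority vote (the selection rule actually used in the o3 evaluation may have been more involved):

```python
from collections import Counter

def aggregate_answers(candidate_grids):
    """Pick the most frequently produced grid from many sampled attempts.

    candidate_grids: list of ARC-style output grids (lists of lists of ints),
    e.g. the 1,024 independently sampled solutions for one task.
    """
    # Nested lists aren't hashable, so key each grid by a tuple-of-tuples form.
    counts = Counter(tuple(map(tuple, grid)) for grid in candidate_grids)
    best_key, _ = counts.most_common(1)[0]
    return [list(row) for row in best_key]

# Toy usage: 3 of 4 sampled attempts agree, so their shared answer wins the vote.
samples = [[[1, 1], [1, 1]], [[1, 1], [1, 1]], [[1, 1], [1, 1]], [[0, 1], [1, 1]]]
print(aggregate_answers(samples))  # -> [[1, 1], [1, 1]]
```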
I think it is surprising and impressive that this worked! This wouldn't have worked with GPT-3. You could have given it chain-of-thought prompting, let it write as much as it wanted per attempt, and given it a trillion attempts at each problem, but I still don't think you would expect to find the correct answer dropping out at the end. In at least this sense, o3 was a genuine improvement in generalization ability.
And Chollet thought it was impressive too, describing it as a "genuine breakthrough", despite all the caveats that go with that (that you've already quoted).
When LLMs can solve a task, but only with masses of training data, then I think it is fair to contrast their data efficiency with that of humans and write off their intelligence as memorization rather than generalization. But when they can only solve a task by expending masses of inference-time compute, I think it is harder to write that off in the same way. Mainly because we don't really know how much inference-time compute humans are using! (I don't think we do, unless we understand the brain a lot better than I thought we did.) I wouldn't be surprised at all if we find that AGI requires spending a lot of inference-time compute. I don't think that would make it any less AGI.
The extreme inference-time compute costs are really important context to bear in mind when forecasting how AI progress is going to go, and what kinds of things are going to be possible. But I don't think they provide a reason to describe the intelligence as not "general", in the way that extreme data inefficiency does.
All deep learning systems since 2012 have had some extremely limited generalization ability. If you show AlexNet a picture of an object from a class it was trained on, but with some novel differences – maybe it's black-and-white, or upside-down, or the dog in the photo is wearing a party hat – it will still do much better than chance at classifying the image. In an extremely limited sense, that is generalization.
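This is the sort of claim that is easy to poke at directly. Below is a minimal sketch, not from the original discussion, of how one might probe it: run an ImageNet-pretrained AlexNet on an image and on a couple of simple perturbations, then compare the top-1 predictions. It assumes a recent torchvision and a local file "dog.jpg" (a placeholder name) showing an object from a trained-on class.

```python
# Probe how well a pretrained AlexNet's top-1 prediction survives simple
# perturbations (grayscale, upside-down) of the same input image.
import torch
from PIL import Image
from torchvision import models, transforms

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("dog.jpg").convert("RGB")  # placeholder filename
variants = {
    "original": image,
    "grayscale": transforms.Grayscale(num_output_channels=3)(image),
    "upside-down": image.rotate(180),
}

with torch.no_grad():
    for name, img in variants.items():
        logits = model(preprocess(img).unsqueeze(0))
        print(name, int(logits.argmax(dim=1)))  # top-1 ImageNet class index
```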
I'm not sure I can agree with Chollet's "zero to one" characterization of o3. To be clear, he's saying it's zero to one for fluid intelligence, not generalization – a related and similar concept that Chollet defines a bit differently. Still, I'm not sure I can agree it's zero to one with regard to either generalization or fluid intelligence. And I'm not sure I can agree it's zero to one even just for LLMs. It depends how strict you are about the definitions, how exact you are trying to be, and what, substantively, you're trying to say.
I think many results in AI are incredibly impressive considered from the perspective of science and technology – everything from AlexNet to AlphaGo to ChatGPT. But this is a separate question from whether they are closing the gap between AI and general intelligence in any meaningful way. My assessment is that they're not. I think o3's performance on ARC-AGI-1 and now ARC-AGI-2 is a really cool science project, but it doesn't feel like fundamental research progress on AGI (except in a really limited, narrow sense in which a lot of things would count as that).
AI systems can improve data efficiency, generalization, and other performance characteristics in incremental ways over previous systems, and this can still be true. The best image classifiers today get better top-1 performance on ImageNet than the best image classifiers of ten years ago, and in that sense they are more data efficient. But it's still true that the image classifiers of 2025 are no closer than the image classifiers of 2015 to emulating the proficiency with which humans or other mammals see.
The old analogy – I think Douglas Hofstadter said this – is that if you climb a tree, you are closer to the Moon than your friend on the ground, but no closer to actually getting to the Moon than they are.
In some very technical sense, essentially any improvement to AI in any domain could be considered an improvement to data efficiency, generalization, and reliability. If AI is able to do a new kind of task it wasn't able to do before, its performance along all three characteristics has increased from zero to something. If it was already able but now it's better at it, then its performance has increased from something to something more. But this is such a technicality that it misses the substantive point.