I don't disagree with much of this comment (to the extent that it puts o3's achievement in its proper context), but I think this is still inconsistent with your original "no progress" claim (whether the progress happened pre or post o3's ARC performance isn't really relevant). I suppose your point is that the "seed of generalization" that LLMs contain is so insignificant that it can be rounded to zero for practical purposes? That was true pre-o3 and is still true now? Is that a fair summary of your position? I still think "no progress" is too bold!
But I also disagree with you that there is nothing exciting about o3's ARC performance.
It seems obvious that LLMs have always had some ability to generalize. Any time they produce a coherent response that has not appeared verbatim in their training data, they are doing some kind of generalization. And I think even Chollet has always acknowledged that. I've heard him characterize LLMs (pre ARC success) as combining dense sampling of the problem space with an extremely weak ability to generalize, contrasting that with the ability of humans to learn from only a few examples. But there is still an acknowledgement here that some non-zero generalization is happening.
But if this is your model of how LLMs work, that their ability to generalize is extremely weak, then you don't expect them to be able to solve ARC problems. They shouldn't be able to solve ARC problems even if they had access to unlimited inference-time compute. OK, so o3 had 1,024 attempts at each task, but that doesn't mean it tried the task 1,024 times until it hit on the correct answer. That would be cheating. It means it tried the task 1,024 times and then did some statistics on all of its solutions before providing a single guess, which turned out to be right most of the time!
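To be concrete about what that kind of sample-then-aggregate step looks like: the details of o3's actual aggregation aren't public, so this is only an illustrative sketch of the general recipe (sample many candidate answers, then pick one by consensus, never checking any candidate against the ground truth), with made-up toy grids:

```python
from collections import Counter

def majority_vote(candidate_grids):
    """Return the candidate answer that was sampled most often.

    candidate_grids: list of candidate output grids, each represented as a
    tuple of tuples of ints so it is hashable and can be counted.
    """
    counts = Counter(candidate_grids)
    best_grid, _ = counts.most_common(1)[0]
    return best_grid

# Toy illustration: five sampled attempts, three of which agree.
samples = [
    ((1, 0), (0, 1)),
    ((1, 0), (0, 1)),
    ((0, 0), (0, 1)),
    ((1, 0), (0, 1)),
    ((1, 1), (0, 0)),
]
print(majority_vote(samples))  # -> ((1, 0), (0, 1))
```

The key point is that no candidate is ever scored against the true answer; agreement among the samples is the only signal used to pick the final guess.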
I think it is surprising and impressive that this worked! This wouldn't have worked with GPT-3. You could have given it chain-of-thought prompting, let it write as much as it wanted per attempt, and given it a trillion attempts at each problem, but I still don't think the correct answer would have dropped out at the end. In at least this sense, o3 was a genuine improvement in generalization ability.
And Chollet thought it was impressive too, describing it as a "genuine breakthrough", despite all the caveats that go with that (which you've already quoted).
When LLMs can solve a task only with masses of training data, I think it is fair to contrast their data efficiency with that of humans and write off their intelligence as memorization rather than generalization. But when they can only solve a task by expending masses of inference-time compute, I think it is harder to write that off in the same way. Mainly because we don't really know how much inference-time compute humans are using! (I don't think we do, anyway, unless we understand the brain a lot better than I thought we did.) I wouldn't be surprised at all if it turns out that AGI requires spending a lot of inference-time compute. I don't think that would make it any less AGI.
The extreme inference-time compute costs are really important context to bear in mind when forecasting how AI progress will go and what kinds of things will be possible. But I don't think they provide a reason to describe the intelligence as not "general", in the way that extreme data inefficiency does.
All deep learning systems since 2012 have had some extremely limited generalization ability. If you show AlexNet a picture of an object in a class it was trained on, but with some novel differences, e.g. the image is black-and-white or upside-down, or the dog in the photo is wearing a party hat, it will still do much better than chance at classifying it. In an extremely limited sense, that is generalization.
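If you wanted to see this for yourself, a quick probe along these lines might look like the sketch below (assuming torchvision's pretrained AlexNet weights and a placeholder image path; this is an illustration, not a benchmark):

```python
# Sketch: probe how a pretrained classifier handles simple perturbations.
# Assumes torchvision is installed; "dog.jpg" is a placeholder path to an
# image from an ImageNet class.
import torch
from PIL import Image
from torchvision import models
from torchvision.models import AlexNet_Weights

weights = AlexNet_Weights.IMAGENET1K_V1
model = models.alexnet(weights=weights).eval()
preprocess = weights.transforms()  # standard resize/crop/normalize pipeline

image = Image.open("dog.jpg")  # placeholder path

perturbations = {
    "original": lambda im: im,
    "grayscale": lambda im: im.convert("L").convert("RGB"),
    "upside-down": lambda im: im.rotate(180),
}

with torch.no_grad():
    for name, fn in perturbations.items():
        x = preprocess(fn(image)).unsqueeze(0)
        probs = model(x).softmax(dim=1)
        top_prob, top_class = probs.max(dim=1)
        label = weights.meta["categories"][top_class.item()]
        print(f"{name}: {label} (p={top_prob.item():.2f})")
```

Typically the confidence drops under these shifts but the prediction stays far above chance, which is the weak sense of generalization meant here.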
I'm not sure I can agree with Chollet's "zero to one" characterization of o3. To be clear, he's saying it's zero to one for fluid intelligence, not generalization; the two are related, but Chollet defines fluid intelligence a bit differently from generalization. Still, I'm not sure I can agree it's zero to one with regard to either fluid intelligence or generalization. And I'm not sure I can agree it's zero to one even for LLMs. It depends how strict you are about the definitions, how exact you are trying to be, and what, substantively, you're trying to say.
I think many results in AI are incredibly impressive considered from the perspective of science and technology: everything from AlexNet to AlphaGo to ChatGPT. But this is a separate question from whether they are closing the gap between AI and general intelligence in any meaningful way. My assessment is that they're not. I think o3's performance on ARC-AGI-1 and now ARC-AGI-2 is a really cool science project, but it doesn't feel like fundamental research progress on AGI (except in a really limited, narrow sense in which a lot of things would count as that).
AI systems can improve data efficiency, generalization, and other performance characteristics in incremental ways over previous systems, and it can still be true that they're not closing that gap. The best image classifiers today get better top-1 accuracy on ImageNet than the best image classifiers ten years ago, so in that sense they are more data efficient. But it's still true that the image classifiers of 2025 are no closer than the image classifiers of 2015 to emulating the proficiency with which humans or other mammals see.
The old analogy (I think Douglas Hofstadter said this) is that if you climb a tree, you are closer to the Moon than your friend on the ground, but no closer to actually getting to the Moon than they are.
In some very technical sense, essentially any improvement to AI in any domain could be counted as an improvement in data efficiency, generalization, and reliability. If AI is able to do a new kind of task it wasn't able to do before, its performance along all three characteristics has increased from zero to something. If it was already able to do it but is now better at it, then its performance has increased from something to something more. But this is a technicality and misses the substantive point.