I don’t disagree with much of this comment (to the extent that it puts o3’s achievement in its proper context), but I think this is still inconsistent with your original “no progress” claim (whether the progress happened before or after o3’s ARC performance isn’t really relevant). I suppose your point is that the “seed of generalization” that LLMs contain is so insignificant that it can be rounded to zero for practical purposes? That was true pre-o3 and is still true now? Is that a fair summary of your position? I still think “no progress” is too bold!
But in addition, I also disagree with you that there is nothing exciting about o3’s ARC performance.
It seems obvious that LLMs have always had some ability to generalize. Any time they produce a coherent response that has not appeared verbatim in their training data, they are doing some kind of generalization. And I think even Chollet has acknowledged that. I’ve heard him characterize LLMs (pre-ARC success) as combining dense sampling of the problem space with an extremely weak ability to generalize, contrasting that with the ability of humans to learn from only a few examples. But there is still an acknowledgement there that some non-zero generalization is happening.
But if this is your model of how LLMs work, that their ability to generalize is extremely weak, then you don’t expect them to be able to solve ARC problems. They shouldn’t be able to solve ARC problems even with access to unlimited inference-time compute. OK, so o3 had 1,024 attempts at each task, but that doesn’t mean it tried the task 1,024 times until it hit on the correct answer. That would be cheating. It means it tried the task 1,024 times and then did some statistics on all of its candidate solutions before providing a single guess, which turned out to be right most of the time!
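To make that distinction concrete, here is a minimal sketch of the “sample many, then aggregate” setup I have in mind. The exact aggregation o3 used hasn’t been published as far as I know, so I’m assuming something like majority voting, and `model.sample_solution` is a made-up stand-in for one full attempt at a task:

```python
from collections import Counter

def solve_with_aggregation(model, task, n_samples=1024):
    """Sample many independent attempts at a task, then submit one consensus guess.

    `model.sample_solution(task)` is a hypothetical stand-in for a single full
    attempt that returns a candidate output grid (a list of lists of ints).
    """
    candidates = [model.sample_solution(task) for _ in range(n_samples)]
    # Make grids hashable so identical candidates can be counted together.
    counts = Counter(tuple(map(tuple, grid)) for grid in candidates)
    most_common_grid, _ = counts.most_common(1)[0]
    # Only one final answer is submitted, not 1,024 separate guesses.
    return [list(row) for row in most_common_grid]
```

The point being: this kind of aggregation only helps if the model’s individual attempts land on the right answer more often than on any particular wrong one; it can’t conjure a correct answer out of uniformly wrong samples.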
I think it is surprising and impressive that this worked! It wouldn’t have worked with GPT-3. You could have given it chain-of-thought prompting, let it write as much as it wanted per attempt, and given it a trillion attempts at each problem, but I still don’t think the correct answer would have dropped out at the end. In at least this sense, o3 was a genuine improvement in generalization ability.
And Chollet thought it was impressive too, describing it as a “genuine breakthrough”, despite all the caveats that go along with that (which you’ve already quoted).
When LLMs can solve a task, but only with masses of training data, I think it is fair to contrast their data efficiency with that of humans and write off their intelligence as memorization rather than generalization. But when they can only solve a task by expending masses of inference-time compute, I think it is harder to write that off in the same way. Mainly because we don’t really know how much inference-time compute humans are using! (I don’t think we do, anyway, unless we understand the brain a lot better than I thought we did.) I wouldn’t be surprised at all if we find that AGI requires spending a lot of inference-time compute. I don’t think that would make it any less AGI.
The extreme inference-time compute costs are really important context to bear in mind when forecasting how AI progress is going to go, and what kinds of things are going to be possible. But I don’t think they provide a reason to describe the intelligence as not “general”, in the way that extreme data inefficiency does.
There is a big difference between veganism and most(?) other boycott campaigns. Every time you purchase an animal product, you are causing significant direct harm (in expectation, if you accept the vegan argument). This is because if demand for animal products increases by one unit, we should expect some fraction of a unit more of that product to be produced to meet that demand, on average (the particular fraction depending on price elasticities, since your purchase also raises prices a little, which puts some other consumers off).
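For concreteness, here is the rough expected-value calculation behind that claim, as I understand it. It uses the standard back-of-the-envelope factor e_s / (e_s + |e_d|) for how much equilibrium production shifts when demand shifts; the elasticity numbers below are purely illustrative, not estimates for any real product:

```python
def expected_extra_production(units_bought, supply_elasticity, demand_elasticity):
    """Expected additional units produced when a consumer buys `units_bought` more units.

    Standard partial-equilibrium approximation: the market absorbs part of the extra
    demand through a price rise that deters other buyers, so production rises by
    only e_s / (e_s + |e_d|) per extra unit demanded.
    """
    factor = supply_elasticity / (supply_elasticity + abs(demand_elasticity))
    return units_bought * factor

# Illustrative numbers only: with supply elasticity 1.0 and demand elasticity -0.7,
# buying 10 units is expected to cause ~5.9 extra units to be produced.
print(expected_extra_production(10, supply_elasticity=1.0, demand_elasticity=-0.7))
```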
A lot of other boycott campaigns aren’t like this. For example, take the boycott of products which have been tested on animals. Here you don’t do direct harm with each purchase in the same way (or at least if you do, it is probably orders of magnitude less). Instead, the motivation is that if enough people start acting like this, it will lead to policy change.
In the first case, it doesn’t matter if no one else in the world agrees with you: participating in the boycott can still do significant good. In the second case, a large number of people are required for the boycott to have meaningful impact. It makes sense that impact-minded EAs are more inclined to support boycotts of the first kind.
I think a lot of your examples probably fall under the second kind (though not all). And I think that’s a big part of the answer to your question. Also, for at least some of the ones in the first kind, I think most EAs probably just disagree with the fundamental argument. For example, the environmental impact of using LLMs isn’t actually that bad: https://andymasley.substack.com/p/a-cheat-sheet-for-conversations-about.