Thanks for writing this, James. I think you raise good points, and for the most part I think you’re accurately representing what we say in the piece. Some scattered thoughts:
I agree that you can’t blithely assume scaling trends will continue, or that scaling “laws” will continue to hold. That said, I don’t think either assumption is entirely unmotivated, because the trends have already spanned many orders of magnitude. E.g. pretraining compute has grown by roughly 8 OOMs of FLOP (a factor of about 1e8) since the introduction of the transformer, and scaling “laws” for given measures of performance hold up across almost as many OOMs, both experimentally and in practice. Presumably the trends have to break down eventually, but if all you knew were the trends so far, it would seem reasonable to spread your guesses over many future OOMs.
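To make concrete what “holding up across OOMs” means here, a minimal sketch using a hypothetical Chinchilla-style power law; the coefficients below are made up for illustration and are not any published fit:

```python
# Illustration only: a hypothetical power-law loss curve of the form
# L(C) = E + a * C^(-alpha). E, a and alpha are assumed values, not a real fit.
# The point is just that a single fitted curve of this shape extrapolates
# smoothly across many orders of magnitude of pretraining compute.
def pretraining_loss(compute_flop: float, E: float = 1.7, a: float = 1070.0, alpha: float = 0.15) -> float:
    return E + a * compute_flop ** (-alpha)

for oom in range(19, 27):  # roughly 1e19 FLOP (early transformers) up to 1e26 FLOP
    print(f"1e{oom} FLOP -> loss ~ {pretraining_loss(10.0 ** oom):.2f}")
```

Whether any particular fit keeps holding several OOMs out of distribution is, of course, exactly the thing in question.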
Of course we know more than the big-picture trends. As you say, GPT-4.5 seemed pretty disappointing, on benchmarks and just qualitatively. People at labs are making noises about pretraining scaling as we know it slowing down. I do think short timelines[1] look less likely than they did 6 months or a year ago. But progress is often made by a series of overlapping S-curves, and future gains could come from elsewhere, like inference scaling, though we don’t give specific reasons in the piece to expect that.
On the idea of a “software-only intelligence explosion”, this piece by Daniel Eth and Tom Davidson gives more detailed considerations which are specific to ML R&D. It’s not totally obvious to me why you should expect returns to diminish faster in broad vs narrow domains. In very narrow domains (e.g. checkers-playing programs), you might intuitively think that the ‘well’ of creative ways to improve performance is shallower, so further improvements beyond the most obvious are less useful. That said, I’m not strongly sold on the “software-only intelligence explosion” idea; I think it’s not at all guaranteed (maybe I’m 40% on it?) but worth having on the radar.
I also agree that benchmarks don’t (necessarily) generalise well to real-world tasks. In fact I think this is one of the cruxes between people who expect AI progress to quickly transform the world in various ways and people who don’t. Still, I do think the cognitive capabilities of frontier LLMs can meaningfully be described as “broad”, in the sense that if I had tried to think of a long list of tasks and questions spanning an intuitively broad range of “cognitive skills” before knowing about the skill distribution of LLMs,[2] then I would expect the best LLMs to perform well across the large majority of those tasks and questions.
But I agree there is a gap between impressive benchmark scores and real-world usefulness. Partly there’s a kind of O-ring dynamic, where the speed of progress is gated by the slowest-moving critical processes: speeding up tasks that normally take 20% of a researcher’s time by 1,000x won’t speed up their overall output by more than about 25% if the other tasks can’t be substituted. Partly there are specific competences which AI models don’t have, like reliably carrying out long-horizon tasks. Partly (and relatedly) there’s an issue of access to implicit knowledge and other bespoke training data (like examples of expert practitioners carrying out long time-horizon tasks), which are scarce on the public internet. And so on.
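As a sanity check on that 25% figure, here is a minimal Amdahl’s-law-style calculation; the 20% fraction and 1,000x factor are just the hypothetical numbers from the example above:

```python
# Amdahl's law: if only a fraction f of the work is accelerated by a factor s,
# the overall speedup is 1 / ((1 - f) + f / s).
def overall_speedup(f: float, s: float) -> float:
    return 1.0 / ((1.0 - f) + f / s)

print(overall_speedup(0.20, 1_000))       # ~1.25, i.e. about 25% faster overall
print(overall_speedup(0.20, 1_000_000))   # still ~1.25, however large s gets
```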
Note that (unless we badly misphrased something) we are not arguing for “AGI by 2030” in the piece you are discussing. As always, it depends on what “AGI” means, but if it means a world where tech progress is advancing across the board at 10x rates, or a time when there is basically no cognitive skill at which any human still outperforms some AI, then I also think that’s unlikely by 2030.
One thing I should emphasise, and maybe we should have emphasised more, is the possibility of faster (say 10x) growth rates from continued AI progress. I think the best accounts of why this could happen do not depend on AI becoming superhuman in some very comprehensive way, or on a software-only intelligence explosion happening. The main thing we were trying to argue is that sustained AI progress could significantly accelerate tech progress and growth, after the point where (if it ever happens) the contribution to research from AI reaches some kind of parity with that of humans. I do think this is plausible, though not overwhelmingly likely.
Thanks again for writing this.
[1] That is, “AGI by 2030” or some other fixed calendar date.
[2] Things like memorisation, reading comprehension, abstract reasoning, general knowledge, etc.