Thank you for writing this criticism! I did give it a read, and I shared some of your concerns around the framing and geopolitical stance that the piece takes.
Regarding the OOM issue, you ask:
Order of magnitude of what? Compute? Effective compute? Capabilities?
I’ll excerpt the following from the “count the OOMs” section of the essay:
We can decompose the progress in the four years from GPT-2 to GPT-4 into three categories of scaleups:
Compute: We’re using much bigger computers to train these models.
Algorithmic efficiencies: There’s a continuous trend of algorithmic progress. Many of these act as “compute multipliers,” and we can put them on a unified scale of growing effective compute.
“Unhobbling” gains: By default, models learn a lot of amazing raw capabilities, but they are hobbled in all sorts of dumb ways, limiting their practical value. With simple algorithmic improvements like reinforcement learning from human feedback (RLHF), chain-of-thought (CoT), tools, and scaffolding, we can unlock significant latent capabilities.
We can “count the OOMs” of improvement along these axes: that is, trace the scaleup for each in units of effective compute. 3x is 0.5 OOMs; 10x is 1 OOM; 30x is 1.5 OOMs; 100x is 2 OOMs; and so on. We can also look at what we should expect on top of GPT-4, from 2023 to 2027.
It’s clear to me what Aschenbrenner is referring to when he says “OOMs” — it’s orders-of-magnitude scaleups in the three things he mentions here: compute (measured in training FLOP), algorithmic efficiencies (measured by the fraction of training FLOP needed to reach comparable capabilities after the algorithmic improvements), and unhobbling (estimated as the scaleup in training FLOP that would have yielded an equivalent performance improvement). I’ll grant you, as does he, that unhobbling is hand-wavy and hard to measure (although that by no means implies it isn’t real).
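For what it’s worth, the bookkeeping itself is straightforward: an OOM is just log10 of a multiplicative scaleup, and since all three axes are expressed in units of effective compute, on this accounting (if I’m reading him right) their OOMs simply add. Here is a minimal sketch of that arithmetic; the scaleup factors are made-up placeholders for illustration, not estimates from the essay.

```python
import math

def ooms(scaleup: float) -> float:
    """Orders of magnitude corresponding to a multiplicative scaleup (e.g. 30x -> ~1.5)."""
    return math.log10(scaleup)

# Placeholder numbers for illustration only -- not Aschenbrenner's estimates.
compute_scaleup = 1000       # 1,000x more raw training FLOP                   -> 3 OOMs
algorithmic_scaleup = 100    # same capability reachable with 1/100th the FLOP -> 2 OOMs
unhobbling_equivalent = 10   # gains judged equivalent to a 10x FLOP scaleup   -> 1 OOM

total = ooms(compute_scaleup) + ooms(algorithmic_scaleup) + ooms(unhobbling_equivalent)
print(f"~{total:.1f} OOMs of effective compute")  # ~6.0
```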
You could still take issue with other questions — as you do — including how strong the relationship is between compute and capabilities, or how well we can measure capabilities in the first place. But we can certainly measure floating point operations! So accusing him of using “OOMs” as a unit, and one that is unmeasurable/detached from reality, surprises me.
Also, speaking of the “compute-capabilities relationship” point, you write:
The general argument seems to be that as the first two “OOMs” increase, i.e. as compute increases and algorithms improve, AI capabilities will also increase. Interestingly, most of the examples given are actually counterexamples to this argument.
This surprised me as well since I took the fact that capabilities have improved with model scaling to be pretty incontrovertible. You give an example:
There are two image generation examples (Sora and GANs). In both examples, the images become clearer and have higher resolution as compute is increased or better algorithms are developed. This is framed as evidence for the claim that capabilities increase as “OOMs” increase. But this is clearly not the case: only the fidelity of these narrow-AI systems increases, not their capabilities.
I think I might see where the divergence between our reactions is. To me, capabilities for an image model means roughly “the capability to generate a clear, high-quality image depicting the prompt.” As you admit, that has improved with scale. I think this definition probably best reflects common usage in the field, so I do think it supports his argument. And, I personally think that there are deeper capabilities being unlocked, too — for example, in the case of Sora, the capability of understanding (at least the practical implications of) object permanence and gravity and reflections. But I think others would be more inclined to disagree with that.
Yeah, with the word “capability” I meant completely new capabilities (in Aschenbrenner’s case, the relevant new capabilities would be general-intelligence abilities such as learning and planning), but I can see that, for example, object permanence could be called a new capability. Maybe I should have used a better word there. Basically, my argument is that while the image generators have become better at generating images, they haven’t gained anything that takes them closer to AGI.
I’ll grant you, as does he, that unhobbling is hand-wavy and hard to measure (although that by no means implies it isn’t real).
I’m not claiming that unhobbling isn’t real, and I think that the mentioned improvements, such as CoT and scaffolding, really do make models better. But do they make them exponentially better? Can we expect the increases to continue exponentially in the future? I’m going to say no. So I think it’s unsubstantiated to measure them in orders of magnitude.
But we can certainly measure floating point operations! So accusing him of using “OOMs” as a unit, and one that is unmeasurable/detached from reality, surprises me.
Most of the time, when he says “OOM”, he isn’t referring to FLOPs; he’s referring to the abstract OOMs that somehow encompass all three axes he mentioned. So while some of it is measurable, the whole is not.