Thanks for the comments. The idea that pretraining has slowed/stalled is in the background in many posts in my series and it is unfortunate I didn’t write one where I addressed it head-on. I don’t disagree with Vladimir Nesov as much as you may think. Some of this is that the terms are slippery.
I think there are three things under discussion:
1. Scaling Laws. The empirical relationship between model size (or training data, or compute) and the log-loss on held-out tokens drawn from the same distribution as the training data (a typical functional form is sketched below).
2. Training Compute Increases. The annual increase in the amount of compute used to pretrain a frontier model.
3. Value of Scaling. The practical returns from each 10x increase in the amount of pre-training compute.
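To be concrete about what I mean in (1): these are relationships of the rough power-law form popularized by the Kaplan and Chinchilla scaling papers. The equation below is a sketch of that standard form, not a claim about any particular lab's fitted constants:

$$L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

where $N$ is parameter count, $D$ is training tokens, $L$ is held-out log-loss, and $E, A, B, \alpha, \beta$ are empirically fitted constants (with training compute roughly $C \approx 6ND$). "The scaling laws holding" just means this curve keeps bending down smoothly as $N$ and $D$ grow; it says nothing by itself about how much that loss reduction is worth in practice, which is where (2) and (3) come in.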
In my view, the scaling laws (1) may well continue to hold, and I wouldn't be surprised if there's no kink in the curve at all. This is what Nesov is mainly discussing, and I don't disagree with him about it. My view is that the annual training-compute scale-up for frontier models has declined (from more than 10x to less than 3x), and that the value per 10x has also declined (possibly due to having already trained on all the books, leaving only marginal gains from Reddit comments etc.).
As evidence of this, consider that Epoch's estimates of total training compute for OpenAI's leading model are only about 2x that of the original GPT-4, released almost 3 years ago. They did have a version (GPT-4.5) trained with about 10x as much, but were disappointed by it and quickly sunsetted it. xAI has a model with even more than that, but it is only about the 5th-best model and isn't widely acclaimed, despite having scaled the most.
I see the fact that companies can’t economically serve large pretrained models as part of an explanation for the stalling of pretraining, rather than as a counterargument.
Note that I’m not saying pre-training scaling is dead (or anything about Scaling Laws). I’m saying something more like:
Pretraining scaling has delivered substantially fewer practical benefits in the last 2.5 years than people in industry expected, and is no longer the determinant of who has the best model. This is at least a temporary slump, and may be permanent. Compared to the days before GPT-4, I think the tailwind for AI progress provided by pre-training scaling has roughly halved.
Finally, I'll say that I'm talking about scaling as the process of 'just adding more GPUs'. If smart AI researchers improve pretraining efficiency, that is great for pretraining even at the same scale, and is not the thing I'm critiquing. It is more like progress driven by 'just adding more AI researchers', and has different dynamics from the compute scaling that drove everything in the GPT-1 → GPT-4 era.