I largely agree that RL scaling is basically just inference scaling, but I strongly disagree with the claim quoted below, and that disagreement gives me different expectations for AI progress over the next 4-6 years. (I do agree about the longer term: absent new paradigms, inference scaling will become more important and AI progress will slow back down to the prior compute trend of 1.55x efficiency per year, rather than getting 3-4x more compute every year.)
> In the last year or two, the most important trend in modern AI came to an end. The scaling-up of computational resources used to train ever-larger AI models through next-token prediction (pre-training) stalled out.
Vladimir Nesov explains why here in more detail, but the issue is that the scaling laws were already fairly weak, with returns probably closer to logarithmic than linear. The compute increase from GPT-4 to GPT-4.5 was much closer to 10x than 100x, so it’s not surprising that people were disappointed in AI progress: GPT-3 to GPT-4 type progress required 100x compute, and that much additional compute will only come online in 2028 and 2030. We therefore have little evidence that returns have recently gotten worse, especially in a way that suggests that pre-training has stalled.
I think this post is much better viewed as evidence that pre-training isn’t dead, just resting; that RL will in the near term account for far less AI progress than pre-training; and that the big scale-up of RLVR in 2025-2027 is much more of a one-time boost than a second trend that can progress independently of pre-training.
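To put rough numbers on the two growth rates mentioned at the top of this comment, here is a minimal sketch that compounds each annual multiplier over the 4-6 year window. The function name `cumulative_growth` is hypothetical, and holding each rate constant over the whole window is a simplifying assumption rather than a forecast.

```python
def cumulative_growth(annual_multiplier: float, years: int) -> float:
    """Total multiplier after compounding a constant annual multiplier."""
    return annual_multiplier ** years


# Annual multipliers quoted in the comment (assumed constant for illustration):
# 1.55x per year for the prior trend, 3-4x per year for recent compute growth.
rates = {"1.55x/yr": 1.55, "3x/yr": 3.0, "4x/yr": 4.0}

for label, rate in rates.items():
    for years in (4, 6):
        total = cumulative_growth(rate, years)
        print(f"{label} over {years} years: ~{total:,.1f}x")
```

Under these assumptions, four years at 1.55x per year compounds to roughly 6x, while four years at 3x per year compounds to 81x, which is why the two regimes lead to such different medium-term expectations.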
Thanks for the comments. The idea that pretraining has slowed/stalled is in the background in many posts in my series and it is unfortunate I didn’t write one where I addressed it head-on. I don’t disagree with Vladimir Nesov as much as you may think. Some of this is that the terms are slippery.
I think there are three things under discussion:
1. Scaling Laws. The empirical relationship between model size (or training data or compute) and the log-loss when predicting tokens from randomly chosen parts of the same data distribution that it hadn’t trained on.
2. Training Compute Increases. The annual increase in the amount of compute used to pretrain a frontier model.
3. Value of scaling. The practical returns from each 10x to the amount of pre-training compute.
In my view, the scaling laws (1) may well hold, and I wouldn’t be surprised if there isn’t even a kink in the curve. This is what Nesov is mainly discussing and I don’t disagree with him about it. My view is that the annual training compute scale-up for frontier models (2) has declined (from more than 10x to less than 3x) and that the value per 10x (3) has also declined (possibly because models have already been trained on all the books, leaving only marginal gains from Reddit comments etc.).
As evidence, consider that the Epoch estimate for the total training compute of OpenAI’s leading model is only about 2x that of the original GPT-4, released almost 3 years ago. They did have a version (GPT-4.5) that used about 10x as much, but they were disappointed by it and quickly sunsetted it. xAI has a version with even more than that, but it is only about the 5th best model and isn’t widely acclaimed, despite having scaled the most.
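As a rough illustration of what the drop in annual scale-up means, the sketch below computes how long a 100x increase in training compute takes at a constant annual multiplier. The helper `years_to_reach` is a hypothetical name, and the 10x and 3x inputs are the boundary values from the paragraph above ("more than 10x", "less than 3x"), so the results are bounds rather than estimates.

```python
import math


def years_to_reach(target_multiplier: float, annual_multiplier: float) -> float:
    """Years of constant annual growth needed to reach a target total multiplier."""
    return math.log(target_multiplier) / math.log(annual_multiplier)


# Boundary rates from the text; real growth is not constant year to year.
for rate in (10.0, 3.0):
    years = years_to_reach(100.0, rate)
    print(f"100x at {rate:g}x/yr takes about {years:.1f} years")
```

At 10x per year or faster, a 100x scale-up takes at most two years; at 3x per year or slower, it takes more than four.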
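For comparison with the scale-up rates above, the Epoch figure can be turned into an implied annual multiplier. This is a minimal sketch: `annualized_multiplier` is a hypothetical helper, and the inputs (2x over 3 years) are the approximate numbers from the paragraph above, not precise measurements.

```python
def annualized_multiplier(total_multiplier: float, years: float) -> float:
    """Constant annual multiplier that compounds to the given total over the period."""
    return total_multiplier ** (1.0 / years)


# Approximate inputs from the text: about 2x the compute of GPT-4, almost 3 years later.
implied = annualized_multiplier(2.0, 3.0)
print(f"Implied growth: ~{implied:.2f}x per year")
```

That works out to roughly 1.26x per year for the leading model, well below even the "less than 3x" figure.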
I see the fact that companies can’t economically serve large pretrained models as part of an explanation for the stalling of pretraining, rather than as a counterargument.
Note that I’m not saying pre-training scaling is dead (or anything about Scaling Laws). I’m saying something more like:
Pretraining scaling has returned substantially fewer practical benefits in the last 2.5 years than people in industry expected and is no longer the determinant of who has the best model. This is at least a temporary slump, and may be permanent. Compared to the days before GPT-4, I think the tailwind for AI progress provided by pre-training scaling has roughly halved.
Finally, I’ll say that I’m talking about scaling as the process of ‘just adding more GPUs’. If smart AI researchers improve pretraining efficiency, then that is great for pretraining even at the same scale, and is not the thing I’m critiquing. It is more like progress driven by ‘just adding more AI researchers’ and has different dynamics to the compute scaling that drove everything in the GPT 1 → GPT 4 era.