I don’t know what to make of that. Obviously Vladimir knows a lot about state-of-the-art compute, but there are so many details there that never get drawn together into a coherent point that actually disagrees with you or me on this.
It does sound like he is making the argument that GPT-4.5 was actually fine and on trend. I don’t really believe this, and I don’t think OpenAI believed it either (various leaks suggest they were disappointed with it, they barely announced it, and then they shelved it almost immediately).
I don’t think the argument about the original GPT-4 really works. It improved because of post-training, but did they apply that same post-training to GPT-4.5? If so, then the 10x compute really did add little. If not, then why not? Why would OpenAI’s revealed preference be to put little effort into enhancing their most expensive system ever, if not because they didn’t think it was that good?
There is a similar story with reasoning models. It is true that in many ways the advanced reasoning versions of GPT-4o (e.g. o3) are superior to GPT-4.5, but then why not make GPT-4.5 a reasoning model too? If it’s because that would use too much compute or be too slow for users, then those are big flaws with scaling up larger models.
Shouldn’t we be able to point to some objective benchmark if GPT-4.5 was really off trend? It got 10x the SWE-Bench score of GPT-4. That seems like solid evidence that additional pretraining continued to produce improvements of the same magnitude as previous scale-ups. And if there were now even more efficient ways to improve capabilities, like RL post-training on smaller o-series models, why would you expect OpenAI not to focus their efforts there instead? RL was producing gains and hadn’t been scaled as far as self-supervised pretraining, so it was obvious where to invest marginal dollars. GPT-5 is better and faster than GPT-4.5. None of this means pretraining suddenly stopped working or went off trend from the scaling laws, though.
It’s very difficult to do this with benchmarks, because as the models improve, benchmarks come and go. Things that used to be so hard that models couldn’t do better than chance quickly become saturated, and we look for the next thing, then the one after that, and so on. For me, the fact that GPT-4 → GPT-4.5 seemed to involve climbing about half of one benchmark was slower progress than I expected (and the leaks from OpenAI suggest they had similar views). When GPT-3.5 was replaced by GPT-4, people were losing their minds about it — both internally and on launch day. Entirely new benchmarks were needed to deal with what it could do. I didn’t see any of that for GPT-4.5.
I agree with you that the evidence is subjective and disputable. But I don’t think this is a case where the burden of proof falls disproportionately on those saying it was a smaller jump than the previous ones.
(Also, note that this doesn’t have much to do with the actual scaling laws, which measure how much next-token prediction error goes down when you 10x the training compute. I have no reason to think that has gone off trend. What I’m saying is that the real-world gains from this (or the intuitive measure of intelligence) have diminished compared to the previous few 10x jumps. The two are entirely compatible. For example, if a model were trained only on Wikipedia plus an unending supply of nursery rhymes, its prediction error would keep dropping as training continued, but its real-world capabilities wouldn’t keep improving with continued 10x jumps in the number of nursery rhymes added. I think the real world is like this: GPT-4-level systems are already trained on most books ever written and much of the recorded knowledge of the last 10,000 years of civilisation, and it makes sense that adding more Reddit comments wouldn’t move the needle much.)
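To make the distinction concrete, the scaling laws in question are usually stated as a power law in training compute (this is the standard form from the scaling-law literature; the constants here are fitted from data, not specific to any model discussed above):

```latex
% Cross-entropy loss L as a function of training compute C,
% in the standard power-law form, with fitted constants:
%   E     = irreducible loss (entropy of the data itself)
%   A, \alpha = fitted scale and exponent
L(C) = E + A\,C^{-\alpha}
% A 10x jump in compute shrinks the reducible part by a constant factor:
L(10C) - E = 10^{-\alpha}\,\bigl(L(C) - E\bigr)
```

The point being made is that each 10x jump can keep removing the same *fraction* of remaining reducible loss (loss stays on trend) while the real-world value of each jump diminishes, because what remains to predict (e.g. ever more nursery rhymes or Reddit comments) matters less.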
Yes, what you are scaling matters just as much as the fact that you are scaling. So now developers are scaling RL post-training, and pretraining using higher-quality synthetic data pipelines. If the point is just that training on average internet text provides diminishing returns in many real-world use cases, then that seems defensible; that certainly doesn’t seem to be the main recipe any company is using to push the frontier right now. But people often seem to mistake this for something stronger, like “all training is now facing insurmountable barriers to continued real-world gains” or “scaling laws are slowing down across the board” or “it didn’t produce significant gains on meaningful tasks, so scaling is done.” I mentioned SWE-Bench because it seems to show significant real-world utility improvements rather than a trivial decrease in prediction loss. I also don’t think there is such an absolute separation here: to model the data, you have to model the world in some sense. If you keep feeding multimodal LLM agents the right data in the right way, they keep improving on real-world tasks.