Thoughts on Toby Ord's AI Scaling Series

I’ve been reading Toby Ord’s recent sequence on AI scaling a bit. General notes come first, then my thoughts.

Notes

  • The Scaling Paradox basically argues that the scaling laws are actually pretty bad: returns are only logarithmic in compute, so progress will hit a wall fairly quickly unless the next generation or two of models somehow speeds up AI research, we find a new scaling paradigm, etc.

  • Inference Scaling and the Log X Chart says that inference is also not a big deal because the scaling is again logarithmic. My intuition is that this is probably true for widespread adoption of models. It's probably not true if there are threshold effects where a single $100,000 query can be drastically better than a $100 query and allow you to, say, one-shot open research problems. I'm not sure which world we live in.

  • Inference Scaling Reshapes Governance talks about the implications of inference being a big part of model capability. One implication is that instead of a big bang of new model trained ⇒ millions of instances, we get a slower, more gradual wave where more inference = a stronger model and the curve shifts rightward gradually. Another is that compute thresholds matter less, because centralized data centers and single large training runs are less important. A third is that inference-boosted models may be able to help produce synthetic data for the next model iteration or for distillation, leading to very rapid progress in some possible worlds.

  • Is there a Half-Life for the Success Rates of AI Agents? basically argues that AI agents' success rates over task length are best modeled with a constant hazard rate, i.e. each agent has a characteristic "half-life" task length (a minimal sketch of what this implies is below, after this list).

  • Inefficiency of Reinforcement Learning talks about RL being the new paradigm and being 1,000–1,000,000 times less efficient. What is RL here? Basically, in pre-training you predict the next token and immediately learn whether it was right or wrong. In RL you emit a whole chain of reasoning and an answer, and only then get marked right or wrong. Much less signal per token, and much bigger jumps to make (a toy version of this is sketched after the list). Toby argues that RL is not only inefficient but, unlike pre-training, generalizes less, making it even more costly per unit of general intelligence gained.

  • Recent AI Gains are Mostly from Inference Scaling is, again, about how inference scaling is behind much of the recent improvement in benchmark scores.

  • How well does RL scale? is similar: it breaks down how far recent improvements are due to RL vs. inference, as well as how much improvement you get from RL vs. inference for a given amount of compute. The conclusion is roughly that a 10x scale-up in RL compute buys about as much as a 3x scale-up in inference compute.

  • Hourly Costs for AI Agents argues that much of the progress on agentic benchmarks, like the famous METR time-horizon graph, is misleading: it is largely the product of drastically higher spending rather than improved performance per dollar. We're still getting progress, but at a much slower rate than the headline charts would at first suggest.
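
As a quick illustration of the half-life idea above: a constant hazard rate means an agent's success probability decays exponentially with task length, so a single "half-life" determines the whole curve. Here's a minimal sketch in Python, with a purely illustrative one-hour half-life (the number is mine, not Ord's):

```python
def success_rate(task_minutes: float, half_life_minutes: float) -> float:
    """Constant hazard rate implies exponential survival.

    An agent with half-life h succeeds on a task of length t with
    probability 0.5 ** (t / h), i.e. exp(-lambda * t) with lambda = ln(2) / h.
    """
    return 0.5 ** (task_minutes / half_life_minutes)

half_life = 60.0  # hypothetical agent with a one-hour half-life
for t in (15, 30, 60, 120, 240):
    print(f"{t:>3}-minute task: {success_rate(t, half_life):.0%} success rate")
# 15 min: 84%, 30 min: 71%, 60 min: 50%, 120 min: 25%, 240 min: 6%
```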
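
And a toy version of the signal-per-token point from the RL bullet (the rollout length is an invented round number, just to make the ratio concrete):

```python
# Toy comparison of learning signal per token (illustrative numbers only).
# Pre-training: every token in a sequence is a prediction target, so every
# token yields a gradient signal. RL: a whole rollout yields one scalar
# reward at the end.
rollout_tokens = 1_000                # hypothetical length of one reasoning chain
pretraining_signals = rollout_tokens  # one next-token target per token
rl_signals = 1                        # one pass/fail reward per rollout

print(f"feedback per token, pre-training: {pretraining_signals / rollout_tokens:.3f}")
print(f"feedback per token, RL:           {rl_signals / rollout_tokens:.3f}")
print(f"ratio: {pretraining_signals // rl_signals:,}x less feedback per rollout")
# ~1,000x less feedback, before accounting for noisy credit assignment,
# which is roughly how you get into the 1,000-1,000,000x range.
```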

Takeaways

I think I got two things from this series: a better model of the three phases modern LLM scaling has gone through (and of how LLM training works generally), and an argument for longer timelines.

The model of scaling is basically

  • We start with pre-training (2018–2024)

    • In pre-training, the model is given a text as input and asked to predict the next token.

    • This is pretty efficient (you output 1 token, it’s either correct or incorrect)

    • Pre-training seems to make a model generally smarter and more capable in a broad, highly generalizable way. It’s great. We keep doing it until we’ve run through too many orders of magnitude of compute and it becomes uneconomical.

  • We then do RL (2024)

    • In RL, we give the model a specific task where we can evaluate the output (e.g. solve a maths problem or a coding task)

    • RL is much less efficient. You still need a bunch of input, the output is often dozens or hundreds of tokens long, and you only learn whether you were correct (and get to update) after the entire output.

    • RL is also much more limited in what it teaches the model. It causes a significant improvement in the training domain, but that improvement doesn't generalize nearly as well as pre-training's does.

    • We do RL anyway because, having already done a bunch of pre-training, the marginal cost of RL per unit of "improving my model" is low even if the scaling is worse.

  • Around the same time as RL, we also start scaling inference (2024)

    • With inference scaling, we don't change the model at all. We just spend more compute to run it harder in various ways (chains of thought, sampling multiple answers and choosing the best one, self-verification). For that specific run, we get a better-quality answer.

    • This is hideously inefficient. The scaling relationship between inference compute and performance is again logarithmic, and on top of that, unlike RL or pre-training, where you pay the cost once and the improved base model benefits every future query, here you pay the full cost for a single query (a toy cost comparison follows after this list).

    • We do a fair bit of inference scaling anyway. It pushes model performance out a bit further, and if you spend a large amount of money you can get your model to perform far better on benchmarks than it ever will in any real-life use case.
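
To make the cost asymmetry in that last set of bullets concrete, here's a toy comparison (every number below is invented for illustration; none come from Ord's posts). The point is just that a training run is paid for once and amortized over every future query, while inference-scaling compute is paid again on each query:

```python
# Toy cost comparison: amortized training spend vs per-query inference scaling.
# All numbers are invented for illustration.
training_run_cost = 100_000_000   # one-off cost of a pre-training + RL run ($)
lifetime_queries = 1_000_000_000  # queries the resulting model serves over its life

amortized_training_cost = training_run_cost / lifetime_queries  # $0.10 per query
normal_query_cost = 0.01                                        # ordinary inference ($)
scaled_query_cost = normal_query_cost * 1_000                   # "think 1,000x harder" query

print(f"training cost per query (amortized): ${amortized_training_cost:.2f}")
print(f"ordinary inference cost per query:   ${normal_query_cost:.2f}")
print(f"scaled-up inference cost per query:  ${scaled_query_cost:.2f}")
# The $100M training bill is shared across a billion queries; the 1,000x
# inference bill recurs on every single query, and with logarithmic returns
# it only buys a modest quality bump each time.
```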

This is also where the argument for longer timelines comes in (which is not quite the same thing as AI risk being much lower). I think there are actually two key takeaways. One is that the rate of progress we've seen recently on major benchmarks doesn't really reflect underlying progress on a metric we actually care about, like "answer quality per $". The other is that we've hit, or are very close to hitting, a wall, and that the "scaling laws" everyone thinks are a guarantee of future progress are actually pretty much a guarantee of a drastic slowdown and stagnation if they hold.

I buy the first argument. Current benchmark performance is probably slightly inflated and doesn't really represent "general intelligence" as much as we would assume, because of a mix of RL and inference scaling (with the RL possibly chosen specifically to help juice benchmark performance).

I'm not sure how I feel about the second argument. On one hand, the core claims seem to be true: the AI scaling laws do seem to be logarithmic, and we have burned through most of the economically feasible orders of magnitude of training compute. On the other hand, someone could have made the same argument in 2023 when pre-training was losing steam. If I've learned one thing from my favourite progress studies sources, it's that every large trend line is composed of multiple smaller overlapping S-curves. I'm worried that looking only at current approaches hitting economic scaling ceilings is missing the forest for the trees here. Yes, the default result if we keep doing the exact same thing is that we hit the scaling wall. But we've come up with a new thing twice now, and we may well continue to do so. Maybe it's distillation/synthetic data. Maybe it's something else.

Another thing to bear in mind is that, even assuming no new scaling approaches arise, we're still getting a roughly 3x per year effective compute increase from algorithmic progress and a 1.4x increase from hardware improvements, i.e. about 4.2x per year, or roughly an order of magnitude every 1.6 years (the arithmetic is sketched below). Even with logarithmic scaling, and even assuming AI investment as a % of GDP stabilizes, we should see continued immense growth in capabilities over the next few years.
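
For concreteness, the arithmetic behind that last estimate, using the 3x and 1.4x figures from the paragraph above:

```python
import math

algo_gain_per_year = 3.0      # effective-compute multiplier from algorithmic progress
hardware_gain_per_year = 1.4  # effective-compute multiplier from hardware improvements

total_gain_per_year = algo_gain_per_year * hardware_gain_per_year   # 4.2x per year
years_per_order_of_magnitude = 1 / math.log10(total_gain_per_year)  # ~1.6 years

print(f"{total_gain_per_year:.1f}x effective compute per year")
print(f"one order of magnitude every {years_per_order_of_magnitude:.1f} years")
```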
