The Scaling Series Discussion Thread: with Toby Ord
We're trying something a bit new this week. Over the last year, Toby Ord has been writing about the implications of the fact that improvements in AI require exponentially more compute. Only one of these posts so far has been put on the EA Forum.
This week we've put the entire series on the Forum and made this thread for you to discuss your reactions to the posts. Toby Ord will check in once a day to respond to your comments[1].
Feel free to also comment directly on the individual posts that make up this sequence, but you can treat this as a central discussion space for both general takes and more specific questions.
If you haven't read the series yet…
Read it here... or choose a post to start with:
Are the Costs of AI Agents Also Rising Exponentially?
Agents can do longer and longer tasks, but their dollar cost to do these tasks may be growing even faster.
How Well Does RL Scale?
I show that RL-training for LLMs scales much worse than inference or pre-training.
Evidence that Recent AI Gains are Mostly from Inference-Scaling
I show how most of the recent AI gains in reasoning come from spending much more compute every time the model is run.
The Extreme Inefficiency of RL for Frontier Models
The new RL scaling paradigm for AI reduces the amount of information a model could learn per hour of training by a factor of 1,000 to 1,000,000. What follows?
Is There a Half-Life for the Success Rates of AI Agents?
The declining success rates of AI agents on longer-duration tasks can be explained by a simple mathematical model: a constant rate of failing during each minute a human would take to do the task.
Inference Scaling Reshapes AI Governance
The shift towards inference scaling may mean the end of an era for AI governance. I explore the many consequences.
Inference Scaling and the Log-x Chart
The new trend to scaling up inference compute in AI has come hand-in-hand with an unusual new type of chart that can be highly misleading.
The Scaling Paradox
The scaling up of frontier AI models has been a huge success. But the scaling laws that inspired it actually show extremely poor returns to scale. What's going on?
[1] He's not committing to respond to every comment.
Super cool! Great to see others digging into the costs of agent performance. I agree that more people should be looking into this.
I'm particularly interested in predicting the growth of costs for agentic AI safety evaluations, so I was wondering whether you had any takes on this in light of the recent series. Here are a few more specific questions along those lines for you, Toby:
- Given the cost trends you've identified, do you expect the costs of running agents to take up an increasing share of the total costs of AI safety evaluations (including researcher costs)?
- Which dynamics do you think will drive how the costs of AI safety evaluations change over the next few years?
- Any thoughts on under what conditions it would be better to elicit the maximum capabilities of models using a few very expensive safety evaluations, versus prioritising a larger quantity of evaluations that get close to plateau performance (i.e. hitting the sweet spot where their hourly cost / performance is lowest, or alternatively their saturation point)? Presumably a mix is best, but how do we determine what a good mix looks like? What might you recommend to an AI lab's Safety/Preparedness team? I'm thinking about how this might inform evaluation requirements for AI labs.
Many thanks for the excellent series! You have a knack for finding elegant and intuitive ways to explain the trends from the data. Despite knowing this data well, I feel like I learn something new with every post. Looking forward to the next thing.
Thanks Paolo,
I was only able to get weak evidence of a noisy trend from the limited METR data, so it is hard to draw many conclusions from that. Moreover, METR's focus on measuring the exponentially growing length of useful work tasks is potentially more exposed to an exponential rise in compute costs than more safety-related tasks would be. But overall, I'd guess that the amount of useful compute you can apply to safety evaluations is growing faster year-on-year than one can sustainably grow the number of staff.
I'm not sure how the dynamics will shake out for safety evals over the next few years. For example, a lot of recent capability gain has come from RL, which I don't think is sustainable, and I also think the growth in ability via inference compute will limit both the labs' ability to serve the model and people's ability to afford it, so I suspect we'll see some return to eking what they can out of more pre-training. That is, the reasoning era saw labs shift to finding capabilities in new areas with comparatively low costs of scaling, but once they reach the optimal mix, we'll see a mixture of all three going forward. So the future might look a bit less like 2025 and more like a mix of that and 2022-24.
Unfortunately I don't have much insight on the question of ideal mixes of safety evals!
I'm excited about this series!
I would be curious what your take is on this blog post from OpenAI, particularly these two graphs:
While their argument is not very precise, I understand them to be saying something like, "Sure, it's true that the costs of both inference and training are increasing exponentially. However, the value delivered by these improvements is also increasing exponentially. So the economics check out."
A naive interpretation of e.g. the METR graph would disagree: humans are modeled as having a constant hourly wage, so being able to do a task which is 2x as long is precisely 2x as valuable (and therefore can't offset a >2x increase in compute costs). But this seems like an implausible simplification.
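To make that naive interpretation concrete, here is a toy calculation (a minimal sketch; the wage, starting compute cost, and cost growth rate are entirely hypothetical numbers, not figures from the posts). If value scales linearly with task length but compute cost grows by more than 2x per doubling of length, the economics eventually stop checking out:

```python
# Toy model of the "naive interpretation" above (all numbers are assumptions).
# Value of a completed task = constant human wage x task length,
# while compute cost per task grows 3x each time task length doubles.

human_wage = 100.0               # assumed $/hour of human-equivalent work
cost_growth_per_doubling = 3.0   # assumed multiplier on compute cost per doubling of length

length_hours = 1.0
compute_cost = 10.0              # assumed $ of compute for a 1-hour task

for _ in range(8):
    value = human_wage * length_hours   # constant-wage assumption: value proportional to length
    net = value - compute_cost
    print(f"{length_hours:6.1f} h task: value ${value:9,.0f}, compute ${compute_cost:9,.0f}, net ${net:10,.0f}")
    length_hours *= 2
    compute_cost *= cost_growth_per_doubling
```

Under these made-up numbers the net value keeps rising for several doublings before turning negative, which is why how value actually scales with capability seems like the crux.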
Do we have any evidence on how the value of models changes with their capabilities?