Super cool! Great to see others digging into the costs of agent performance. I agree that more people should be looking into this.
I’m particularly interested in predicting the growth of costs for agentic AI safety evaluations, so I was wondering if you had any takes on this given this recent series. Here are a few more specific questions along those lines for you, Toby:
- Given the cost trends you’ve identified, do you expect the costs of running agents to take up an increasing share of the total costs of AI safety evaluations (including researcher costs)?
- Which dynamics do you think will drive how the costs of AI safety evaluations change over the next few years?
- Any thoughts on the conditions under which it would be better to elicit the maximum capabilities of models using a few very expensive safety evaluations, versus prioritising a larger quantity of evaluations that get close to plateau performance (i.e. hitting the sweet spot where their hourly cost per unit of performance is lowest, or alternatively their saturation point; see the toy sketch after this list)? Presumably a mix is best, but how do we determine what a good mix looks like? What might you recommend to an AI lab’s Safety/Preparedness team? I’m thinking about how this might inform evaluation requirements for AI labs.
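To make that trade-off concrete, here is a toy sketch with entirely made-up numbers (the S-shaped cost/performance curve and the 0.95 plateau threshold are assumptions, not anything from this series): it locates both the sweet-spot spend and the near-saturation spend on one hypothetical evaluation.

```python
# Toy illustration (made-up numbers, not real eval data): on an assumed
# S-shaped cost/performance curve, compare the "sweet spot" spend (lowest
# cost per unit of performance) with the spend for near-plateau performance.
import numpy as np

cost = np.linspace(1, 100, 200)                # hypothetical $/task of agent compute
performance = cost**3 / (cost**3 + 30.0**3)    # assumed S-shaped returns to spend

cost_per_perf = cost / performance             # dollars per unit of performance
sweet_spot = cost[np.argmin(cost_per_perf)]    # cheapest capability per dollar
near_plateau = cost[np.argmax(performance >= 0.95)]  # ~saturation point

print(f"sweet spot ≈ ${sweet_spot:.0f}/task, near-plateau ≈ ${near_plateau:.0f}/task")
```

On this made-up curve the two spend levels differ by roughly 2x, which is the gap a Safety/Preparedness team would have to weigh.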
Many thanks for the excellent series! You have a knack for finding elegant and intuitive ways to explain the trends from the data. Despite knowing this data well, I feel like I learn something new with every post. Looking forward to the next thing.
Thanks Paolo,
I was only able to get weak evidence of a noisy trend from the limited METR data, so it is hard to draw many conclusions from that. Moreover, METR’s aim of measuring the exponentially growing length of useful work tasks is potentially more exposed to an exponential rise in compute costs than more safety-related tasks are. But overall, I’d guess that the amount of useful compute you can apply to safety evaluations is growing faster year-on-year than one can sustainably grow the number of staff.
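If that holds, a purely illustrative compound-growth sketch (every number here is an assumption, not a measurement from the post) shows how quickly compute would come to dominate total evaluation costs:

```python
# Toy compound-growth sketch (all rates assumed): if usable eval compute
# grows ~2.5x/year while headcount grows ~1.3x/year sustainably, compute's
# share of total evaluation cost rises quickly from a small starting share.
compute_cost, staff_cost = 1.0, 9.0   # hypothetical starting split (10% compute)
for year in range(6):
    share = compute_cost / (compute_cost + staff_cost)
    print(f"year {year}: compute share of eval costs ≈ {share:.0%}")
    compute_cost *= 2.5               # assumed growth in eval compute spend
    staff_cost *= 1.3                 # assumed sustainable headcount growth
```

Under those assumed rates, compute goes from 10% to a majority of the budget within about four years.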
I’m not sure how the dynamics will shake out for safety evals over a few years. For example, a lot of recent capability gain has come from RL, which I don’t think is sustainable, and I also think the growth in ability via inference compute will limit both the ability to serve the model and people’s ability to afford it, so I suspect we’ll see some return to eking what they can out of more pretraining. That is, the reasoning era saw labs shift to finding capabilities in new areas with comparatively low costs of scaling, but once they reach the optimal mix, we’ll see a mixture of all three going forward. So the future might look a bit less like 2025 and more like a mix of that and 2022-24.
Unfortunately I don’t have much insight on the question of ideal mixes of safety evals!