Thanks Paolo,
I was only able to get weak evidence of a noisy trend from the limited METR data, so it is hard to draw many conclusions from it. Moreover, METR’s aim of measuring the exponentially growing length of useful work tasks makes their evals potentially more exposed to an exponential rise in compute costs than more safety-related tasks would be. But overall, I’d think that the year-on-year growth in the amount of useful compute you can apply to safety evaluations is probably faster than the rate at which one can sustainably grow the number of staff.
I’m not sure how the dynamics will shake out for safety evals over the next few years. E.g. a lot of recent capability gain has come from RL, which I don’t think is sustainable, and I also think that gains from inference compute will be limited by both the ability to serve the model and people’s ability to afford it, so I suspect we’ll see some return to eking what can be had from more pretraining. I.e. in the reasoning era, labs shifted to finding capabilities in new areas with comparatively low costs of scaling, but once they reach the optimal mix, we’ll see a blend of all three (pretraining, RL, and inference compute) going forward. So the future might look a bit less like 2025 and more like a mix of that and 2022–24.
Unfortunately I don’t have much insight on the question of ideal mixes of safety evals!