Interesting ideas! A few quick responses:
The data for the early ‘linear’ regime for these models actually looks even better than you suggest here. The curves are roughly straight lines (on a log-log plot), but with slopes greater than 1. Eyeballing it, I think some have slope 5 or higher (i.e. increasing returns, with time horizon growing as roughly the 5th power of compute). See my 3rd chart here. If anything, this strengthens your case for treating that regime separately from the poorly scaling high-compute regime later on.
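For concreteness, here is a minimal sketch (with made-up numbers, not the actual METR data) of what “slope 5 on a log-log plot” means: fitting a straight line to log(time horizon) against log(compute) recovers the exponent of the underlying power law.

```python
import numpy as np

# Made-up (compute, time-horizon) points following an exact power law:
# time_horizon = 0.01 * compute**5, i.e. slope 5 on a log-log plot.
compute = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
time_horizon = 0.01 * compute**5

# A straight-line fit in log-log space recovers the exponent.
slope, _ = np.polyfit(np.log(compute), np.log(time_horizon), 1)
print(round(slope, 2))  # → 5.0
```

A slope of 5 means a 10x increase in compute buys a 100,000x longer time horizon in this regime, which is why it matters whether the data is in this regime or the saturating one.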
I’d also suspected that applying extra RL to a model (e.g. o3 compared to o1) would produce a curve that dominated the earlier model’s. But that doesn’t seem to be the case. See the curves in the final chart here, where o1-preview is dominated, but the other OpenAI models’ curves all cross each other (each being cheaper for the same horizon at some horizons and more expensive at others).
Even when the curves do dominate each other neatly, as in your fake data, I noticed that the ‘sweet spots’ and the ‘saturation points’ can still be getting more expensive, both in terms of $ and in terms of $/hr. I’m not sure what to make of that, though!
I think you’re on to something with the idea that there is a problematic kind of inference scaling and a fine kind, though I’m not sure you’ve quite put your finger on how to distinguish them. We can certainly talk about the super-linear scaling regime and the sub-linear regime (which meet at what I call the sweet spot), but I’m not sure these are the two types you refer to in qualitative terms near the top.
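To make the sweet spot concrete: on a made-up smooth curve in log-log space (a logistic here, chosen for illustration only, not the real data), it is the point where the local log-log slope falls back through 1, i.e. where super-linear returns give way to sub-linear ones.

```python
import numpy as np

# Made-up curve in log-log space (a logistic), not the actual METR data:
# slope rises above 1 in the middle, then falls back below 1.
log_compute = np.linspace(0.0, 6.0, 601)
log_horizon = 8.0 / (1.0 + np.exp(-(log_compute - 3.0)))

slope = np.gradient(log_horizon, log_compute)  # local log-log slope

# The 'sweet spot': after the slope peaks, the point where it falls
# back to 1, marking the switch from super-linear to sub-linear scaling.
peak = int(np.argmax(slope))
sweet = log_compute[peak + int(np.argmin(np.abs(slope[peak:] - 1.0)))]
```

For this particular curve the crossing lands at log-compute ≈ 4.76; the same recipe applies to any smooth fit of the empirical curves.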
Note that these METR cost vs. time-horizon curves are not at all Pareto frontiers. They just correspond to what you get if you cut the agent off early, so they are probably badly under-elicited as estimates of “optimal performance for a given cost”. For example, if an agent doesn’t complete some part of a task until it is nearly out of budget, it will do much worse on this metric at low cost (this is true of gpt-5, for instance). My guess is that with better elicitation you get closer to the regime I expect.
At some point, METR might run experiments that try to elicit performance at lower budgets, so that we can actually get a Pareto frontier.
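Once there are multiple (cost, time-horizon) points per model, extracting the Pareto frontier is straightforward; a minimal sketch with made-up numbers (a point is kept only if nothing cheaper achieves at least as long a horizon):

```python
# Made-up (cost in $, time horizon in minutes) points for one model,
# e.g. from eliciting the agent at several different budgets.
points = [(1.0, 2.0), (2.0, 8.0), (3.0, 6.0), (5.0, 20.0), (8.0, 18.0)]

def pareto_frontier(points):
    """Keep only points not dominated by a cheaper-or-equal point
    with an at-least-as-long time horizon."""
    frontier = []
    best_horizon = float("-inf")
    for cost, horizon in sorted(points):  # sweep in order of cost
        if horizon > best_horizon:
            frontier.append((cost, horizon))
            best_horizon = horizon
    return frontier

print(pareto_frontier(points))  # → [(1.0, 2.0), (2.0, 8.0), (5.0, 20.0)]
```

The dominated points, (3.0, 6.0) and (8.0, 18.0), are exactly the kind of artifact an early-cutoff curve would wrongly keep.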
I agree my abstractions might not be the right ones, and maybe there is a cleaner way to think about this.
Good point about the METR curves not being Pareto frontiers.