Thanks so much Mo! I am tempted to make the following updates already—does this seem roughly right? Or is this still too high?
Token usage at 8 hrs centered on 5M tokens, with an upper limit closer to 100M. The reasoning for the upper range of 100M is that more complex tasks (assuming those from the study you quoted were low-hanging fruit) might push this higher (as indicated by the compiler example), while efficiency gains might push it lower. Comparing against METR’s GPT-5.1-Codex-Max work from <6 months ago, it already seems (and this is very, very crude) that usage might be going lower.
Token price centered at $1 per million tokens, instead of $5. I could make this even lower, as $1 might be part of a downward trend, but at the same time this low price seems to be due more to cache tokens, which I had ignored in my analysis; the input and output tokens still seem priced at roughly the price I found.
At the same time, I also feel like these numbers might still be too high, especially token price. The reason is that the super helpful links you sent point to pretty steep downward trends in token cost, and point well taken on cache tokens being much cheaper.
(I’m not at all an expert on any of this, please discount appropriately)
Agree with the reasoning for the directional adjustment and bounds; magnitude-wise it seems a bit overcorrected? SemiAnalysis’ figures roughly suggest a 15M center. But you’re on track to becoming correct given token efficiency trends anyhow.
I wish I had a more empirically grounded sense of how token usage varies by type of task, fixing task duration at 8 hours for a human professional (that you’d pay $400/day for, say). My guess from comparing model vs human jaggedness (e.g. this) is that leadership-level / early-employee / entrepreneurial / high-context / taste-heavy work would require way more tokens to get 8 hours of work done than the routine analyst-type / junior SWE etc. tasks typical of benchmarks.
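As a rough sanity check on the figures floated so far, here is a minimal Python sketch that multiplies token usage by token price and compares the result with the $400/day rate. The labels and the function name `task_cost_usd` are made up for illustration, and every number is one of the crude guesses from this thread, not a measurement.

```python
# Back-of-envelope cost of 8 hours of agent work: tokens used x price per token.
# All figures are the rough guesses quoted in this thread, not measured values.

TOKENS_MILLIONS = {
    "proposed center": 5,
    "SemiAnalysis-style center": 15,
    "upper bound": 100,
}
PRICE_PER_MILLION_USD = {"earlier estimate": 5.0, "revised estimate": 1.0}
HUMAN_DAY_RATE_USD = 400.0  # the "$400/day, say" professional


def task_cost_usd(tokens_millions, price_per_million_usd):
    """Cost in USD of a task that consumes `tokens_millions` million tokens."""
    return tokens_millions * price_per_million_usd


for usage_label, tokens in TOKENS_MILLIONS.items():
    for price_label, price in PRICE_PER_MILLION_USD.items():
        cost = task_cost_usd(tokens, price)
        share = cost / HUMAN_DAY_RATE_USD
        print(f"{usage_label:>26} @ {price_label:<16}: "
              f"${cost:7.2f} ({share:.1%} of the human day rate)")
```

Under these guesses the cost ranges from $5 (5M tokens at $1 per million) to $500 (100M tokens at $5 per million), so only the most pessimistic corner exceeds the $400 human day rate.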
My sense is that the global average cost per token will go down a lot due to the following, though the mix is very unclear:
a key driver of inference demand going forward being very cache-token-heavy agentic workflows
a rising share of demand being satisficing, not maximising, w.r.t. output quality for an ever-growing share of tasks (e.g. plan with Opus → code with Sonnet or even DeepSeek models at a 1-2 OOM cheaper price point)
race-to-the-bottom pricing wars (DeepSeek again)
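To make the "mix" question concrete, here is a small Python sketch of how a blended price per million tokens falls as cached tokens and cheaper models take a larger share. Every price and share below is a HYPOTHETICAL placeholder chosen for easy arithmetic, not a quote from any provider's price list; `blended_price` is an illustrative helper, and the 10x discount simply stands in for the low end of the "1-2 OOM cheaper" range.

```python
import math


def blended_price(mix):
    """Weighted-average price per million tokens.

    mix: list of (share_of_tokens, usd_per_million) pairs; shares must sum to 1.
    """
    total_share = sum(share for share, _ in mix)
    if not math.isclose(total_share, 1.0):
        raise ValueError(f"shares sum to {total_share}, expected 1.0")
    return sum(share * price for share, price in mix)


# HYPOTHETICAL token-type mix for a cache-heavy agentic workflow on one model.
frontier = blended_price([
    (0.90, 0.50),   # cached input tokens (placeholder price)
    (0.08, 5.00),   # uncached input tokens (placeholder price)
    (0.02, 25.00),  # output tokens (placeholder price)
])

# HYPOTHETICAL routing: 20% of tokens stay on the expensive model for planning,
# 80% go to a model assumed to be 10x (1 OOM) cheaper for execution.
cheaper_model = frontier / 10
routed = blended_price([(0.20, frontier), (0.80, cheaper_model)])

print(f"single-model blended price: ${frontier:.3f} per million tokens")
print(f"with routing to cheaper model: ${routed:.3f} per million tokens")
```

The point of the sketch is only that the average price depends heavily on the assumed shares, which is exactly the part that is unclear; swapping in real prices and observed token mixes would be needed before drawing any conclusion.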