SemiAnalysis’ recent newsletter provides some data points on token spend vs labor cost ROIs for actual 1-20 hour tasks.
SemiAnalysis has written and talked extensively about our Claude Code usage, but it is important to emphasize that agentic AI is no longer limited to just coding. Our analysts are using agents every day to convert Excel models into dashboards, create charts for all our notes, build financial models and analyze company earnings, and much more. These are all tasks that either 1) we simply wouldn’t have been able to do before or 2) would’ve previously taken our junior analysts many hours, taking them away from far more value added tasks.
The table below shows a handful of real examples from our own workflows, comparing token spend against what the equivalent human labor would have cost:
… We estimate that the true blended price per million tokens for running Opus 4.7 on agentic tasks at $0.99 despite the sticker price being $5/$25 per MTok. Agentic workloads have extremely high input-to-output ratios (our Claude Code usage has a ratio of about 300:1) and high cache hit rates (90%+). Because cached input tokens only cost $0.50/MTok, most of the tokens end up in the cheapest tier. We walk through the full methodology here.
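To make the arithmetic in the quoted passage concrete, here is a rough sketch of how a blended price falls out of the quoted figures ($5/$25 per MTok sticker price, $0.50/MTok cached input, a 300:1 input-to-output ratio, a 90% cache hit rate). This is not SemiAnalysis's actual methodology, which isn't reproduced here; the function name and the simplification that every input token is either a cache hit or billed at the full input rate (ignoring any cache-write premium) are my own assumptions.

```python
def blended_price_per_mtok(input_price, output_price, cached_price,
                           in_out_ratio, cache_hit_rate):
    """Average $ per million tokens across cached input, fresh input, and output.

    Assumption: each input token is either a cache hit (billed at cached_price)
    or a miss (billed at input_price); cache-write surcharges are ignored.
    """
    input_tokens = float(in_out_ratio)   # input tokens per 1 output token
    output_tokens = 1.0
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    total_cost = (cached * cached_price
                  + uncached * input_price
                  + output_tokens * output_price)
    return total_cost / (input_tokens + output_tokens)


# Figures quoted from the newsletter excerpt above.
price_at_90 = blended_price_per_mtok(5.0, 25.0, 0.50, 300, 0.90)
price_at_95 = blended_price_per_mtok(5.0, 25.0, 0.50, 300, 0.95)
print(f"90% cache hits: ${price_at_90:.2f}/MTok")
print(f"95% cache hits: ${price_at_95:.2f}/MTok")
```

Under these simplifications, a 90% hit rate gives roughly $1.03/MTok and a 95% hit rate roughly $0.81/MTok, so the quoted $0.99 is in the range you'd expect for "90%+" cache hits.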
Eyeballing, it looks like 8 hours of analyst-type work costs them $7-30 in Opus 4.7 token spend, so (very roughly) 7-30M tokens at their true blended price of ~$1 per M tokens, in contrast with the post’s 40-1,300M token estimate, and already squarely here. I expect token usage to drop further for a given task with more advanced models, and also to vary a lot depending on (essentially) how much the big AI companies prioritise RLVR-ing them and on model jaggedness, but also for doable tasks to get much more complicated, like this and more.
Epoch BOTEC-ed a related question last year, prior to Claude Code: How many digital workers could OpenAI deploy? My main takeaway was “worker equivalents is probably more misleading than helpful if people just skim headline numbers” (which everyone does, speaking as someone who sometimes needs to produce headline numbers).
On the tasks that AIs are able to perform today, how many “human-equivalent digital workers” could frontier AI labs deploy to work on them?
Based on a speculative back-of-the-envelope calculation, we estimate that companies like OpenAI have the hardware to deploy on the order of 7 million digital workers, with a wide 90% confidence interval of 400,000 to around 300 million. This doesn’t mean that OpenAI could do the jobs of 7 million human employees today, because AIs can’t fully substitute for humans. But as AI progress continues, AIs will be able to perform an increasing fraction of the tasks that humans currently do.
Thanks so much Mo! I am tempted to make the following updates already—does this seem roughly right? Or is this still too high?
Token usage at 8 hrs centered on 5M tokens, with an upper limit closer to 100M. The reasoning for the upper range of 100M is that more complex tasks (assuming those from the study you quoted were low-hanging fruit) might push this higher (as indicated by the compiler example), while efficiency gains might push it lower. Judging from METR’s GPT-5.1-Codex-Max work <6 months ago, it already seems (and this is very, very crude) that usage might be going lower.
Token price centered at $1 per million tokens, instead of $5. I could make this even lower, as $1 might show a downward trend, but this low price seems to be due more to cached tokens, which I had ignored in my analysis; the input and output tokens still seem priced at roughly the price I found.
At the same time, I also feel like these numbers might still be too high—especially token price. The reason is that the super helpful links you sent point at pretty steep downward trends on token cost and point well taken on cache tokens being much cheaper.
(I’m not at all an expert on any of this, please discount appropriately)
Agree with the reasoning for the directional adjustment and bounds, though magnitude-wise it seems a bit overcorrected? SemiAnalysis’ figures roughly suggest a 15M center. But you’re on track to becoming correct given token efficiency trends anyhow.
I wish I had a more empirically-grounded sense of how token usage varies by type of task, fixing task duration at 8 hours for a human professional (that you’d pay $400/day for, say). My guess from comparing model vs human jaggedness (e.g. this) is that leadership-level / early-employee / entrepreneurial / high-context / taste-heavy work would require way more tokens to get 8 hours of work done than the routine analyst-type / junior SWE etc tasks typical of benchmarks
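For a sense of scale, here is a quick back-of-the-envelope sketch comparing the token budgets floated in this thread (5M, 15M, and 100M tokens per 8-hour task) against the $400/day professional mentioned above, at the ~$1 per million token blended price. The scenario labels and variable names are just illustrative, and it assumes the same blended price applies at every budget, which may not hold for harder tasks.

```python
HUMAN_DAY_RATE_USD = 400.0   # the $400/day professional from the comment above
PRICE_PER_MTOK_USD = 1.0     # ~$1/MTok blended price discussed in the thread

# Token budgets (in millions) for 8 hours of work, as floated in this thread.
scenarios_mtok = {
    "proposed center": 5,
    "SemiAnalysis-implied center": 15,
    "proposed upper limit": 100,
}


def token_cost_usd(mtok, price_per_mtok=PRICE_PER_MTOK_USD):
    """Token spend in dollars for a task using `mtok` million tokens."""
    return mtok * price_per_mtok


results = {}
for label, mtok in scenarios_mtok.items():
    cost = token_cost_usd(mtok)
    results[label] = (cost, HUMAN_DAY_RATE_USD / cost)
    print(f"{label}: {mtok}M tokens = ${cost:.0f}, "
          f"human day rate is {HUMAN_DAY_RATE_USD / cost:.1f}x that")
```

Even at the 100M upper limit, token spend stays a small multiple below the human day rate under these assumptions; the open question is how far taste-heavy or high-context work would push the token count beyond that range.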
My sense is that the global average cost per token will go down a lot due to the following, though the mix is very unclear:
a key driver of inference demand going forward being agentic workflows that are very heavy on cached tokens
a rising share of demand being satisficing rather than maximising w.r.t. output quality, for an ever-growing share of tasks (e.g. plan with Opus → code with Sonnet, or even DeepSeek models at a 1-2 OOM cheaper price point)
race to the bottom pricing wars (DeepSeek again)