What scale is the METR benchmark on? I see a line saying “Scores are normalized such that 100% represents a 50% success rate on tasks requiring 8 human-expert hours”, but is the 0% point on the scale 0 hours?
METR does not think that 8 human hours is sufficient autonomy for takeover; in fact 40 hours is our working lower bound.
METR has an official internal view on what time horizons correspond to “takeover not ruled out”?
See the GPT-5 report. “Working lower bound” is maybe too strong; it’s probably more accurate to describe it as an initial guess at a warning threshold for rogue replication and 10x uplift (if we can even measure time horizons that long). I don’t know the exact reasoning behind 40 hours, but one relevant fact is that humans can’t really start viable companies using plans that take only a ~week of work. IMO if AIs could do the equivalent with only a 40-human-hour time horizon while continuously evading detection, they’d need to use their own advantages and to have made up many of their current disadvantages relative to humans (like being bad at adversarial and multi-agent settings).
Indeed, the 0% point is zero hours, so compared to the METR plot the time horizon is simply divided by 8 hours.
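As a minimal sketch of that linear scaling (the function name is hypothetical, and leaving scores above 100% unclipped is an assumption rather than something taken from the project's code):

```python
def normalized_score(time_horizon_hours: float, reference_hours: float = 8.0) -> float:
    """Map a 50%-success time horizon (in human-expert hours) to a percentage.

    0 hours maps to 0%, and `reference_hours` maps to 100%. The mapping is
    linear in hours; values above the reference are NOT clipped here
    (an assumption made for illustration).
    """
    if time_horizon_hours < 0:
        raise ValueError("time_horizon_hours must be non-negative")
    if reference_hours <= 0:
        raise ValueError("reference_hours must be positive")
    return 100.0 * time_horizon_hours / reference_hours


# A 2-hour time horizon under the 8-hour normalization.
score_8h = normalized_score(2.0)

# The same time horizon if the reference point were 40 hours instead.
score_40h = normalized_score(2.0, reference_hours=40.0)
```

Under this sketch, moving the reference from 8 to 40 hours divides every score by five without changing the ordering of models.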
I agree the 8 hours is somewhat arbitrary, and I had missed that METR had a more ‘official’ stance on it. I’ve now opened an issue to see whether anyone else had reasons to make it 8 hours.
(For context, I did most of the benchmark literature review and data collection for this project.)
Edit (29 Jan 2026): The change to 40-hour normalization is now live!
Could you please explain your reasoning on 40 hours?