Some great new analysis by Gus Hamilton shows that AI agents probably don’t obey a constant hazard rate / half-life after all. Instead their hazard rates systematically decline as the task goes on.
This means that their success rates on tasks beyond their 50%-horizon are better than the simple model suggests, but those for tasks shorter than the 50% horizon are worse.
I had suggested a constant hazard rate was a good starting assumption for how their success rate at tasks decays with longer durations. It is the simplest model and fits the data OK. But Gus used the standard second-simplest model from survival analysis (the Weibull distribution rather than the exponential distribution). It has a second parameter, K, which represents how the hazard rate changes with time (if at all). If K=1, there is a constant hazard rate, so the exponential distribution is a special case of the Weibull. But if K<1, then hazard decreases over time (like the Lindy effect), and if it is greater, hazard increases (like aging).
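The role of K can be made concrete with a small sketch. Assuming the standard Weibull parameterisation, where the survival function is S(t) = exp(−(t/λ)^K), the hazard rate is h(t) = (K/λ)·(t/λ)^(K−1); the parameter values below are just illustrative:

```python
import math

def weibull_hazard(t, lam, k):
    """Instantaneous failure rate h(t) = (k/lam) * (t/lam)**(k-1)."""
    return (k / lam) * (t / lam) ** (k - 1)

# With k = 1 the hazard is constant (the exponential special case);
# with k < 1 it falls over time; with k > 1 it rises.
for k in (0.6, 1.0, 1.5):
    early = weibull_hazard(1.0, 10.0, k)
    late = weibull_hazard(100.0, 10.0, k)
    trend = "falling" if late < early else ("constant" if late == early else "rising")
    print(f"k={k}: hazard at t=1 is {early:.4f}, at t=100 is {late:.4f} ({trend})")
```

For K < 1 the hazard falls as the task goes on: an agent that has already survived a long way into a task is less likely to fail per unit time than it was at the start.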
Gus found that the estimated values for K were below 1 for all the models, showing that *all* of them had decreasing hazard rates.
A distribution that generalises another is always going to fit the data at least as well as my exponential distribution, so fit alone wouldn’t be decisive. But the way that every single model has K statistically significantly below 1 convinces me he is right.
So what does this mean?
One thing is that it gives very different estimated success rates for tasks much shorter or longer than the 50% horizon (which METR focuses on because it is easier to reliably estimate). e.g. use the Weibull to estimate the 99% horizon (or 10% horizon).
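Under the Weibull assumption above, the horizon for any target success rate follows directly from the 50% horizon and K: setting S(t) = p gives t = h50 · (ln(1/p)/ln 2)^(1/K). A sketch, with an illustrative 1-hour 50% horizon and K = 0.6 (roughly the fitted AI value):

```python
import math

def weibull_horizon(h50, k, p):
    """Task length at which the predicted success rate equals p,
    for a Weibull with 50% horizon h50 and shape parameter k."""
    return h50 * (math.log(1 / p) / math.log(2)) ** (1 / k)

h50, k = 60.0, 0.6  # minutes; illustrative values, not fitted ones
print(f"99% horizon: {weibull_horizon(h50, k, 0.99):.2f} min")
print(f"10% horizon: {weibull_horizon(h50, k, 0.10):.0f} min")
```

Note how asymmetric this is for K < 1: the 99% horizon is a tiny fraction of the 50% horizon, while the 10% horizon is many times longer than it.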
Another thing is that the AI agents mainly have a K of about 0.6, while the human value of K is significantly lower, at about 0.4. This means even if they have the same 50% horizon, humans can do better on really long tasks (and worse on really short ones).
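This trade-off can be seen by holding the 50% horizon fixed and varying K. Writing the Weibull survival function in terms of the 50% horizon, S(t) = 0.5^((t/h50)^K), a quick comparison of the rough fitted shapes (K ≈ 0.6 for agents, K ≈ 0.4 for humans):

```python
def weibull_success(t, h50, k):
    """Predicted success rate on a task of length t: S(t) = 0.5 ** ((t/h50)**k)."""
    return 0.5 ** ((t / h50) ** k)

# Same 50% horizon (normalised to 1), AI-like k = 0.6 vs human-like k = 0.4.
for t in (0.1, 1.0, 10.0):  # task length in units of the 50% horizon
    ai = weibull_success(t, 1.0, 0.6)
    human = weibull_success(t, 1.0, 0.4)
    print(f"t = {t:>4}x horizon: AI {ai:.1%}, human {human:.1%}")
```

At the 50% horizon itself the two curves cross by construction; the lower human K wins on the 10× task and loses on the 0.1× task.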
As this shows, for a fixed 50% horizon length, it isn’t clearly better or worse to have a lower value of K. Lower values are good for really long tasks (where success rates are low anyway), but worse when you need high reliability on shorter tasks.
As a word of warning, I was quite sure before METR released its Opus 4.5 results that it was going to have a more human-like value of K, since it had a great 50% horizon, but only an average showing at 80%. But the estimates are that its value of K is similar. I’m not sure why that is, but it might be due to the fact that there isn’t much data to go on here and things are quite noisy for any individual model.
So, from Gus’s results, it still looks like there is some important gap between how human success rates drop off at longer tasks versus how AI agents do.
Gus also compares his two-parameter Weibull model of the data to METR’s two-parameter log-logistic model. He finds that they are similar, but with the log-logistic fitting slightly better. So it isn’t clear which of these to use if you have the choice. They differ quite a lot in the tails of the distribution (i.e. in estimated success rates for very short or very long tasks). e.g. the Weibull says the 99% horizon is 1/20th as long as the log-logistic predicts. That’s a big deal and the data doesn’t tell us which to favour! I’d slightly favour the Weibull, on the grounds that it is more plausible ex ante. But maybe the bigger lesson is that it is unknown which is right, and thus the 99% horizons (necessary for much useful work) are deeply uncertain.
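The tail divergence can be sketched directly. The log-logistic survival function is S(t) = 1/(1 + (t/h50)^β), with median h50, so its p-horizon is h50·((1−p)/p)^(1/β). The shape parameters below are illustrative, not Gus’s fitted values, so the exact ratio here won’t match his 20× figure; the point is just that two models agreeing near the median can disagree enormously at the 99% level:

```python
import math

def weibull_horizon(h50, k, p):
    # Weibull: S(t) = exp(-(t/lam)**k), re-expressed via the 50% horizon.
    return h50 * (math.log(1 / p) / math.log(2)) ** (1 / k)

def loglogistic_horizon(h50, beta, p):
    # Log-logistic: S(t) = 1 / (1 + (t/h50)**beta); the median is h50.
    return h50 * ((1 - p) / p) ** (1 / beta)

# Both medians fixed at 1 hour; shapes chosen for illustration only.
h50 = 60.0  # minutes
w = weibull_horizon(h50, 0.6, 0.99)
ll = loglogistic_horizon(h50, 0.9, 0.99)
print(f"Weibull 99% horizon: {w:.3f} min; log-logistic: {ll:.3f} min; "
      f"ratio {ll / w:.1f}x")
```

Both curves pass through the same 50% point, yet their 99% horizons differ by a large multiple, which is exactly why the choice of family matters so much for high-reliability estimates.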