Remember that this is graphing the length of task that the AI can complete with an over-50% success rate. The length of task an AI can do reliably is much shorter than what is shown here (see Figure 4 in the paper): for an 80% success rate it’s 30 seconds to a minute.
Being able to do a month’s worth of work at a 50% success rate would be very useful and productivity-boosting, of course, but would it really be close to recursive self-improvement? I don’t think so. I feel that some parts of complex projects need reliable code, and that will always be a bottleneck.
Figure 4 averages across all models. I think Figure 6 is more illuminating:
Basically, the 80% threshold is ~2 doublings behind the 50% threshold, or ~1 year. An extra year isn’t nothing! But you’re still not getting to 10+ year timelines.
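To make the arithmetic concrete, here is a minimal sketch of that lag calculation. The 6-month doubling time is just the value implied by “~2 doublings ≈ ~1 year” above, not a number taken from the paper, and the variable names are my own:

```python
# Illustrative only: how a fixed ~2-doubling lag between the 80% and 50%
# horizons translates into calendar time and into a ratio of task lengths.

doubling_time_months = 6      # assumed; implied by "~2 doublings ~ 1 year"
doublings_behind = 2          # 80% horizon trails the 50% horizon by ~2 doublings

calendar_lag_months = doublings_behind * doubling_time_months   # ~12 months
horizon_ratio = 2 ** doublings_behind                           # ~4x at any given date

print(f"calendar lag: ~{calendar_lag_months} months")
print(f"50% horizon / 80% horizon: ~{horizon_ratio}x")
# So if the 50% trend tops out at some task length T, the 80% horizon
# would be around T/4 at that point.
```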
The more task lengths the 80% threshold has to run through before it reaches a task length we’d regard as AGI-complete, though, the more different the tasks at the end of the sequence are from those at the beginning, and therefore the more likely it is that the doubling trend will break down somewhere along the sequence. That seems to me like the main significance of titotal’s point, not the time gained if we just assume the current 80% doubling trend will continue right to the end of the line. Plausibly, tasks of 30 seconds to a minute are more different from weeks-long tasks than 15-minute tasks are.
So the claim is:
1. The 50% trend will break down at some task length T.
2. The 80% trend will therefore break at T/4 (since it’s ~2 doublings behind).
3. And maybe T is large enough to cause some catastrophic risk, but T/4 isn’t.
?
Yes. (Though I’m not saying this will happen, just that it could, and that is more significant than a short delay.)
Fair enough! My guess is that when the trend breaks, it will be because things have gone super-exponential rather than sub-exponential (some discussion here), but yeah, I agree that this could happen!
Just so people know what you’re referring to, this is Figure 4:
Ben West noted in the blog post that
We think these results help resolve the apparent contradiction between superhuman performance on many benchmarks and the common empirical observations that models do not seem to be robustly helpful in automating parts of people’s day-to-day work: the best current models—such as Claude 3.7 Sonnet—are capable of some tasks that take even expert humans hours, but can only reliably complete tasks of up to a few minutes long.