huw comments on METR: Measuring AI Ability to Complete Long Tasks

huw 20 Mar 2025 1:18 UTC
8 points
1 ∶ 1
Reposting this from Daniel Eth:
On the one hand, this seems like not much (shouldn’t AGIs be able to hit ‘escape velocity’ and operate autonomously forever?), but on the other, being able to do a month’s worth of work coherently would surely get us close to recursive self-improvement.
- titotal 20 Mar 2025 10:47 UTC
  9 points
  0 ∶ 1
  Parent
  Remember that this is graphing the length of task that the AI can do with a success rate of over 50%. The length of task that an AI can do reliably is much shorter than what is shown here (you can look at figure 4 in the paper): for an 80% success rate it’s 30 seconds to a minute.
  Being able to do a month’s worth of work at a 50% success rate would be very useful and productivity-boosting, of course, but would it really be close to recursive self-improvement? I don’t think so. I feel that some part of complex projects needs reliable code, and that will always be a bottleneck.
  - Ben_West🔸 20 Mar 2025 15:20 UTC
    20 points
    1 ∶ 0
    Parent
    Figure four averages across all models. I think figure six is more illuminating:
    Basically, the 80% threshold is ~2 doublings behind the 50% threshold, or ~1 year. An extra year isn’t nothing! But you’re still not getting to 10+ year timelines.
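A rough sketch of the arithmetic behind "~2 doublings, or ~1 year", for anyone who wants to vary the inputs. The ~6-month doubling time is only what those two numbers jointly imply, not a figure quoted in this thread, and the 60-minute horizon is a made-up illustration, not a measured value.

```python
def lag_factor(doublings: float) -> float:
    """Ratio of the 50% task length to the 80% task length when they sit `doublings` apart."""
    return 2.0 ** doublings


def lag_months(doublings: float, doubling_time_months: float) -> float:
    """Calendar time for the 80% threshold to reach where the 50% threshold is today."""
    return doublings * doubling_time_months


def horizon_at_80(horizon_at_50_minutes: float, doublings: float) -> float:
    """80% task length implied by a given 50% task length, assuming a constant gap."""
    return horizon_at_50_minutes / lag_factor(doublings)


DOUBLINGS_BEHIND = 2          # "~2 doublings behind", from the comment above
DOUBLING_TIME_MONTHS = 6.0    # assumption: implied by "~2 doublings, or ~1 year"
EXAMPLE_HORIZON_MINUTES = 60.0  # hypothetical 50% task length, for illustration only

print(lag_factor(DOUBLINGS_BEHIND))                              # factor between the two thresholds
print(lag_months(DOUBLINGS_BEHIND, DOUBLING_TIME_MONTHS))        # lag in months
print(horizon_at_80(EXAMPLE_HORIZON_MINUTES, DOUBLINGS_BEHIND))  # implied 80% task length in minutes
```

With these inputs the gap is a factor of 4 in task length and 12 months in calendar time. Both numbers hold only if the two trends keep the same doubling rate.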
    - David Mathers🔸 20 Mar 2025 17:39 UTC
      4 points
      0 ∶ 0
      Parent
      The more task lengths the 80% threshold has to run through before it gets to a task length we’d regard as AGI-complete, though, the more the tasks at the end of the sequence differ from those at the beginning, and therefore the more likely it is that the doubling trend will break down somewhere along the sequence. That seems to me like the main significance of titotal’s point, not the time gained if we just assume the current 80% doubling trend will continue right to the end of the line. Plausibly, 30-second to minute-long tasks are more different from weeks-long tasks than 15-minute tasks are.
      - Ben_West🔸 20 Mar 2025 21:08 UTC
        4 points
        0 ∶ 0
        Parent
        So the claim is:
        1. The 50% trend will break down at some length of task $T$
        2. The 80% trend will therefore break at $T / 4$
        3. And maybe $T$ is large enough to cause some catastrophic risk, but $T / 4$ isn’t
        ?
        David Mathers🔸 20 Mar 2025 21:17 UTC
        4 points
        0 ∶ 0
        Parent
        Yes. (Though I’m not saying this will happen, just that it could, and that is more significant than a short delay.)
        Ben_West🔸 21 Mar 2025 1:09 UTC
        4 points
        1 ∶ 0
        Parent
        Fair enough! My guess is that when the trend breaks it will be because things have gone super-exponential rather than sub-exponential (some discussion here) but yeah, I agree that this could happen!
  - Mo Putera 20 Mar 2025 13:58 UTC
    10 points
    1 ∶ 0
    Parent
    Just so people know what you’re referring to, this is Figure 4:
    Ben West noted in the blog post that
    We think these results help resolve the apparent contradiction between superhuman performance on many benchmarks and the common empirical observations that models do not seem to be robustly helpful in automating parts of people’s day-to-day work: the best current models—such as Claude 3.7 Sonnet—are capable of some tasks that take even expert humans hours, but can only reliably complete tasks of up to a few minutes long.