Remember that this is graphing the length of task that the AI can complete with an over-50% success rate. The length of task an AI can do reliably is much shorter than what is shown here (see Figure 4 in the paper): for an 80% success rate it’s 30 seconds to a minute.
Being able to do a month’s worth of work at a 50% success rate would be very useful and productivity-boosting, of course, but would it really be close to recursive self-improvement? I don’t think so. I feel that some parts of complex projects need reliable code, and that will always be a bottleneck.
Figure 4 averages across all models. I think Figure 6 is more illuminating:
Basically, the 80% threshold is ~2 doublings behind the 50% threshold, or ~1 year. An extra year isn’t nothing! But you’re still not getting to 10+ year timelines.
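To make the arithmetic concrete, here is a minimal sketch of that lag calculation. The 6-month doubling time is just the value implied by “~2 doublings ≈ ~1 year” above, not a number taken from the paper, and the variable names are my own:

```python
# Illustrative only: how a fixed ~2-doubling lag between the 80% and 50%
# horizons translates into calendar time and into a ratio of task lengths.

doubling_time_months = 6      # assumed; implied by "~2 doublings ~ 1 year"
doublings_behind = 2          # 80% horizon trails the 50% horizon by ~2 doublings

calendar_lag_months = doublings_behind * doubling_time_months   # ~12 months
horizon_ratio = 2 ** doublings_behind                           # ~4x at any given date

print(f"calendar lag: ~{calendar_lag_months} months")
print(f"50% horizon / 80% horizon: ~{horizon_ratio}x")
# So if the 50% trend tops out at some task length T, the 80% horizon
# would be around T/4 at that point.
```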
The more task lengths the 80% threshold has to run through before it reaches a task length we’d regard as AGI-complete, though, the more different the tasks at the end of the sequence are from those at the beginning, and therefore the more likely it is that the doubling trend will break down somewhere along the sequence. That seems to me like the main significance of titotal’s point, not the time gained if we just assume the current 80% doubling trend will continue right to the end of the line. Plausibly, tasks of 30 seconds to a minute are more different from weeks-long tasks than 15-minute tasks are.
So the claim is:
1. The 50% trend will break down at some task length T.
2. The 80% trend will therefore break at T/4 (since it’s ~2 doublings behind).
3. And maybe T is large enough to cause some catastrophic risk, but T/4 isn’t.
?
Yes. (Though I’m not saying this will happen, just that it could, and that is more significant than a short delay.)
Fair enough! My guess is that when the trend breaks, it will be because things have gone super-exponential rather than sub-exponential (some discussion here), but yeah, I agree that this could happen!
Just so people know what you’re referring to, this is Figure 4:
Ben West noted in the blog post that
We think these results help resolve the apparent contradiction between superhuman performance on many benchmarks and the common empirical observations that models do not seem to be robustly helpful in automating parts of people’s day-to-day work: the best current models—such as Claude 3.7 Sonnet—are capable of some tasks that take even expert humans hours, but can only reliably complete tasks of up to a few minutes long.