What do you make of the fact that METR’s time horizon graph and METR’s study on AI coding assistants point in opposite directions? The graph says: exponential progress! Superhuman coders! AGI soon! Singularity! The study says: overhyped product category, useless tool, tricks people into thinking it helps them when it actually hurts them.
Pretty interesting, no?
Yep, I wouldn’t have predicted that. I guess the standard retort is: Worst case! Existing large codebase! Experienced developers!
I know there are software tools I use >once a week that wouldn’t have existed without AI models. They’re not very complicated, but they’d have been annoying to code up myself, and I wouldn’t have done it. I wonder whether there’s a slowdown in less harsh scenarios, but the value of information probably doesn’t justify running such a study.
I dunno. I’ve done a bunch of calibration practice[1]; this feels like a 30%, so I’m calling 30%. My probability went up recently, mostly because some subjectively judged capabilities that I was expecting didn’t start showing up.
My Metaculus calibration around 30% isn’t great (I’m overconfident there), and I’m trying to keep that in mind. My Fatebook record is slightly overconfident in that range, and who can tell with Manifold.
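For concreteness, here’s a minimal sketch of the kind of per-bucket calibration check that sites like Metaculus and Fatebook report: group resolved forecasts by stated probability and compare with how often they actually resolved yes. The binning, function name, and data below are illustrative placeholders, not either site’s actual methodology.

```python
# A minimal sketch of per-bucket calibration: group resolved forecasts by the
# nearest 10% bucket and compare stated probability with observed frequency.
# The forecasts below are made-up illustrative data.
from collections import defaultdict

def calibration_by_bucket(forecasts):
    """forecasts: iterable of (stated_probability, resolved_yes) pairs."""
    buckets = defaultdict(list)
    for prob, outcome in forecasts:
        buckets[round(prob * 10) / 10].append(outcome)
    return {
        bucket: (sum(outcomes) / len(outcomes), len(outcomes))
        for bucket, outcomes in sorted(buckets.items())
    }

# If forecasts stated around 30% resolve "yes" at a rate far from 30%,
# the forecaster is miscalibrated in that range.
example = [(0.30, 0), (0.28, 1), (0.33, 0), (0.30, 0), (0.32, 1), (0.27, 1)]
for bucket, (observed, n) in calibration_by_bucket(example).items():
    print(f"stated ~{bucket:.0%}: resolved yes {observed:.0%} of the time (n={n})")
```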
There’s a longer discussion to be had about that oft-cited METR time horizons graph, one that warrants a post of its own.
My problem with how people interpret the graph is that they slip quickly and wordlessly from step to step in a chain of inferences. The chain goes something like:
AI model performance on a set of very limited benchmark tasks → AI model performance on software engineering in general → AI model performance on everything humans do
I don’t think these inferences are justifiable.
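To make the first link in that chain concrete, here is a sketch of the extrapolation people typically run on the graph: assume the task time horizon keeps doubling at a fixed rate and project when it crosses some threshold. The starting horizon and doubling time below are illustrative placeholders rather than METR’s reported figures, and running the projection this way is exactly the move being questioned above.

```python
# A sketch of the naive extrapolation: assume the 50%-success task time horizon
# doubles at a fixed rate, and project when it crosses some threshold.
# The starting horizon and doubling time are illustrative, not METR's figures.
import math

def months_until_horizon(current_hours, target_hours, doubling_months):
    """Months until an exponentially doubling time horizon reaches target_hours."""
    return doubling_months * math.log2(target_hours / current_hours)

current = 1.0    # assumed current horizon: 1-hour tasks
doubling = 7.0   # assumed doubling time in months
for target_hours, label in [(40, "one work week"), (2000, "one work year")]:
    months = months_until_horizon(current, target_hours, doubling)
    print(f"{label} ({target_hours}h): ~{months:.0f} months, under these assumptions")
```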