On Tesla: I don’t think training a special model for expensive test cars makes sense. They’re not investing in a method that’s not going to be scalable. The relevant update will come when AI5 ships (end of this year reportedly), with ~9x the memory. I’d be surprised if they don’t solve it on that hardware.
On the broader point about predictions failing: I think these were mostly failures of economic reasoning more than failures of AI progress. AI has made enormous progress on both translation and radiology imaging. What Hinton and others got wrong wasn’t the capability prediction, it was assuming the job consisted entirely of the task AI was getting good at. Turns out radiologists do more than read images, translators do more than translate sentences, and AI ends up complementary rather than substitutive. Maybe some Jevons paradox at play too. Cheaper translation means more content gets translated, which means more demand for people who can review AI output, catch cultural nuance, and put their name on the final result.
The benchmarks aren’t perfect but they consistently point to rapid progress, from METR time horizons, SWE-bench, GDPval...
On the 90% prediction: my somewhat conservative view is that AI could write 90%+ of production code this year and will next year. But I don’t think this will mean immediate mass unemployment for programmers. The job will initially just shift toward review, specification, and technical direction of AI systems. I think most SWE jobs will look more like that of a technical PM next year.
Edit:
Customer support chat is another area with some applicability, but results are mixed. That study uses a fine-tuned GPT-3 model.
> On Tesla: I don’t think training a special model for expensive test cars makes sense. They’re not investing in a method that’s not going to be scalable. The relevant update will come when AI5 ships (end of this year reportedly), with ~9x the memory. I’d be surprised if they don’t solve it on that hardware.
Do you mean you think Tesla will immediately solve human-level fully autonomous (i.e. SAE Level 4 or Level 5) driving as soon as they deploy Hardware 5/AI5? Or that it will happen some number of years down the line?
Tesla presumably already has some small number of Hardware 5/AI5 units now. It knows what the specs for Hardware 5/AI5 will be. So, it can train a larger model (or set of models) now for that 10x more powerful hardware. Maybe it has already done so. I would imagine Tesla would want to be already testing the 10x larger model (or models) on the new hardware now, before the new hardware enters mass production.
If the 10x more powerful hardware were sufficient to solve full autonomy, Tesla should be able to demonstrate something impressive now with the new hardware units it presumably already has. Moreover, Tesla is massively incentivized to do so.
I don’t see any strong reason why 10x more powerful hardware or a 10x larger model (or set of models) would be enough to get the 100x or 1,000x or 10,000x (or whatever it is) boost in performance Tesla’s FSD software needs. The trend in scaling the compute and data used for neural networks is that performance tends to improve by less, proportionally, than the increase in compute and data. So, a 10x increase in compute or model size would tend to get less than a 10x increase in performance.
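To make that concrete, here’s a toy calculation under an assumed power-law relationship between compute and error. The exponent is purely illustrative; I’m not claiming it’s the right value for FSD, just showing how sublinear scaling plays out:

```python
# Toy illustration of sublinear scaling: if error ~ compute^(-alpha) with
# alpha < 1, a 10x compute increase buys much less than a 10x error reduction.
# The alpha value is purely illustrative, not an estimate for Tesla's models.
def error_after_scaling(base_error, compute_multiplier, alpha=0.3):
    return base_error * compute_multiplier ** (-alpha)

base = 1.0  # normalize the current error rate to 1
for mult in (10, 100, 1000):
    improved = error_after_scaling(base, mult)
    print(f"{mult:>5}x compute -> error {improved:.2f} ({base / improved:.1f}x better)")
# With alpha = 0.3: 10x compute gives ~2x lower error; 1,000x gives only ~8x.
```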
But if it is true that the 10x more powerful hardware is sufficient to solve the remainder of the problem, Tesla would have compelling evidence of that by now, or would easily be able to obtain that evidence. I think Tesla would be eager to show that evidence off if it had it, or knew it could get it.
> What Hinton and others got wrong wasn’t the capability prediction, it was assuming the job consisted entirely of the task AI was getting good at. Turns out radiologists do more than read images, translators do more than translate sentences, and AI ends up complementary rather than substitutive.
I’ve seen some studies that have found AI models simply underperform human radiologists, although the results are mixed. More importantly, even the favorable results come from clean, simplified benchmarks, and those benchmarks don’t generalize well to real-world conditions.
I haven’t spent much time looking into studies on human translation vs. post-LLM machine translation. However, I found one study of GPT-4o, open source LLMs, Google Translate, and DeepL that found (among other things):
> LLMs still need to address the issue of overly literal outputs, and a substantial gap remains between LLM and human quality in literary translation, despite the clear advancements of recent models.
Since studies take so long to conduct, write up, and get published, we will tend to see studies lagging behind the latest versions of LLMs. That’s a drawback, but I don’t know of a better way to get this kind of high-quality data and analysis. More up-to-date information, like firm-level data or other economic data, is more recent but doesn’t tell as much about the why.
Consulting firms like McKinsey release data based on interviews with people in management positions at companies; I don’t think they’ve specifically covered radiology or translation, but you might be able to find similar reports for those domains based on interviews. This is another way to get more up-to-date information, but interviews or surveys have drawbacks relative to academic studies.
> The benchmarks aren’t perfect but they consistently point to rapid progress, from METR time horizons, SWE-bench, GDPval...
Performance on these benchmarks doesn’t generalize very well to real-world performance. I think “aren’t perfect” is an understatement.
There is much to criticize about the way the METR time horizons graph, specifically, has been interpreted. It’s not clear how much METR is responsible for this interpretation; sometimes people at METR give good caveats, sometimes they don’t. In any case, the graph only says something very narrow and contrived, and it doesn’t necessarily tell us much about how good AI is at coding in a practical, realistic, economic sense (or how good it will be in a year or two).
> On the 90% prediction: my somewhat conservative view is that AI could write 90%+ of production code this year and will next year.
I very much doubt AI will write 90% of production code by December 2027. But already, you seem to be pushing out the timeline. You started by saying Dario Amodei was “off by a few months” in his prediction that 90% of code would be AI-written by mid-September 2025. (It’s already been nearly 4 months since then.) Pushing out the timeline into 2027 makes him off by at least 1 year and 3 months. If the timeline is late 2027, then he’s off by at least 2 years.
Re Tesla: My best guess is that they still need a 5-20x improvement in reliability to match human level, and I don’t entirely rule out that they’ll manage it with AI4. Hard to get good data on this though. It sounds like they were still finalizing the AI5 chip design until quite recently, and I’m not sure it makes sense to spend training budget on a model that can only run on a handful of cars while they’re still hoping to squeeze more out of AI4. There’s likely an inference overhang here. They’ve spent years scaling training data and compute while model size stayed fixed, way, way past the usual theoretical optimal tradeoff point. Lifting the size constraint will probably yield disproportionate gains.
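As a very rough sketch of what I mean by overhang, borrowing the Chinchilla-style rule of thumb of ~20 training tokens per parameter from LLM scaling work (all numbers here are hypothetical; Tesla’s actual model sizes and data volumes aren’t public, and the heuristic may not transfer to driving models):

```python
# Back-of-the-envelope sketch of the "inference overhang" idea. The
# ~20 tokens-per-parameter heuristic comes from LLM scaling work; the
# specific figures below are hypothetical placeholders, not Tesla data.
def compute_optimal_params(training_tokens, tokens_per_param=20):
    return training_tokens / tokens_per_param

fixed_params = 1e9       # hypothetical: model size pinned by in-car hardware
training_tokens = 1e12   # hypothetical: training data keeps growing each year

optimal = compute_optimal_params(training_tokens)
print(f"fixed model size:     {fixed_params:.0e} params")
print(f"compute-optimal size: {optimal:.0e} params")
print(f"overhang factor:      {optimal / fixed_params:.0f}x")
# If data has grown far past what the fixed-size model can absorb, lifting
# the size constraint should recover a disproportionate share of the gains.
```

If something like that is going on, a one-time jump in model size could buy more than the usual smooth scaling curve would suggest.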
Re the middle stuff: I think we just disagree on how to weigh various evidence from failed predictions (based on narrow models, older models...), various firsthand reports and more recent benchmark results.
Re the 90% prediction: by “conservative” I meant this is like my 5-10th percentile slowest timeline. I’ve heard from a number of SWEs that they’re already basically not writing code, just instructing and reviewing. I’m also uncertain about adoption speed. I’d put it at >50% chance that among SWEs actually using the latest LLMs and tools, AI writes 90%+ of their code in the first half of this year.
Where do you get that 5-20x figure from?

I recall Elon Musk once said the goal was to get to an average of one intervention per million miles of driving. I think this is based on the statistic of one crash per 500,000 miles on average.
I believe interventions currently happen more than once per 100 miles on average. If so, and if one intervention per million miles is what Tesla is indeed targeting, then Tesla is more than 10,000x off from its goal.
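To spell out the arithmetic (using the rough figures above, not official Tesla data):

```python
# The gap between the reported target and the rough current intervention rate.
target_miles_per_intervention = 1_000_000   # reported goal: one per million miles
current_miles_per_intervention = 100        # rough current estimate from above

gap = target_miles_per_intervention / current_miles_per_intervention
print(f"reliability gap: {gap:,.0f}x")      # -> 10,000x
```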
There are other ways of measuring Tesla’s FSD software’s performance compared to average human driving performance and getting another number. I am skeptical it would be possible to use real, credible numbers and come to the conclusion that Tesla is currently less than 100x away from human-level driving.
I very much doubt that Hardware 5/AI5 is going to provide what it takes for Tesla to achieve SAE Level 4/5 autonomy at human-level or better performance, or that Tesla will achieve that goal (in any robust, meaningful sense) within the next 2 years. I still think what I said is true: Tesla, internally, would have evidence of this if it were true (or would be capable of obtaining it), and would be incentivized to show off that evidence.
Andrej Karpathy understands this topic better than almost anyone else in the world, and he is clear that he thinks fully autonomous driving is not solved (at Tesla, Waymo, or elsewhere) and there’s a long way to go still. There’s good reason to listen to Karpathy on this.
I also very much doubt that the best AI models in 2 years will be capable of writing 90% of commercial, production code, let alone that this will happen within six months. I think there’s essentially no chance of this happening in 2026. As far as I can see, there is no good evidence currently available that would suggest this is starting to happen or should be possible soon. Extrapolating from performance on narrow, contrived benchmark tasks to real world performance is just a mistake. And the evidence about real world use of AI for coding does not support this.