This apparently isn’t true for autonomous driving[1] and it’s probably even less true in a lot of other domains. If an AI system can’t respond well to novelty, it can’t function in the world, because novelty occurs all the time. For example, how can AI automate the labour of scientists, philosophers, and journalists if it can’t understand novel ideas?

Edited to add on October 20, 2025 at 12:23pm Eastern: Don’t take my word for it. Andrej Karpathy, an AI researcher formerly at OpenAI who led Tesla’s autonomous driving AI from 2017 to 2022, recently said on a podcast that he doesn’t think fully autonomous driving is nearly solved yet:

…self-driving cars are nowhere near done still. The deployments are pretty minimal. Even Waymo and so on has very few cars. … Also, when you look at these cars and there’s no one driving, I actually think it’s a little bit deceiving because there are very elaborate teleoperation centers of people kind of in a loop with these cars. I don’t have the full extent of it, but there’s more human-in-the-loop than you might expect. There are people somewhere out there beaming in from the sky. I don’t know if they’re fully in the loop with the driving. Some of the time they are, but they’re certainly involved and there are people. In some sense, we haven’t actually removed the person, we’ve moved them to somewhere where you can’t see them.
For autonomous driving, current approaches which “can’t deal with novelty” are already far safer than human drivers.
Safety is only one component of overall driving competence. A parked car is 100% safe. Even if it is true that autonomous cars are safer than human drivers, they aren’t as competent as human drivers overall.
Incidentally, I’m pretty familiar with the autonomous driving industry and I’ve spent countless hours looking into such claims. I even once paid someone with a PhD in a relevant field to help me analyze some data to try to come to a conclusion. (The result was that there wasn’t enough data to draw a conclusion.) What I’ve found is that autonomous driving companies are incredibly secretive about the data they keep on safety and other kinds of driving performance. They have aggressive PR and marketing, but they won’t actually publish the data that would allow third parties to independently audit how safe their AI vehicles are.
Besides just not having the data, there are the additional complications of 1) aggressive geofencing to artificially constrain the problem and make it easier (just like a parked car is 100% safe, a car slowly circling a closed track would also be almost 100% safe) and 2) humans in the loop, either physically inside the car or remotely.[1]
The most important thing to know is that you can’t trust these companies’ PR and marketing. Autonomous vehicle companies will be happy to say their cars are superhuman right up until the day they announce they’re shutting down. It’s like Soviet propagandists saying communism is going great in 1988. But also, no, you can’t look at their economic data.
Edited on October 20, 2025 at 12:35pm Eastern to add: See the footnote added to my comment above for Andrej Karpathy’s recent comments on this.
You’re right, they made the problem easier with geofencing, but the data from Waymo isn’t ambiguous and, despite your previous investigations, it is now published: https://storage.googleapis.com/waymo-uploads/files/documents/safety/Safety%20Impact%20Crash%20Type%20Manuscript.pdf
This example makes it clear that the approach works to automate significant human labor, with some investment, without solving AGI.
I’ll have to look at that safety report later and see what the responses are to it. At a glance, this seems to be a bigger and more rigorous disclosure than what I’ve seen previously and Waymo has taken the extra step of publishing in a journal.
[Edit, added on October 20, 2025 at 12:40pm Eastern: There are probably going to be limitations with any safety data and we shouldn’t expect perfection, nor should that get in the way of us lauding companies for being more open with their safety data. However, just one thing to think about: if autonomous vehicles are geofenced to safer areas but they’re being compared to humans driving in all areas, ranging from the safest to the most dangerous, then this isn’t a strict apples-to-apples comparison.]
However, I’m not ready to jump to any conclusions just yet. It was a similar report by Waymo (not published in a journal, however) that I paid someone with a PhD in a relevant field to help me analyze and, despite that report initially looking promising and interesting to me, that person’s conclusion was that there was not enough data to determine one way or the other whether Waymo’s autonomous vehicles were actually safer than the average human driver.
I was coming at that report from the perspective of wanting it to show that Waymo’s vehicles were safer than human drivers (although I didn’t tell the person with the PhD that because I didn’t want to bias them). I was disappointed that the result was inconclusive.
If it turns out Waymo’s autonomous vehicles are indeed safer than the average human driver, I would celebrate that. Sadly, however, it would not really make me feel more than marginally more optimistic about the near-term prospects of autonomous vehicle technology for widespread commercialization.
The bigger problem for this overall argument about autonomous vehicles (that they show data efficiency or the ability to deal with novelty isn’t important) is that safety is only one component of competence (as I said, a parked car is 100% safe) and autonomous vehicles are not as competent as human drivers overall. If they were, there would be a huge commercial opportunity in automating human driving in a widespread fashion — by some estimations, possibly the largest commercial opportunity in the history of capitalism. The reason this can’t be done is not regulatory or social or anything like that. It’s because the technology simply can’t do the job.
The technology as it’s deployed today is not only helped along by geofencing, it’s also supported by a high ratio of human labour to the amount of autonomous driving. That’s not only safety drivers in the car or remote monitors and operators, but also engineers doing a lot of special casing for specific driving environments.
If you want to use autonomous vehicles as an example of AI automating significant human labour, first they would have to automate significant human labour — practically, not just in theory — but that hasn’t happened yet.
Moreover, driving should, at least in theory, be a low bar. Driving is considered to be routine, boring, repetitive, not particularly complex — exactly the sort of thing we would think should be easier to automate. So, if approaches to AI that have low data efficiency and don’t deal well with novelty can’t even handle driving, then it stands to reason that more complex forms of human labour such as science, philosophy, journalism, politics, economics, management, social work, and so on would be even less susceptible to automation by these approaches.
Just to be clear on this point: if we had a form of AI that could drive cars, load dishwashers, and work an assembly line but not do those other things (like science, etc.), I think that would be wonderful and it would certainly be economically transformative, but it wouldn’t be AGI.
Edited to add on October 20, 2025 at 12:30pm Eastern: Don’t take my word for it. See Andrej Karpathy’s recent podcast comments, quoted in my comment above, in which he says he doesn’t think fully autonomous driving is nearly solved yet.
The bar is much lower because they are 100x faster and 1000x cheaper than me. They open up a bunch of brute-forceable techniques, in the same way that you can open up https://projecteuler.net/ and solve many of Euler’s discoveries with little math knowledge but basic Python and for loops.
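For what it’s worth, the Project Euler point can be made concrete. Here is Problem 1 (the sum of all multiples of 3 or 5 below 1000) solved by sheer enumeration, with no number theory at all:

```python
# Project Euler, Problem 1: sum of all multiples of 3 or 5 below 1000.
# A closed-form solution exists (inclusion-exclusion over arithmetic series),
# but a single brute-force loop reaches the same answer with no math insight.
total = sum(n for n in range(1000) if n % 3 == 0 or n % 5 == 0)
print(total)  # 233168
```

The closed-form route needs a small idea; the loop needs none. That is the trade being described: speed and cheapness substituting for insight.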
Math → re-read every arXiv paper → translate them all into Lean → aggregate every open, well-specified math problem → use the database of all previous learnings to see if you can chain chunks of previous problems together to solve them.
Clinical medicine → re-read every RCT ever done and comprehensively rank intervention effectiveness by disease → find cost data where available and rank the cost/QALY of the whole disease/intervention space.
Econometrics → aggregate every natural experiment and instrumental variable ever used in an econometrics paper → think about other use cases for these tools → check whether those other use cases have available data → reapply the general theory of the original paper with the new data.
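As a toy sketch of the last step of the clinical-medicine pipeline, assuming the hard extraction work (pulling effect sizes and costs out of RCTs) has already been done, the final ranking itself is mechanical. All names and numbers below are made up for illustration:

```python
# Hypothetical extracted data: incremental cost and QALYs gained per intervention.
interventions = [
    {"name": "statin therapy",      "cost": 1500.0,  "qalys_gained": 0.10},
    {"name": "hip replacement",     "cost": 15000.0, "qalys_gained": 2.00},
    {"name": "novel cancer drug X", "cost": 90000.0, "qalys_gained": 0.60},
]

def cost_per_qaly(item: dict) -> float:
    """Incremental cost-effectiveness ratio: dollars per QALY gained."""
    return item["cost"] / item["qalys_gained"]

# Most cost-effective interventions first.
ranked = sorted(interventions, key=cost_per_qaly)
for item in ranked:
    print(f"{item['name']}: ${cost_per_qaly(item):,.0f}/QALY")
```

The aggregation is the easy part; the claim at issue is whether an LLM can reliably do the reading and extraction that produces the table in the first place.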
I’m not sure if I understand what you’re arguing.
First, why do you think LLMs haven’t already done any of these things?
Second, even if LLMs could do these things, they couldn’t automate all of human labour, and this isn’t an argument that they could. This is an argument that LLMs could do some really useful things, not that they could do all the useful things that human workers do.
Unless, I guess, if you think there’s no such thing as something so novel it can’t be understood by LLMs based on existing knowledge, but then this would be equivalent to arguing that LLMs have or will have a very high level of data efficiency.
I’m fleshing out Nuno’s point a bit. Basically, AIs have so many systematic advantages in their cost/speed/seamless integration into the digital world that they can afford to be worse than humans at a variety of things and still automate (most/all/some) work, just as a plane doesn’t need to flap its wings. Of course, I wasn’t saying I solved automating the economy. I’m just showing you ways in which something lacking some top-level human common sense/IQ/whatever could still replace people.
FWIW I basically disagree with every point you made in the summary. This mostly just comes from using these tools every day and getting utility out of them + seeing how fast they are improving + seeing how many different routes there are to improvement (I was quite skeptical a year ago, not so anymore). But I wanted to keep the argument contained and isolate a point of disagreement.
I want to try to separate out a few different ideas because I worry they might get confused together.
Are actual existing LLMs good at discovering novel ideas? No. They haven’t discovered anything useful in any domain yet. They haven’t come up with any interesting new idea in science, math, economics, medicine, or anything.
Could LLMs eventually discover novel ideas in the way you described? I don’t think so. I think you’re saying you think this will happen. Okay, so, why? What are LLMs missing now that they will have in, say, 5 years that will mean they make the jump from zero novel ideas to lots of novel ideas? Is it just scale?
Would an AI system that can’t learn new ideas from one example or a few examples count as AGI? No, I don’t think so.
Would an AI system that can’t learn new ideas from one example or a few examples be able to automate all human labour? No, I don’t think so because this kind of learning is part of many different jobs, such as scientist, philosopher, and journalist, and also taxi driver (per the above point about autonomous vehicles).
I do use ChatGPT every day and find it to be a useful tool for what I use it for, which is mainly a form of search engine. I used ChatGPT when it first launched, as well as GPT-4 when it first launched, and have been following the progress.
Everything is relative to expectations. If I’m judging ChatGPT based on the expectations of a typical consumer tech product, or even a cool AI science experiment, then I do find the progress impressive. On the other hand, if I’m judging ChatGPT as a potential precursor to AGI, I don’t find the progress particularly impressive.
I guess I don’t see the potential routes to improvement that you see. The ones that I’ve seen discussed don’t strike me as promising.
https://x.com/slow_developer/status/1979157947529023997
I would bet a lot of money you are going to see exactly what I described for math in the next two years. The capabilities literally just exploded. It took us like 20 years to start using the lightbulb, but you are expecting results from products that came out in the last few weeks/months.
I can also confidently say, because I am working on a project with doctors, that the work I described for clinical medicine is being tested and is happening right now. Its exact usefulness remains to be seen, but people are trying exactly what I described; there will be some lag as people need to learn how to use the tools best and then distribute their results.
Again, I don’t think most of this stuff was particularly useful with the tools available to us more than a year ago.
>Would an AI system that can’t learn new ideas from one example or a few examples count as AGI?
https://www.anthropic.com/news/skills
You are going to need to be a lot more precise in your definitions, imo, otherwise we are going to talk past each other.
The math example you cited doesn’t seem to be an example of an LLM coming up with a novel idea in math. It just sounds like mathematicians are using an LLM as a search tool. I agree that LLMs are really useful for search, but this is a far cry from an LLM actually coming up with a novel idea itself.
The point you raise about LLMs doing in-context learning is ably discussed in the video I embedded in the post.
“novel idea” means almost nothing to me. A math proof is simply a->b. It doesn’t matter how you figure out a->b. If you can figure it out by reading 16 million papers and clicking them together that still counts. There are many ways to cook an egg.
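For what it’s worth, the “clicking together” picture has a literal counterpart in a proof assistant. As a minimal sketch in Lean, composing two previously established implications yields a new theorem with no new idea involved:

```lean
-- Two known results, "clicked together": if a → b and b → c are already
-- in the library, then a → c follows by pure composition.
theorem chain {a b c : Prop} (hab : a → b) (hbc : b → c) : a → c :=
  fun ha => hbc (hab ha)
```

The composition step itself is mechanical; whether reading 16 million papers lets a model find long chains like this at scale is the open question.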
I don’t think the LLMs in this case are clicking them together. Rather, it seems like the LLMs are being used as a search tool for human mathematicians who are clicking them together.
If you could give the LLM a prompt along the lines of, “Read the mathematics literature and come up with some new proofs based on that,” and it could do it, then I would count that as an LLM successfully coming up with a proof, and with a novel idea.
Based on the tweets you linked to, what seems to be happening is that the LLMs are being used as a search tool like Google Scholar, and it’s the mathematicians coming up with the proofs, not the search engine.
Sure, that’s a fair point. I guess I hope that, after this thread, you’d feel at least a little pushed in the direction of thinking that AIs need not take a route similar to humans’ to automate large amounts of our current work.
LLMs may have some niches in which they enhance productivity, such as by serving as an advanced search engine or text search tool for mathematicians. This is quite different from AGI and quite different from either:
a) LLMs having a broad impact on productivity across the economy (which would not necessarily amount to AGI but which would be economically significant)
or
b) LLMs fully automating jobs by acting autonomously and doing hierarchical planning over very long time horizons (which is the sort of thing AGI would have to be capable of doing to meet the conventional definition of AGI).
If you want to argue LLMs will get from their current state where they can’t do (a) or (b) to a state where they will be able to do (a) and/or (b), then I think you have to address my arguments in the post about LLMs’ apparent fundamental weaknesses (e.g. the Tower of Hanoi example seems stark to me) and what I said about the obstacles to scaling LLMs further (e.g. Epoch AI estimates that data may run out around 2028).