I agree that whether or not we get AGI is a crux for this topic. Though it makes sense to update our cause priorities even if AI is merely transformational.
Your comment seems overconfident, however (“essentially no chance of AGI”). It doesn’t seem to take into account that many (most?) intellectual tasks are seeing progress. For example, ARC-AGI-2 had scores below 10% at the beginning of the year, and within just a few months the best solution on https://arcprize.org/leaderboard scores 29%. Even publicly available models without custom scaffolding now score above 10%.
Of course, there could be a plateau… and I hear ARC-AGI-3 is in the works… but I don’t understand your high confidence despite AI’s seemingly consistent rise in all the tests that humanity throws at it.
Forgive me for the very long reply. I’m sure that you and others on the EA Forum have heard the case for near-term AGI countless times, often at great depth, but the opposing case is rarely articulated in EA circles, so I wanted to do it justice in a way that a tweet-length reply could not.
Why does the information we have now indicate AGI within 7 years and not, say, 17 years or 70 years or 170 years? If progress in science and technology continues indefinitely, then eventually we will gain the knowledge required to build AGI. But when is eventually? And why would it be so incredibly soon? To say that some form of progress is being made is not the same as making an argument for AGI by 2032, as opposed to 2052 or 2132.
I wouldn’t say that LLM benchmarks accurately represent what real intellectual tasks are actually like. First, the benchmarks are designed to be solvable by LLMs because they are primarily intended to measure LLMs against each other and to measure improvements in subsequent versions of the same LLM model line (e.g. GPT-5 vs GPT-4o). There isn’t much incentive to create LLM benchmarks where LLMs stagnate around 0%.[1]
Even ARC-AGI 1, 2, and 3, which are an exception in terms of their purpose and design, are still intended to sit in the sweet spot between too easy to be a real challenge and too hard to show progress on. If a benchmark is trivial or impossible to solve, it won’t encourage AI researchers and engineers to try hard to solve it and improve their models in the process. The intention of the ARC-AGI benchmarks is to give people working on AI a shared point of focus and a target to aim for. The purpose is not to make a philosophical or scientific point about what current AI systems can’t do. The benchmarks are designed to be solvable by current AI systems.
It always bears repeating, since confusion is possible, that the ARC-AGI benchmarks are not intended to test whether a system is AGI or not, but are rather intended to test whether AI systems are making progress toward AGI. So, getting 95%+ on ARC-AGI-2 would not mean AGI is solved, but it would be a sign of progress — or at least that is the intention.
Second, for virtually all LLM tests or benchmarks, the definition of success or failure on a task has to be reduced to something simple enough that software can grade it automatically. This is a big limitation.
When I think about the sort of intellectual tasks that humans do, not a lot of them can be graded automatically. Of course, there are written exams and tests with multiple choice answers, but these are primarily tests of memorization. Don’t get me wrong, it is impressive that LLMs can memorize essentially all text ever written, but memorization is only one aspect of intelligence. We want AI systems that go beyond just memorizing things from huge numbers of examples and can also solve completely novel problems that aren’t a close match for anything in their training dataset. That’s where LLMs are incredibly brittle and start just generating nonsense, saying plainly false (and often ridiculous) things, contradicting themselves, hallucinating, etc. Some great examples are here, and there’s also an important discussion of how these holes in LLM reasoning get manually patched by paying large workforces to write new training examples specifically to fix them. This creates an impression of increased intelligence, but the improvement isn’t from scaling in these cases, it’s from large-scale special casing.
I think the most robust tests of AI capabilities are tasks that have real world value. If AI systems are actually doing the same intellectual tasks as human beings, then we should see AI systems either automating labour or increasing worker productivity. We don’t see that. In fact, I’m aware of two studies that looked at the impact of AI assistance on human productivity. One study on customer support workers found mixed results, including a negative impact on productivity for the most experienced employees. Another study, by METR, found a 19% reduction in productivity when coders used an AI coding assistant.
In industry, non-AI companies that have invested in applying AI to the work they do are not seeing much payoff. There might be modest benefits in some niches; I’m sure there are at least a few. But LLMs are not going to be transformational to the economy, let alone automate all office work.
Personally, I find ChatGPT extremely useful as an enhanced search engine. I call it SuperGoogle. If I want to find a news article or an academic paper or whatever about a very specific topic, I can ask GPT-5 Thinking or o3 to go look for anything like that. For example, I can say, “Find me any studies that have been published comparing the energy usage of biological systems to technological systems, excluding brains and computers.” It often gives me some stuff that isn’t helpful, but it digs up a really helpful link often enough that it’s still a very useful tool overall. I don’t know how much time this saves me Googling, but it feels useful. (It’s possible that, like the AI coders in the METR study, I’m falling prey to the illusion that it’s saving me time when actually, on net, it wastes time, but who knows.)
This is a genuine innovation. Search engines are an important tool and such a helpful innovation on the search engine is a meaningful accomplishment. But this is an innovation on the scale of Spotify allowing us to stream music, rather than something on the scale of electricity or the steam engine or the personal computer. Let alone something as revolutionary as the evolution of the human prefrontal cortex.
If LLMs were genuinely replicating human intelligence, we would expect to see an economic impact beyond the impact of investment. Investment is certainly having an impact, but, as the economist John Maynard Keynes said, if you pay enough people to dig holes and then fill the same holes up with dirt again, that stimulus will impact the economy (and may even get you out of a recession). What economic impact is AI having over and above the impact the same capital would have had if it were spent paying people to dig holes and fill them back up? From the data I’ve seen, the impact is quite modest, and a huge amount of capital has been wasted. I think within a few years many people will probably see the recent AI investment boom as just a stunningly bad misallocation of capital.[2]
People draw analogies to the transcontinental railway boom and the dot-com bubble, but they also point out that railways and fibre-optic cable depreciate at a much slower rate than GPUs. Different companies calculate GPU depreciation at different rates, typically ranging from 1 year to 6 years. Data centres have non-GPU components, like the buildings and the power connections, that are more durable, but the GPUs account for more than half of the costs. So, overbuilding capacity for demand that doesn’t ultimately materialize would be extremely wasteful. Watch this space.[3]
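To make the depreciation point concrete, here is a toy straight-line calculation. The $30,000 price tag is a hypothetical round number for illustration, not a figure from any company’s filings:

```python
# Toy illustration: how much the assumed useful life of a GPU changes the economics.
# The purchase price is a made-up round number, not a real quote.

def annual_straight_line_depreciation(purchase_price: float, useful_life_years: int) -> float:
    """Annual write-down under simple straight-line depreciation."""
    return purchase_price / useful_life_years

gpu_price = 30_000.0  # hypothetical per-GPU cost
for life_years in (1, 3, 6):
    writedown = annual_straight_line_depreciation(gpu_price, life_years)
    print(f"{life_years}-year schedule: ${writedown:,.0f} written off per year")

# A 1-year schedule writes off the full cost every year; a 6-year schedule only a
# sixth of it, so the same hardware has six times longer to earn back its cost.
```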
If you think an LLM scoring more than 100 on an IQ test means it’s AGI, then we’ve had AGI for several years. But clearly there’s a problem with that inference, right? Memorizing the answers to IQ tests, or memorizing similar answers to similar questions that you can interpolate, doesn’t mean a system actually has the kind of intelligence to solve completely novel problems that have never appeared on any test, or in any text. The same general critique applies to the inference that LLMs are intelligent from their results on virtually any LLM benchmark. Memorization is not intelligence.
If we instead look at performance on practical, economically valuable tasks as the test for AI’s competence at intellectual tasks, then its competence appears quite poor. People who make the flawed inference from benchmarks just described say that LLMs can do basically anything. If they instead derived their assessment from LLMs’ economic usefulness, it would be closer to the truth to say LLMs can do almost nothing.
There is also some research on non-real world tasks that supports the idea that LLMs are mass-scale memorizers with a modicum of interpolation or generalization to examples similar to what they’ve been trained on, rather than genuinely intelligent (in the sense that humans are intelligent). The Apple paper on “reasoning” models found surprisingly mixed results on common puzzles. The finding that sticks out most in my mind is that the LLM’s performance on the Tower of Hanoi puzzle did not improve after being told the algorithm for solving the puzzle. Is that real intelligence?
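For concreteness, the algorithm in question is the textbook recursive procedure, which fits in a few lines. Here is a sketch in Python (the standard solution, not necessarily the exact form the paper supplied to the models):

```python
def hanoi(n: int, source: str, target: str, spare: str, moves: list) -> None:
    """Standard recursive Tower of Hanoi: move n disks from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # park the top n-1 disks on the spare peg
    moves.append((source, target))              # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)  # stack the n-1 disks back on top of it

moves: list = []
hanoi(3, "A", "C", "B", moves)
print(len(moves), moves)  # 2**3 - 1 = 7 moves
```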
It’s possible, at least in principle (not sure it often happens in practice), to acknowledge these flaws in LLMs and still believe in near-term AGI. If there’s enough progress in AI fast enough, then we could have AGI within 7 years. This is true, but it was also true ten years ago. When AlphaGo beat Lee Sedol in 2016, you could have said we’ll have AGI within 7 years — because, sure, being superhuman at go isn’t that close to AGI, but look at how fast the progress has been, and imagine how fast the progress will be![4] If you think it’s just a matter of scaling, then I could understand how you would see the improvement as predictable. But I think the flaws in LLMs are inherent to LLMs and can’t be solved through scaling. The video from AI researcher Edan Meyer that I linked to in my original comment makes an eloquent case for this. As does the video with François Chollet.
There are other problems with the scaling story:
There is evidence that scaling LLMs is running out of steam. Toby Ord’s interview on the 80,000 Hours podcast in June covered this topic really well. Renowned AI researcher Ilya Sutskever, formerly chief scientist at OpenAI (prior to voting to fire Sam Altman), has said he thinks the benefits from scaling LLM pre-training have plateaued. There have been reports that, internally, employees at AI labs are disappointed with their models’ progress. GPT-5 doesn’t seem like that much of an improvement over previous models.
There are practical limits to scaling up, even if the benefits to scaling weren’t diminishing. Epoch AI’s median estimate of when LLMs will run out of data to train on is 2028. Epoch AI also predicts that compute scaling will slow down mainly due to financial and economic considerations.
The benefits to scaling are diminishing and, at the same time, data scaling and compute scaling may have to slow down sometime soon (if this is not already happening).
If you expand the scope of LLM performance beyond written prompts and responses to “agentic” applications, I think LLMs’ failures are more stark and the models do not seem to be gaining mastery of these tasks particularly quickly. Journalists generally say that companies’ demos of agentic AI don’t work.
I don’t expect that performance on agentic tasks will rapidly improve. To train on text-based tasks, AI labs can get data from millions of books and large-scale scrapes of the Internet. There aren’t similarly sized datasets for agentic tasks. In principle, you can use pure reinforcement learning without bootstrapping from imitation learning, but while this approach has succeeded in domains with smaller spaces of possible actions like go, it has failed in domains with larger spaces of possible actions like StarCraft. I don’t think agentic AI will get particularly better over the next few years. Also, the current discrepancy between LLM performance on text-based tasks and agentic tasks tells us something about whether LLMs are genuinely intelligent. What kind of PhD student can’t use a computer?
So, to briefly summarize the core points of this very long comment:
LLM benchmarks don’t really tell us how genuinely intelligent LLMs are. They are designed to be easy for LLMs and to be automatically graded, which limits what can be tested.
On economically valuable tasks in real world settings, which I believe are much better tests than benchmarks, LLMs do quite poorly. Not only does this make near-term AGI seem very unlikely, it also makes economically transformative AI in the near term seem very unlikely.
LLMs fail all the time at tasks we would not expect them to fail at if they were genuinely intelligent, as opposed to relying on mass-scale memorization.
Scaling isn’t a solution to the fundamental flaws in LLMs and, in any case, the benefits of scaling are diminishing at the same time that LLM companies are encountering practical limits that may slow compute scaling and slow or even stop data scaling.
LLMs are terrible at agentic tasks and there isn’t enough training data for them to improve, if training data is what it takes. If LLMs are genuinely intelligent, we should ask why they can’t learn agentic tasks from a small number of examples, since this is what humans do.
Maybe it’s worth mentioning the very confusing AI Impacts survey conducted in 2022, where the surveyors gave 2,778 AI researchers essentially two different descriptions of an AI system, each of which could be construed as an AGI, and which could also be construed as equivalent to each other (I don’t know why they designed the survey like this). Aggregating the AI researchers’ replies, they found a 50% chance of AGI by 2047 (and a 10% chance by 2027) under one definition, and a 50% chance of AGI by 2116 (and a 10% chance by 2037) under the other.
[Important correction: this is actually the 2023 AI Impacts survey, which was conducted in October 2023, seven months after the release of GPT-4 in March 2023.
This correction was added on October 28, 2025 at 10:31 AM Eastern.]
In 2022, there was also a survey of superforecasters with a cleaner definition of AGI. They, in aggregate, assigned a 1% chance of AGI by 2030, a 21% chance by 2050, a 50% chance by 2081, and a 75% chance by 2100.
Both the AI Impacts survey and the superforecaster survey were conducted before the launch of ChatGPT. I would guess ChatGPT would probably have led them to shorten their timelines, but if LLMs are more or less a dead end, as people like the Turing Award winners Yann LeCun and Richard Sutton have argued,[5] then this would be a mistake. (In a few years, if things go the way I expect, meaning that generative AI turns out to be completely disappointing and this is reflected in finance and the economy, then I would guess the same people would lengthen their timelines again.) In any case, it would be interesting to run the surveys again now. [See the correction above. The superforecaster survey was conducted before the release of ChatGPT, but the survey of AI experts was conducted after the release of GPT-4.]
I think these surveys might be useful to bring up just to disrupt the impression some people in EA might have that there is an expert consensus that near-term AGI is likely. I imagine that even if these surveys were re-run now, we would still see a small chance of AGI by 2032. Strongly held belief in near-term AGI exists in a bit of a bubble or echo chamber, and if you’re in the minority on an issue among well-informed people, that can stimulate some curiosity about why so many people disagree with you.
In truth, I don’t think we can predict when a technology will be invented, particularly when we don’t understand the science behind it. I am highly skeptical that we can gain meaningful knowledge by just asking people to guess a year. So, it really is just to stimulate curiosity.
There are a lot of strong, substantive, well-informed arguments against near-term AGI and against the idea that LLMs will scale to AGI. I find it strange how little I see people in EA engage with these arguments or even know what they are. It’s weird to me that a lot of people are willing to, in some sense, stake the reputation of EA and, to some degree, divert money away from GiveWell-recommended charities without, as far as I’ve seen, much in the way of considering opposing viewpoints. It seems like a lack of due diligence.
However, it would be easy to do so, especially if you’re willing to do manual grading. Task an LLM with making stock picks that achieve alpha; you could grade that automatically. Try to coax LLMs into coming up with a novel scientific discovery or theoretical insight. Despite trillions of tokens generated, it hasn’t happened yet. Tasks related to computer use and “agentic” use cases are also sure to lead to failures. For example, have a model play a video game it’s never seen before (e.g. because the game just came out) or, if the game is slow-paced enough, simply give you instructions on how to play. You can abstract out the computer vision aspect of these tests if you want, although it’s worth asking how we’re going to have AGI if it can’t see.
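To illustrate how the stock-picking version could be auto-graded: score the picks by their excess return over an index across the same window. A rough sketch follows; the tickers and prices are made up purely to show the mechanics, and a real grader would need point-in-time price data to avoid lookahead bias:

```python
# Hypothetical illustration of automatic grading for the stock-picking task:
# score an LLM's picks by their excess return over a benchmark index.

def simple_return(prices: list[float]) -> float:
    """Total return over the period implied by a price series."""
    return prices[-1] / prices[0] - 1.0

def excess_return(pick_prices: dict[str, list[float]], benchmark_prices: list[float]) -> float:
    """Equal-weighted return of the picks minus the benchmark's return ('alpha' in the loose sense)."""
    portfolio = sum(simple_return(p) for p in pick_prices.values()) / len(pick_prices)
    return portfolio - simple_return(benchmark_prices)

# Made-up numbers purely to show the mechanics:
picks = {"AAA": [100.0, 104.0], "BBB": [50.0, 49.0]}
benchmark = [4000.0, 4080.0]
print(f"Excess return: {excess_return(picks, benchmark):+.2%}")  # -1.00% in this toy example
```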
From a Reuters article published today:
BofA Global Research’s monthly fund manager survey found that 54% of investors think AI stocks are in a bubble, compared with 38% who do not believe a bubble exists.
However, you’d think if this accurately reflected the opinions of people in finance, the bubble would have popped already.
The FTX collapse caused a lot of reputational damage for EA. Depending on how you look at it, AI investments collapsing could cause an even greater amount of reputational damage for EA. So much of EA has gone all-in on near-term AGI and the popping of an AI financial bubble would be hard to square with that. Maybe this is melodramatic because the FTX situation was about concerns of immoral conduct on the part of people in EA and the AI financial bubble would just be about people in EA being epistemically misguided. I don’t know anything and I can’t predict the future.
Some people, like Elon Musk, have indeed said things similar to this in response to DeepMind’s impressive results.
Sutton’s reinforcement learning-oriented perspective, or something close to Sutton’s perspective, anyway, is eloquently argued for in the video by the AI researcher Edan Meyer.
Thanks for the long reply!
These are good arguments. Some were new to me; many I was already aware of. For me, the overall effect of the arguments, the benchmarks, and my own experience is to make me think that a lot of scenarios are plausible. There is a wide uncertainty range. It might well be that AGI takes a long time to arrive, but I also see many trends that suggest it could arrive surprisingly quickly.
For you, the overall conclusion from all the arguments is to completely rule out near-term AGI. That still seems quite wildly overconfident, even if there is a decent case being made for long timelines.
Important correction to my comment above: the AI Impacts survey was actually conducted in October 2023, seven months after the release of GPT-4 in March 2023. So it does reflect AI researchers’ views on AGI timelines after they had time to absorb the impact of ChatGPT and GPT-4.
The XPT superforecasting survey I mentioned was, however, indeed conducted in 2022 just before the launch of ChatGPT in November 2022. So, that’s still a pre-ChatGPT forecast.
I just published a post here about these forecasts. I also wrote a post about 2 weeks ago that adapted my comments above, although unfortunately it didn’t lead to much discussion. I would love to stimulate more debate about this topic.
It would be great, even, if the EA Forum did some kind of debate week or essay competition around whether near-term AGI is likely. Maybe I will suggest that.
I don’t really have a gripe with people who want to put relatively small probabilities on near-term AGI, like the superforecasters who guessed there’s a 1% chance of AGI by 2030. Who knows anything about anything? Maybe Jill Stein has a 1% chance of winning in 2028! But 50% by 2032 is definitely way too high and I actually don’t think there’s a rational basis for thinking that.