On January 1, 2030, there will be no AGI (and AGI will still not be imminent)
On January 1, 2030, there will be no artificial general intelligence (AGI) and AGI will still not be imminent.
A few reasons why I think this:
- If you look at benchmarks like ARC-AGI and ARC-AGI-2, which are easy for humans to solve and intentionally designed to be a low bar for AI to clear, the weaknesses of frontier AI models are starkly revealed.[1]
- Casual, everyday use of large language models (LLMs) reveals major errors on simple thinking tasks, such as not understanding that an event that took place in 2025 could not have caused an event that took place in 2024.
- Progress does not seem like a fast exponential trend, faster than Moore's law and laying the groundwork for an intelligence explosion. Progress seems actually pretty slow and incremental, with a moderate improvement from GPT-3.5 to GPT-4, and another moderate improvement from GPT-4 to o3-mini. The decline in the costs of running the models or the increase in the compute used to train models is probably happening faster than Moore's law, but not the actual intelligence of the models.[2]
- Most AI experts and most superforecasters give much more conservative predictions when surveyed about AGI, closer to 50 or 100 years than 5 or 10 years.[3]
- Most AI experts are skeptical that scaling up LLMs could lead to AGI.[4]
- It seems like there are deep, fundamental scientific discoveries and breakthroughs that would need to be made for building AGI to become possible. There is no evidence we're on the cusp of those happening and it seems like they could easily take many decades.
- Some of the well-known people who are making aggressive predictions about the timeline of AGI now have also made aggressive predictions about the timeline of AGI in the past that were wrong.[5]
- The stock market doesn't think AGI is coming in 5 years.[6]
- There has been little if any clear, observable effect of AI on economic productivity or the productivity of individual firms.[7]
- AI can't yet replace human translators or do other jobs that it seems best-positioned to overtake.
- Progress on AI robotics problems, such as fully autonomous driving, has been dismal. (However, autonomous driving companies have good PR and marketing right up until the day they announce they're shutting down.)
- Discourse about AGI sounds way too millennialist and that's a reason for skepticism.
- The community of people most focused on keeping up the drumbeat of near-term AGI predictions seems insular, intolerant of disagreement or intellectual or social non-conformity (relative to the group's norms), and closed-off to even reasonable, relatively gentle criticism (whether or not they pay lip service to listening to criticism or perform being open-minded). It doesn't feel like a scientific community. It feels more like a niche subculture. It seems like a group of people just saying increasingly small numbers to each other (10 years, 5 years, 3 years, 2 years), hyping each other up (either with excitement or anxiety), and reinforcing each other's ideas all the time. It doesn't seem like an intellectually healthy community.
- A lot of the aforementioned points have been made before and there haven't been any good answers to them.
I'd like to thank Sam Altman, Dario Amodei, Demis Hassabis, Yann LeCun, Elon Musk, and several others who declined to be named for giving me notes on each of the sixteen drafts of this post I shared with them over the past three months. Your feedback helped me polish a rough stone of thought into a diamond of incisive criticism.[8]
Note: I edited this post on 2025-04-12 at 20:30 UTC to add some footnotes.
[1] This video is a good introduction to these benchmarks. If you prefer to read, this blog post is another good introduction. (I edited this post on 2025-04-12 at 20:30 UTC to add this footnote.)
[2] I realized after thinking about it more that trying to guess whether the general intelligence of AI models has been increasing slower or faster than Moore's law from November 2022 to April 2025 is probably not a helpful exercise. I explain why in three sequential comments here, here, and here, and in that third comment, I re-write this paragraph to convey my intended meaning better. (I edited this post on 2025-04-12 at 20:30 UTC to add this footnote.)
[3] This article gives some examples of more conservative predictions. (I edited this post on 2025-04-12 at 20:30 UTC to add this footnote.)
[4] The source for this claim is this 2025 report from the Association for the Advancement of Artificial Intelligence. This comment has more details. (I edited this post on 2025-04-12 at 20:30 UTC to add this footnote.)
[5] I gave an example in a comment here. (I edited this post on 2025-04-12 at 20:30 UTC to add this footnote.)
[6]
[7] After making this post, I found this paper that looks at the productivity impact of LLMs on people working in customer support. I pull an interesting quote from the study in this comment. (I edited this post on 2025-04-12 at 20:30 UTC to add this footnote.)
[8] This last paragraph with my "acknowledgements" is a joke, but the rest of the post isn't a joke. (I edited this post on 2025-04-12 at 20:30 UTC to add this footnote.)
Moore's law is ~1 doubling every 2 years. Barnes' law is ~4 doublings every 2 years:
I think if you surveyed any expert on LLMs and asked them "which was a greater jump in capabilities, GPT-2 to GPT-3 or GPT-3 to GPT-4?" the vast majority would say the former, and I would agree with them. This graph doesn't capture that, which makes me cautious about over-relying on it.
That's a really broad question though. If you asked something like, which system unlocked the most real-world value in coding, people would probably say the jump to a more recent model like o3-mini or Gemini 2.5.
You could similarly argue the jump from infant to toddler is much more profound in terms of general capabilities than college student to PhD, but the latter is more relevant in terms of unlocking new research tasks that can be done.
I would be curious to know what the best benchmarks are which show a sub-Moore's-law trend.
Hi Ben. Is there any bet you would be willing to make about the impact of AI on large scale outcomes, like global catastrophes, unemployment, economic growth, or energy consumption? I am open to bets against short AI timelines, or what they supposedly imply, up to 10 k$.
Pay attention to the rest of that paragraph you quoted from:
Measuring intelligence is hard. On the wrong benchmark, a calculator is superintelligent. And yet a calculator lacks what we talk about when we talk about human intelligence, animal intelligence, and hypothetical future artificial general intelligence, like the robots and androids and sentient supercomputers that populate sci-fi.
I don't think ARC-AGI-2 is some perfect encapsulation of the essence of intelligence. It's more or less a puzzle game. But it's refreshing in that it does more than many benchmarks in teasing out some of the differences in intellectual capability between present-day deep neural networks and ordinary humans.
ARC-AGI-2 does not attempt to be a test of whether an AI system is an AGI or not. It's intended to be a low bar for AI systems to clear. The idea is to make it easy enough for AI systems that they have some hope of getting a high score within the next few years because the goal is to move AI research forward (and not just prove a point about artificial intelligence vs. human intelligence or something like that). So, getting a high score on ARC-AGI-2 would show incremental progress toward AGI; not getting a high score on ARC-AGI-2 over the next several years would show slow progress or a lack of progress toward AGI. (No result, even a score of 100%, as cool and impressive as that would be, would show that an AI system is AGI.)
Badly operationalizing a concept like "intelligence" is worse than not operationalizing it at all. If you operationalize "happiness" as "the number of times a person smiles per day", you've actually gone backwards in your understanding of happiness and would have been better off sticking to a looser, more nebulous conceptualization. To the extent we want to measure such complex and puzzling phenomena, we need really carefully designed measurement tools.
When we're measuring AI, the selection of which tasks we're evaluating on really matters. On the sort of tasks that frontier AI models struggle with, the length of tasks that AI can successfully do has not been reliably doubling. If you drew a chart for the GPT models on ARC-AGI-2, it would mostly just be a flat line. These are the results:
GPT-2: 0.0%
GPT-3: 0.0%
GPT-3.5: 0.0%
GPT-4: 0.0%
GPT-4o: 0.0%
GPT-4.5: 0.0%
o3-mini-high: 0.0%
It's only with the o3-low and o1-pro models that we see scores above 0%, but still below 5%. Getting above 0% on ARC-AGI-2 is an interesting result and getting much higher scores on the previous version of the benchmark, ARC-AGI, is an interesting result. There's a nuanced discussion to be had about that topic. But I don't see how you could use these results to draw a trendline of AI models rapidly barrelling toward AGI.
… which is what (super)-exponential growth looks like, yes?
Specifically: We've gone from o1 (low) getting 0.8% to o3 (low) getting 4% in ~1 year, which is ~2 doublings per year (i.e. 4x Moore's law). Forecasting from so few data points sure seems like a cursed endeavor to me, but if you want to do it then I don't see how you can rule out Moore's-law-or-faster growth.
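As a rough check of the arithmetic in this comment, here is a minimal Python sketch. The function name `doublings_per_year` is hypothetical, the only inputs are the two scores quoted above (0.8% and 4%, assumed to be exactly one year apart), and Moore's law is taken as one doubling every two years, as stated earlier in the thread.

```python
import math

def doublings_per_year(start: float, end: float, years: float) -> float:
    """Number of doublings per year implied by growth from start to end."""
    if start <= 0 or end <= 0 or years <= 0:
        # A doubling rate is undefined when starting from zero.
        raise ValueError("start, end, and years must all be positive")
    return math.log2(end / start) / years

# Scores quoted above: o1 (low) at 0.8% and o3 (low) at 4%.
# Assumption: the two results are exactly 1 year apart.
arc_rate = doublings_per_year(0.8, 4.0, 1.0)

# Moore's law as described in this thread: ~1 doubling every 2 years.
moore_rate = 1 / 2

print(f"ARC-AGI-2 doublings per year: {arc_rate:.2f}")
print(f"Multiple of Moore's law: {arc_rate / moore_rate:.1f}x")
```

With these inputs the result is a little over 2 doublings per year, in line with the "~2 doublings" figure. The function deliberately rejects a starting score of 0, since no finite doubling rate describes growth from 0.0%.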
By some accounts, growth from 0.0 to 4.0 is infinite growth, which is infinitely faster than Moore's law!
More seriously, I didn't really think through precisely whether artificial intelligence could be increasing faster than Moore's law. I guess in theory it could. I forgot that Moore's law speed actually isn't that impressive on its own. It has to compound over decades to be impressive.
If I eat a sandwich today and eat two sandwiches tomorrow, the growth rate in my sandwich consumption is astronomically faster than Moore's law. But what matters is if the growth rate continues and compounds long-term.
The bigger picture is how to measure general intelligence or "fluid intelligence" in a way that makes sense. The Elo rating of AlphaGo probably increased faster than Moore's law from 2014 to 2017. But we don't see the Elo rating of AlphaGo as a measure of AGI, or else AGI would have already been achieved in 2015.
I think essentially all of these benchmarks and metrics for LLM performance are like the Elo rating of AlphaGo in this respect. They are measuring a narrow skill.
Fair enough, but in that case I feel kind of confused about what your statement "Progress does not seem like a fast exponential trend, faster than Moore's law" was intended to imply.
If the claim you are making is "AGI by 2030 will require some growth faster than Moore's law" then the good news is that almost everyone agrees with you but the bad news is that everyone already agrees with you so this point is not really cruxy to anyone.
Maybe you have an additional claim like "...and growth faster than Moore's law is unlikely"? If so, I would encourage you to write that because I think that is the kind of thing that would engage with people's cruxes!
So, what I originally wrote is:
To remove the confusing part about Moore's law, I could re-word it like this:
I think this conveys my meaning better than what I wrote originally, and it avoids getting into the Moore's law topic.
The Moore's law topic is a bit of an unnecessary rabbit hole. A lot of things increase faster than Moore's law during a short window of time, but few increase at a CAGR of 41% (or whatever Moore's law's CAGR is) for decades. There are all kinds of ways to mis-apply the analogy of Moore's law.
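To make the CAGR point concrete, here is a small Python sketch. It assumes the thread's framing of Moore's law as one doubling every two years; the function names are hypothetical and only illustrate the arithmetic.

```python
# Assumption from this thread: Moore's law is ~1 doubling every 2 years.
DOUBLING_PERIOD_YEARS = 2.0

def cagr_from_doubling_period(period_years: float) -> float:
    """Compound annual growth rate implied by doubling every period_years."""
    return 2 ** (1 / period_years) - 1

def growth_factor(cagr: float, years: float) -> float:
    """Total multiplicative growth after compounding cagr for the given years."""
    return (1 + cagr) ** years

moore_cagr = cagr_from_doubling_period(DOUBLING_PERIOD_YEARS)
print(f"Implied CAGR: {moore_cagr:.1%}")

# Compare a short window with multi-decade compounding.
for years in (2, 10, 40):
    print(f"{years:>2} years: {growth_factor(moore_cagr, years):,.0f}x")
```

Under that assumption the implied rate is the square root of 2 minus 1, or roughly 41% per year. Over 2 years that is only a 2x change, while over 40 years it compounds to about a million-fold, which is why the length of the window matters more than the rate observed in any short stretch.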
People have made jokes about this kind of thing before, like The Economist sarcastically forecasting in 2006 based on then-recent trends that a 14-blade razor would be released by 2010.
I also think of David Deutsch's book The Beginning of Infinity, in which he rails against the practice of uncritically extrapolating past trends forward, and his TED Talk where he does a bit of the same.
My impression is that ARC-AGI (1) is close to being solved, which is why they brought out ARC-AGI-2 a few weeks ago.
Benchmarks are often adversarially selected so they take longer to be saturated, so I don't think little progress on ARC-AGI-2 a few weeks after release (and iirc after any major model release) tells us much at all.
It depends what you want ARC-AGI-2 to tell you. For one, it tells you that current frontier models lack the general intelligence or "fluid intelligence" to solve simple puzzles that pretty much any person can solve. Why is that? Isn't that interesting?
Why should it matter whether new models have been released after the reveal of ARC-AGI-2? If models have to be specifically fine-tuned for these tasks, doesn't that show they are lacking in the capability to generalize to novel problems? If they don't have to be specifically fine-tuned, then the timing shouldn't matter. A model with good generalization capability should be able to do well whether it happens to be released before or after the reveal of the ARC-AGI-2 benchmark.
Another "benchmark" I mused about is the ability of AI systems to generate profit for their users by displacing human labour. It seems like improvement on that "benchmark" has been much, much slower than Moore's law, but, then again, I don't know if anyone's been able to accurately measure that.
The bigger picture is that LLMs have extremely meagre capabilities in many cognitive domains and I haven't seen signs of anything but modest improvement over the last ~2.5 years. I also don't see many people trying to quantify those things.
On one level, that makes sense because it takes time, money/labour, and expertise to create a good benchmark and there is no profit in it. You don't seem to get much acclaim, either. Also, you might feel like you wasted your time if you made a benchmark that frontier AI models got ~0% on and, a year later, they still got ~0%…
On another level, measuring AGI progress carefully and thoughtfully seems important and it's a bit surprising/disappointing that the status quo for benchmarks is so poor.
Why should it matter whether new models have been released after the reveal of ARC-AGI-2? If models have to be specifically fine-tuned for these tasks, doesn't that show they are lacking in the capability to generalize to novel problems?
The main reason is that the benchmark has been pretty adversarially selected, so it's not clear that it's pointing at a significant lack in LM capabilities. I agree that it's weak evidence that they can't generalise to novel problems, but basically all of the update is priced in from just interacting with systems and noticing that they are better in some domains than others.
For one, it tells you that current frontier models lack the general intelligence or "fluid intelligence" to solve simple puzzles that pretty much any person can solve. Why is that? Isn't that interesting?
I disagree that ARC-AGI is strong evidence against LMs having "fluid intelligence". I agree that was the intention of the benchmark, and I think it's weak evidence.
Another "benchmark" I mused about is the ability of AI systems to generate profit for their users by displacing human labour. It seems like improvement on that "benchmark" has been much, much slower than Moore's law, but, then again, I don't know if anyone's been able to accurately measure that.
Has this been a lot slower than Moore's law? I think OpenAI revenue is, on average, more aggressive than Moore's law. I'd guess that LM ability to automate intellectual work is more aggressive than Moore's law, too, but it started from a very low baseline, so it's hard to see. Subjectively, LMs feel like they should be having a larger impact on the economy than they currently are. I think this is more related to horizon length than fluid intelligence, but 🤷‍♂️.
The bigger picture is that LLMs have extremely meagre capabilities in many cognitive domains and I haven't seen signs of anything but modest improvement over the last ~2.5 years. I also don't see many people trying to quantify those things.
I'm curious for examples here, particularly if they are the kinds of things that LMs have affordances for, are intellectual tasks, and are at least moderately economically valuable (so that someone has actually tried to solve them).
I disagree with what you said about ARC-AGI and ARC-AGI-2, but it doesn't seem worth getting into.
I tried to frame the question to avoid counting the revenue or profit of AI companies that sell AI as a product or service. I said:
Generating profit for users is different from generating profit for vendors. Generating profit for users would mean, for example, that OpenAI's customers are generating more profit for themselves by using OpenAI's models than they were before using LLMs.
I realized in some other comments on this post (here and here) that trying to compare these kinds of things to Moore's law is a mistake. As you mentioned, if you start from a low enough baseline, all kinds of things are faster than Moore's law, at least for a while. Also, if you measure all kinds of normal trends within a selective window of time (e.g. number of sandwiches eaten per day from Monday to Tuesday increased from 1 to 2, indicating an upward trajectory many orders of magnitude faster than Moore's law), then you can get a false picture of astronomically fast growth.
Back to the topic of profit… In an interview from sometime in the past few years, Demis Hassabis said that LLMs are mainly being used for "entertainment". I was so surprised by this because you wouldn't expect a statement that sounds so dismissive from someone in his position.
And yet, when I thought about it, that does accurately characterize a lot of what people have used LLMs for, especially initially in 2022 and 2023.
So, to try to measure the usefulness of LLMs, we have to exclude entertainment use cases. To me, one simple, clean way to do that is to measure the profit that people generate by using LLMs. If a corporation, a small business, or a self-employed person pays to use (for example) OpenAI's models, can they increase their profits? And, if so, how much has that increase in profitability changed (if at all) over time, e.g., from 2023 to 2025?
(We would still have to close some loopholes. For example, if a company pays to use OpenAI's API and then just re-packages OpenAI's models for entertainment purposes, then that shouldn't count, since that's the same function I wanted to exclude from the beginning and the only thing that's different is an intermediary has been added.)
I haven't seen much hard data on changes in firm-level profitability or firm-level productivity among companies that adopt LLMs. One of the few sources of data I can find is this study about customer support agents: https://academic.oup.com/qje/article/140/2/889/7990658 (the paper is open access).
Here's an interesting quote:
My main takeaway from this study is that this seems really underwhelming. Maybe worse than underwhelming.
This is somewhat disingenuous. o3-mini (high) is actually on 1.5%, and none of the other models are reasoning (CoT / RL / long inference time) models (oh, and GPT-4.5 is actually on 0.8%). The actual leaderboard looks like this:
Yes the scores are still very low, but it could just be a case of the models not yet "grokking" such puzzles. In a generation or two they might just grok them and then jump up to very high scores (many benchmarks have gone like this in the past few years).
I was not being disingenuous and I find your use of the word "disingenuous" here to be unnecessarily hostile.
I was going off of the numbers in the recent blog post from March 24, 2025. The numbers I stated were accurate as of the blog post.
So that we don't miss the bigger point, I want to reiterate that ARC-AGI-2 is designed to be solved by near-term, sub-AGI AI models with some innovation on the status quo, not to stump them forever. This is François Chollet describing the previous version of the benchmark, ARC-AGI, in a post on Bluesky from January 6, 2025:
To reiterate, ARC-AGI and ARC-AGI-2 are not tests of AGI. They are tests of whether a small, incremental amount of progress toward AGI has occurred. The idea is for ARC-AGI-2 to be solved, hopefully within the next few years and not, like, ten years from now, and then to move on to ARC-AGI-3 or whatever the next benchmark will be called.
Also, ARC-AGI was not a perfectly designed benchmark (for example, Chollet said about half the tasks turned out to be flawed in a way that made them susceptible to "brute-force program search") and ARC-AGI-2 is not a perfectly designed benchmark, either.
ARC-AGI-2 is worth talking about because most, if not all, of the commonly used AI benchmarks have very little usefulness for quantifying general intelligence or quantifying AGI progress. It's the problem of bad operationalization leading to distorted conclusions, as I discussed in my previous comment.
I don't know of other attempts to benchmark general intelligence (or "fluid intelligence") or AGI progress with the same level of carefulness and thoughtfulness as ARC-AGI-2. I would love to hear if there are more benchmarks like this.
One suggestion I've read is that a benchmark should be created with a greater diversity of tasks, since all of the ARC-AGI-2 tasks are part of the same "puzzle game" (my words).
There's a connection between frontier AI models' failures on a relatively simple "puzzle game" like ARC-AGI-2 and why we don't see AI models showing up in productivity statistics, real per capita GDP growth, or taking over jobs. When people try to use AI models for practical tasks in the real world, their usefulness is quite constrained.
I understand the theory that AI will have a super fast takeoff, so that even though it isn't very capable now, it will match and surpass human capabilities within 5 years. But this kind of theory is consistent with pretty much any level of AI performance in the present. People can and did make this argument before ChatGPT, before AlphaGo, even before AlexNet. Ray Kurzweil has been saying this since at least the 1990s.
It's important to have good, constrained, scientific benchmarks like ARC-AGI-2 and hopefully some people will develop another one, maybe with more task diversity. Other good "benchmarks" are economic and financial data around employment, productivity, and economic growth. Can AI actually do useful things that generate profit for users and that displace human labour?
This is a nuanced question, since there are models like AlphaFold (and AlphaFold 2 and 3) that can, at least in theory, improve scientific productivity, but which are narrow in scope and do not exhibit general intelligence or fluid intelligence. You have to frame the question carefully, in a way that actually tests what you want to test.
For example, using LLMs as online support chatbots, where humans are already usually following scripts and flow charts, and for which conventional "Software 1.0" was largely already adequate, is somewhat cool and impressive, but doesn't feel like a good test of general intelligence. A much better sign of AGI progress would be if LLM-based models were able to replace human labour in multiple sorts of jobs where it is impossible to provide precise, step-by-step written instructions.
To frame the question properly would require thought, time, and research.
I think Chollet has shifted the goal posts a bit from when he first developed ARC [ARC-AGI 1]. In his original paper from 2019, Chollet says:
And the original announcement (from June 2024) says:
(And ARC-AGI 1 has now basically been solved). You say:
But we are seeing a continued rapid improvement in A(G)I capabilities, not least along the trajectory to automating AGI development, as per the METR report Ben West mentions.
In his interview with Dwarkesh Patel in June 2024 to talk about the launch of the ARC Prize, Chollet emphasized how easy the ARC-AGI tasks were for humans, saying that even children could do them. This is not something he's saying only now in retrospect that the ARC-AGI tasks have been mostly solved.
That first quote, from the 2019 paper, is consistent with Chollet's January 2025 Bluesky post. That second quote is not from Chollet, but from Mike Knoop. I don't know what the first sentence is supposed to mean, but the second sentence is also consistent with the Bluesky post.
In response to the graph… Just showing a graph go up does not amount to a "trajectory to automating AGI development". The kinds of tasks AI systems can do today are very limited in their applicability to AGI research and development. That has only changed modestly between ChatGPT's release in November 2022 and today.
In 2018, you could have shown a graph of Go performance increasing from 2015 to 2017 and that also would not have been evidence of a trajectory toward automating AGI development. Nor would AlphaZero's tripling of the games a single AI system can master from Go to Go, chess, and shogi. Measuring improved performance on tasks only provides evidence for AGI progress if the tasks you are measuring test for general intelligence.
GPT-2 is not mentioned in the blog post. Nor is GPT-3. Or GPT-3.5. Or GPT-4. Or even GPT-4o! You are writing 0.0% a lot for effect. In the actual blog post, there are only two 0.0% entries, for "gpt-4.5 (Pure LLM)" and "o3-mini-high (Single CoT)"; and note the limitations in parentheses, which you also neglect to include in your list (presumably for effect, given their non-zero scores when not limited in such ways?).
It seems like you are really zeroing in on nitpicky details that make barely any difference to the substance of what I said in order to accuse me of being intentionally deceptive. This is not a cool behaviour.
I am curious to see what will happen in 5 years when there is no AGI. How will people react? Will they just kick their timelines 5 years down the road and repeat the cycle? Will some people attempt to resolve the discomfort by defining AGI as whatever exists in 5 years? Will some people be disillusioned and furious?
I hope that some people engage in soul searching about why they believed AGI was imminent when it wasn't. And near the top of the list of reasons why will be (I believe) intolerance of disagreement about AGI and hostility to criticism of short AGI timelines.
I don't think it's nitpicky at all. A trend showing small, increasing numbers, just above 0, is very different (qualitatively) to a trend that is all flat 0s, as Ben West points out.
If this happens, we will at least know a lot more about how AGI works (or doesn't). I'll be happy to admit I'm wrong (I mean, I'll be happy to still be around, for a start[1]).
I think the most likely reason we won't have AGI in 5 years is that there will be a global moratorium on further development. This is what I'm pushing for.
Then it's a good thing I didn't claim there was "a trend that is all flat 0s" in the comment you called "disingenuous". I said:
This feels like such a small detail to focus on. It feels ridiculous.
??? Was this meant for April Fools' Day? I'm confused.
No, that part was just a joke because I'm a jokester. That's the only part of the post that's a joke. The rest is completely serious.
I appreciated it :)
I'm someone who doesn't think foundation models will scale to AGI. Here is my most recent field report from talking to a couple dozen AI safety / alignment researchers at EAG Bay Area a couple months ago:
Practically everyone was intensely interested in why I don't think foundation models will scale to AGI, so much so that it got annoying, because I was giving the same spiel over and over, when there were many other interesting things that I kinda wanted to talk about.
There were a number of people, all quite new to the fields of AI and AI safety / alignment, for whom it seems to have never crossed their mind until they talked to me that maybe foundation models won't scale to AGI, and likewise who didn't seem to realize that the field of AI is broader than just foundation models.
There were a (quite small) number of people who generally agreed with me. These included one or two agent foundations researchers, and another person (not in the field) who thought the whole AGI thing was stupid (so then I was arguing on the other side that we should expect AGI sooner or later, and that it wasn't centuries away, even if the AGI is not a foundation model).
Putting those two groups aside, everyone else understood what I was talking about and mostly immediately had substantive counterarguments, and I had responses to those, etc.
(I'm not sure how you distinguish between "pay lip service to listening to criticism or perform being open-minded" versus "are actually listening to criticism and are actually being open-minded, but are disagreeing with the criticism"??)
Most people actually wanted to defend something weaker, like "foundation models in conjunction with yet-to-be-invented modifications and scaffolding and whatnot will scale to AGI" (for my part, I think this weaker claim is also wrong).
I think it's worth distinguishing people's gut beliefs from their professed probability distributions. Their professed probabilities almost always include some decent chunk in the scenario that foundation models won't scale to AGI, but rather it will be a totally different AI paradigm. (By "decent chunk" I mean 10% or 20% or whatever.) But they're spending most of their time thinking and talking from their gut belief, forgetting the professed probabilities. (I do this too.)
Thank you for sharing your experience.
The good: it sounds like you talked to a lot of people who were eager to hear a differing opinion.
The bad: it sounds like you talked to a lot of people who had never even heard a differing opinion before and hadn't even considered that a differing opinion could exist.
I have to say, the bad part supports my observation!
When I talk about paying lip service to the idea of being open-minded vs. actually being open-minded, ultimately how you make that distinction is going to be influenced by what opinions you hold. I don't think there is a 100% impartial, objective way of making that distinction.
What I have in mind in this context when I talk about lip service vs. actual open-mindedness is stuff like how a lot of people who believe in the scaling hypothesis and short AGI timelines have ridiculed and dismissed Yann LeCun (for example here, but also so many other times before that) for saying that autoregressive LLMs will never attain AGI. If you want to listen to a well-informed, well-qualified critic, you couldn't ask for someone better than Yann LeCun, no? So, why is the response dismissal and ridicule rather than engaging with the substance of his arguments, "steelmanning", and all that?
Also, when you set the two poles of the argument as people who have 1-year AGI timelines at one pole and people who have 20-year AGI timelines at the opposite pole, you really constrain the diversity of perspectives you are hearing. If you have vigorous debates with people who already agree with you on the broad strokes, you are hearing criticism about the details but not about the broad strokes. That's a problem with insularity.
Really? I've never seen any substantive argument from LeCun. He mostly just presents very weak arguments (and ad hominem) on social media that are falsified within months (e.g. his claims about LLMs not being able to world model). Please link to the best written one you know of.
I don't think it's a good idea to engage with criticism of an idea in the form of meme videos from Reddit designed to dunk on the critic. Is that intellectually healthy?
I don't think the person who made that video or other people who want to dunk on Yann LeCun for that quote understand what he was trying to say. (Benjamin Todd recently made the same mistake here.) I think people are interpreting this quote hyper-literally and missing the broader point LeCun was trying to make.
Even today, in April 2025, models like GPT-4o and o3-mini don't have a robust understanding of things like time, causality, and the physics of everyday objects. They will routinely tell you absurd things, such as that an event that happened in 2024 was caused by an event in 2025, while listing the dates of the events. Why don't LLMs, still, in April 2025, consistently understand that causes precede effects and not vice versa?
If anything, this makes what LeCun said in January 2022 seem prescient. Despite a tremendous amount of scaling of training data and training compute, and, more recently, significant scaling of test-time compute, the same fundamental flaw LeCun called out over 3 years ago remains a flaw in the latest LLMs.
All that being said... I think even if LeCun had made the claim that I think people are mistakenly interpreting him as making and he had turned out to have been wrong about that, discrediting him based on him being wrong about that one thing would be ridiculously uncharitable.
Steven was responding to this:
None of Steven's bullet points support this. Many of them say the exact opposite of this.
Unless I misinterpreted what Steven was trying to say, this supports my observation in the OP about insularity:
How could you possibly never encounter the view that "foundation models won't scale to AGI"? How could an intellectually healthy community produce this outcome?
There's a popular mistake these days of assuming that LLMs are the entirety of AI, rather than a subfield of AI.
If you make this mistake, then you can go from there to either of two faulty conclusions:
(Faulty inference 1) Transformative AI will happen sooner or later [true IMO] THEREFORE LLMs will scale to TAI [false IMO]
(Faulty inference 2) LLMs will never scale to TAI [true IMO] THEREFORE TAI will never happen [false IMO]
I have seen an awful lot of both (1) and (2), including by e.g. CS professors who really ought to know better (example), and I try to call out both of them when I see them.
You yourself seem mildly guilty of something-like-(2), in this very post. Otherwise you would be asking questions like "how quickly can AI paradigms go FROM obscure and unimpressive arxiv papers that nobody has heard of, TO a highly-developed technique subject to untold billions of dollars and millions of person-hours of investment?", and you'd notice that an answer like "5 years" is not out of the question. (See second half of this comment.)
I'm not sure how you define "imminent" in the OP title, but FWIW, LLM skeptic Yann LeCun says human-level AI "will take several years if not a decade...[but with] a long tail", and LLM skeptic Francois Chollet says 2038-2048.
You had never thought through "whether artificial intelligence could be increasing faster than Moore's law." Should we conclude that AI risk skeptics are "insular, intolerant of disagreement or intellectual or social non-conformity (relative to the group's norms), and closed-off to even reasonable, relatively gentle criticism?"
That seems like a non-sequitur and it seems like a calculated insult and not a good faith effort to engage in the substance of my argument.
This is very uncharitable. Especially in light of the recent AI 2027 report, which goes into a huge amount of detail (see also all the research supplements).
There is a good post about the AI 2027 report here. I do not think I am being uncharitable.
In another comment you accuse me of being "unnecessarily hostile". Yet to me, your whole paragraph in the OP here is unnecessarily hostile (somewhat triggering, even):
Calling that sentence uncharitable was an understatement.
For instance, you don't acknowledge that the top 3 most cited AI scientists of all time all have relatively short timelines now.
As for the post you link, it starts with "I have not read the whole thing in detail". I think far too many people critiquing it have not actually read it properly. If they did read it all in detail, they might find that their objections have been answered in one of the many footnotes, appendices, and accompanying research reports. It concludes with "It doesn't really engage with my main objections, nor is it trying to do so", but nowhere are the main objections actually stated! It's all just meta commentary.
I think you are misusing the concept of charity. Or maybe we just disagree on what it means to be charitable or uncharitable in this context because we strongly disagree on the subject matter.
You linked to the website for Ilya Sutskever's company as a citation for the claim that Ilya Sutskever has a relatively short AGI timeline. The website doesn't mention a timeline and I can't find an instance of Ilya Sutskever mentioning a specific timeline.
Yoshua Bengio gave a timeline of 5 to 20 years in 2023, so that's 3 to 18 years now. He says he's 95% confident in this prediction. Okay.
Geoffrey Hinton also says 5 to 20 years, but only with 50% confidence. Hmm. Well, 95% vs. 50% is a big discrepancy, right? Also, he's been saying "5 to 20 years" since 2023, which, if we just take that at face value, means he's actually been pushing back his timeline by about 1-2 years over the past 1-2 years.
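To make the arithmetic behind that last point concrete, here is a rough sketch. It assumes the "5 to 20 years" wording was first given in 2023 and repeated unchanged in 2025; the function name `implied_window` is just something made up for illustration.

```python
def implied_window(stated_in, low_years, high_years):
    """Return the (earliest, latest) calendar years implied by a
    'low to high years from now' forecast made in year `stated_in`."""
    return stated_in + low_years, stated_in + high_years


# Forecast as stated in 2023.
first = implied_window(2023, 5, 20)
# Same wording repeated later; 2025 is an assumed year for the restatement.
later = implied_window(2025, 5, 20)

# How far the start of the implied window moved between the two statements.
shift = later[0] - first[0]

print(f"2023 statement implies {first[0]}-{first[1]}")
print(f"2025 statement implies {later[0]}-{later[1]}")
print(f"Implied window moved back by {shift} years")
```

A forecast phrased as "N years from now" that is repeated unchanged slides forward one year for every year that passes, which is all the sketch shows.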
I think the person who wrote the Tumblr post is pretty clear on what their problem with the AI 2027 report is. To treat the report as an actual prediction about the future, it requires you to be on board with a lot of modelling assumptions. And if you're not already on board with those modelling assumptions, the report doesn't do much to try to convince you. The post gives a specific example of this: the "software intelligence explosion" concept.
Ilya's company website says "Superintelligence is within reach." I think it's reasonable to interpret that as having a short timeline. If not an even stronger claim that he thinks he knows how to actually build it.
Right, and doesn't address any of the meat in the methodology section.
Looking at the methodology section you linked to, this really just confirms the accuracy of nostalgebraist's critique, for me. (nostalgebraist is the Tumblr blogger.) There are a lot of guesses and intuitions. Such as:
Okay? I'm not necessarily saying this is an unreasonable opinion. I don't really know. But this is fundamentally a process of turning intuitions into numbers and turning numbers into a mathematical model. The mathematical model doesn't make the intuitions any more (or less) correct.
Why not 2-15 months? Why not 20-150 years? Why not 4-30 years? It's ultimately about what the authors intuitively find plausible. Other well-informed people could reasonably find very different numbers plausible.
And if you swap out more of the authors' intuitions for other people's intuitions, the end result might be AGI in 2047 or 2077 or 2177 instead of 2027.
Edit: While looking up something else, I found this paper which attempts a similar sort of exercise as the AI 2027 report and gets a very different result.
This is an example of the multiple stages fallacy (as pointed out here), where you can get arbitrarily low probabilities for anything by dividing it up enough and assuming things are uncorrelated.
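For anyone unfamiliar with the mechanics, here is a minimal sketch of how the number of stages alone drives the result. The 80% figure is a hypothetical placeholder, not a number taken from the paper, and `joint_probability` is an illustrative helper.

```python
from math import prod


def joint_probability(stage_probs):
    """Multiply per-stage probabilities, as a staged forecast does.

    This is only valid as a joint probability if each entry is the
    probability of that stage conditional on all earlier stages.
    """
    return prod(stage_probs)


# Hypothetical illustration: every stage is judged "fairly likely" at 80%.
for n_stages in (1, 3, 5, 10):
    p = joint_probability([0.8] * n_stages)
    print(f"{n_stages:2d} stages at 80% each -> joint probability {p:.3f}")

# If later stages are nearly certain once the first one happens, the
# properly conditional numbers are close to 1 and the product barely shrinks.
correlated = joint_probability([0.8] + [0.99] * 9)
print(f"10 strongly correlated stages -> joint probability {correlated:.3f}")
```

The same event can come out as likely or very unlikely depending on how finely it is sliced and on how much the later estimates really condition on the earlier stages.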
I don't find accusations of fallacy helpful here. The authors say in the abstract explicitly that they estimated the probability of each step conditional on the previous ones. So they are not making a simple, formal error like multiplying a bunch of unconditional probabilities whilst forgetting that only works if the probabilities are uncorrelated. Rather, you and Richard Ngo think that their estimates for the explicitly conditional probabilities are too low, and you are speculating that this is because they are still really thinking of the unconditional probabilities. But I don't think "you are committing a fallacy" is a very good or fair way to describe "I disagree with your probabilities and I have some unevidenced speculation about why you are giving probabilities that are wrong".
Saying they are conditional does not mean they are. For example, why is P(We invent a way for AGIs to learn faster than humans | We invent algorithms for transformative AGI) only 40%? Or P(AGI inference costs drop below $25/hr (per human equivalent)[1] | We invent algorithms for transformative AGI) only 16%!? These would be much more reasonable as unconditional probabilities. At the very least, "algorithms for transformative AGI" would be used to massively increase software and hardware R&D, even if expensive at first, such that inference costs would quickly drop.
As an aside, surely this milestone has basically now already been reached? At least for the 90th percentile human in most intellectual tasks.
I don't think you can possibly know whether they really are actually thinking of the unconditional probabilities or whether they just have very different opinions and instincts from you about the whole domain which make very different genuinely conditional probabilities seem reasonable.
It just looks a lot like motivated reasoning to me, kind of like they started with the conclusion and worked backward. Those examples are pretty unreasonable as conditional probabilities. Do they explain why "algorithms for transformative AGI" are very unlikely to meaningfully speed up software and hardware R&D?
One of the authors responds to the comment you linked to and says he was already aware of the concept of the multiple stages fallacy when writing the paper.
But the point I was making in my comment above is how easy it is for reasonable, informed people to generate different intuitions that form the fundamental inputs of a forecasting model like AI 2027. For example, the authors intuit that something would take years, not decades, to solve. Someone else could easily intuit it will take decades, not years.
The same is true for all the different intuitions the model relies on to get to its thrilling conclusion.
Since the model can only exist by using many such intuitions as inputs, ultimately the model is effectively a re-statement of these intuitions, and putting these intuitions into a model doesn't make them any more correct.
In 2-3 years, when it turns out the prediction of AGI in 2027 is wrong, it probably won't be because of a math error in the model but rather because the intuitions the model is based on are wrong.
If they were already aware, they certainly didn't do anything to address it, given their conclusion is basically a result of falling for it.
It's more than just intuitions, it's grounded in current research and recent progress in (proto) AGI. To validate the opposing intuitions (long timelines) requires more in the way of leaps of faith (to say that things will suddenly stop working as they have been). Longer-timeline intuitions have also been proven wrong consistently over the last few years (e.g. AI constantly doing things people predicted were "decades away" just a few years, or even months, before).
-If you look at easy benchmarks like ARC-AGI and ARC-AGI-2 that are easy for humans to solve and intentionally designed to be a low bar for AI to clear, the weaknesses of frontier AI models are starkly revealed.
I don't think they are designed to be a low bar to clear. They seem very adversarially selected, though I agree that LMs do poorly on them relative to subjectively more difficult tasks like coding. It seems pretty hard to make a timelines update from ARC-AGI unless you are very confident in the importance of abstract shape rotation problems for much more concrete problems, or you care about some notion of "intelligence" much more than automating intellectual labour.
Based on what?
This is what François Chollet said about ARC-AGI in a post on Bluesky from January 6, 2025:
On Dwarkesh Patel's podcast, Chollet emphasized that pretty much anybody can solve ARC-AGI puzzles, even children.
You've got to measure something and the most commonly cited benchmarks for LLMs mostly seem to measure memorizing large quantities of text with very limited generalization to novel chunks of text. That's cool, but I don't think it's measuring general intelligence.
ARC-AGI and the new and improved ARC-AGI-2 are specifically designed to measure progress toward AGI by focusing on capabilities that humans have and AI doesn't. I don't know if it succeeds in measuring general intelligence, but I find it a lot more interesting than the benchmarks that reward memorizing text.
I think it would be a good idea for others to take inspiration from ARC-AGI-2 and design new benchmarks that specifically focus on what humans can do ~100% of the time and what AI can do ~0% of the time. If you don't try to measure this, and you aren't really careful and thoughtful in how you measure it, you risk ending up with distorted conclusions about AGI progress.
-Most AI experts are skeptical that scaling up LLMs could lead to AGI.
I don't think this is true. Do you have a source? My guess is that I wouldn't consider many of the people "experts".
-It seems like there are deep, fundamental scientific discoveries and breakthroughs that would need to be made for building AGI to become possible. There is no evidence we're on the cusp of those happening and it seems like they could easily take many decades.
I think this is a pretty strange take. It seems like basically all progress on AI has involved approximately 0 "deep, fundamental scientific discoveries", so I think you need some argument for why the trend will change. Alternatively, if you think we have made lots of discoveries and that explains AI progress so far, then you need an argument for why these discoveries will stop. Or, if you think we have made little AI progress since ~2010 then I think most readers would strongly disagree with you.
The source is a report from the Association for the Advancement of Artificial Intelligence (AAAI): https://aaai.org/wp-content/uploads/2025/03/AAAI-2025-PresPanel-Report-Digital-3.7.25.pdf
Page 7 discusses who they surveyed:
Page 63 discusses the question about scaling:
I have sources for the other specific claims made in the post as well and will provide them on request, but they also should be pretty easy to look up.
I think it's a pretty normal take. If you want to hear the version from a person who won a Turing Award for their contributions to AI, listen to Yann LeCun talk about it. Here's a recent representative example: https://www.pymnts.com/artificial-intelligence-2/2025/meta-large-language-models-will-not-get-to-human-level-intelligence/
He's given lots of talks and interviews where he goes into detail.
Thanks for the link, I haven't come across that report before.
I think Yann has pretty atypical views for people working on LMs. For example, if you take the reference classes of AI-related Turing award winners or Chief scientist types at AI labs, most are far more bullish on LMs (e.g., Hinton, Bengio, Ilya, Jared Kaplan, Schulman).
Let me repeat something I said in the OP:
My impression is that a lot of people who believe in short AGI timelines (e.g. AGI by January 1, 2030) and who believe in some strong version of the scaling hypothesis (e.g. LLMs will scale to AGI with relatively minor fundamental changes but with greatly increased training compute, inference compute, and/or training data) are in an echo chamber where they just reinforce each other's ideas all the time.
What might look like vigorous disagreement is, in many cases, when you zoom out, people with broadly similar views arguing around the margins (e.g. AGI in 3 years vs. 7 years; minimal non-scaling innovations on LLMs vs. modest non-scaling innovations on LLMs).
If people stop to briefly consider what a well-informed critic like Yann LeCun has to say about the topic, it's usually to make fun of him and move on.
It will seem more obvious that you're right if the people you choose to listen to are the people who broadly agree with you and if you meet well-informed disagreement from people like Yann LeCun or François Chollet with dismissal, ridicule, or hostility. This is a recipe for overconfidence. Taken to an extreme, this approach can lead people down a path where they end up deeply misguided.
Yesterday, I watched this talk by François Chollet, which provides support for a few of the assertions I made in this post.