I think if you surveyed any expert on LLMs and asked them "which was a greater jump in capabilities, GPT-2 to GPT-3 or GPT-3 to GPT-4?" the vast majority would say the former, and I would agree with them. This graph doesn't capture that, which makes me cautious about over-relying on it.
That's a really broad question, though. If you asked something like "which system unlocked the most real-world value in coding?", people would probably say the jump to a more recent model like o3-mini or Gemini 2.5.
You could similarly argue the jump from infant to toddler is much more profound in terms of general capabilities than college student to PhD, but the latter is more relevant in terms of unlocking new research tasks that can be done.
I would be curious to know what the best benchmarks are which show a sub-Moore's-law trend.
Hi Ben. Is there any bet you would be willing to make about the impact of AI on large-scale outcomes, like global catastrophes, unemployment, economic growth, or energy consumption? I am open to bets against short AI timelines, or what they supposedly imply, up to 10 k$.
Pay attention to the rest of that paragraph you quoted from:
Progress seems actually pretty slow and incremental, with a moderate improvement from GPT-3.5 to GPT-4, and another moderate improvement from GPT-4 to o3-mini. The decline in the costs of running the models or the increase in the compute used to train models is probably happening faster than Moore's law, but not the actual intelligence of the models.
Measuring intelligence is hard. On the wrong benchmark, a calculator is superintelligent. And yet a calculator lacks what we talk about when we talk about human intelligence, animal intelligence, and hypothetical future artificial general intelligence, like the robots and androids and sentient supercomputers that populate sci-fi.
I don't think ARC-AGI-2 is some perfect encapsulation of the essence of intelligence. It's more or less a puzzle game. But it's refreshing in that it does more than many benchmarks in teasing out some of the differences in intellectual capability between present-day deep neural networks and ordinary humans.
ARC-AGI-2 does not attempt to be a test of whether an AI system is an AGI or not. It's intended to be a low bar for AI systems to clear. The idea is to make it easy enough for AI systems that they have some hope of getting a high score within the next few years because the goal is to move AI research forward (and not just prove a point about artificial intelligence vs. human intelligence or something like that). So, getting a high score on ARC-AGI-2 would show incremental progress toward AGI; not getting a high score on ARC-AGI-2 over the next several years would show slow progress or a lack of progress toward AGI. (No result, even a score of 100%, as cool and impressive as that would be, would show that an AI system is AGI.)
Badly operationalizing a concept like "intelligence" is worse than not operationalizing it at all. If you operationalize "happiness" as "the number of times a person smiles per day", you've actually gone backwards in your understanding of happiness and would have been better off sticking to a looser, more nebulous conceptualization. To the extent we want to measure such complex and puzzling phenomena, we need really carefully designed measurement tools.
When we're measuring AI, the selection of which tasks we're evaluating on really matters. On the sort of tasks that frontier AI models struggle with, the length of tasks that AI can successfully do has not been reliably doubling. If you drew a chart for the GPT models on ARC-AGI-2, it would mostly just be a flat line. These are the results:
GPT-2: 0.0%
GPT-3: 0.0%
GPT-3.5: 0.0%
GPT-4: 0.0%
GPT-4o: 0.0%
GPT-4.5: 0.0%
o3-mini-high: 0.0%
It's only with the o3-low and o1-pro models we see scores above 0% - but still below 5%. Getting above 0% on ARC-AGI-2 is an interesting result and getting much higher scores on the previous version of the benchmark, ARC-AGI, is an interesting result. There's a nuanced discussion to be had about that topic. But I don't see how you could use these results to draw a trendline of AI models rapidly barrelling toward AGI.
If you drew a chart for the GPT models on ARC-AGI-2, it would mostly just be a flat line. It's only with the o3-low and o1-pro models we see scores above 0%
… which is what (super)-exponential growth looks like, yes?
Specifically: We've gone from o1 (low) getting 0.8% to o3 (low) getting 4% in ~1 year, which is ~2 doublings per year (i.e. 4x Moore's law). Forecasting from so few data points sure seems like a cursed endeavor to me, but if you want to do it then I don't see how you can rule out Moore's-law-or-faster growth.
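For what it's worth, here is a quick back-of-the-envelope check of that arithmetic - a minimal sketch that just takes the 0.8% and 4% figures quoted above at face value and treats benchmark-score doublings as the quantity being compared:

```python
import math

# Figures quoted in the comment above (treated as given): o1 (low) at 0.8% and
# o3 (low) at 4% on ARC-AGI-2, roughly one year apart.
score_start, score_end, years = 0.8, 4.0, 1.0

doublings_per_year = math.log2(score_end / score_start) / years   # ~2.3
moores_law_rate = 0.5  # Moore's law: ~1 doubling every 2 years

print(f"Doublings per year: {doublings_per_year:.2f}")
print(f"Multiple of Moore's law: {doublings_per_year / moores_law_rate:.1f}x")
```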
By some accounts, growth from 0.0 to 4.0 is infinite growth, which is infinitely faster than Moore's law!
More seriously, I didn't really think through precisely whether artificial intelligence could be increasing faster than Moore's law. I guess in theory it could. I forgot that Moore's law speed actually isn't that impressive on its own. It has to compound over decades to be impressive.
If I eat a sandwich today and eat two sandwiches tomorrow, the growth rate in my sandwich consumption is astronomically faster than Moore's law. But what matters is if the growth rate continues and compounds long-term.
The bigger picture is how to measure general intelligence or "fluid intelligence" in a way that makes sense. The Elo rating of AlphaGo probably increased faster than Moore's law from 2014 to 2017. But we don't see the Elo rating of AlphaGo as a measure of AGI, or else AGI would have already been achieved in 2015.
I think essentially all of these benchmarks and metrics for LLM performance are like the Elo rating of AlphaGo in this respect. They are measuring a narrow skill.
More seriously, I didn't really think through precisely whether artificial intelligence could be increasing faster than Moore's law.
Fair enough, but in that case I feel kind of confused about what your statement "Progress does not seem like a fast exponential trend, faster than Moore's law" was intended to imply.
If the claim you are making is "AGI by 2030 will require some growth faster than Moore's law", then the good news is that almost everyone agrees with you, but the bad news is that everyone already agrees with you, so this point is not really cruxy to anyone.
Maybe you have an additional claim like "...and growth faster than Moore's law is unlikely"? If so, I would encourage you to write that, because I think that is the kind of thing that would engage with people's cruxes!
So, what I originally wrote is:
Progress does not seem like a fast exponential trend, faster than Moore's law and laying the groundwork for an intelligence explosion. Progress seems actually pretty slow and incremental, with a moderate improvement from GPT-3.5 to GPT-4, and another moderate improvement from GPT-4 to o3-mini. The decline in the costs of running the models or the increase in the compute used to train models is probably happening faster than Moore's law, but not the actual intelligence of the models.
To remove the confusing part about Moore's law, I could re-word it like this:
Progress toward AGI does not seem very fast, not fast enough to lay the groundwork for an intelligence explosion within anything like 5 years. Progress seems actually pretty slow and incremental, with a moderate improvement from GPT-3.5 to GPT-4, and another moderate improvement from GPT-4 to o3-mini. The decline in the costs of running the models or the increase in the compute used to train models is probably very large, but the actual intelligence of the models seems to be increasing only a bit with each new major version.
I think this conveys my meaning better than what I wrote originally, and it avoids getting into the Moore's law topic.
The Moore's law topic is a bit of an unnecessary rabbit hole. A lot of things increase faster than Moore's law during a short window of time, but few increase at a CAGR of 41% (or whatever Moore's law's CAGR is) for decades. There's all kinds of ways to mis-apply the analogy of Moore's law.
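For reference, here is a minimal sketch of where that ~41% figure comes from, assuming the usual ~2-year doubling time for Moore's law:

```python
# A doubling time of T years implies a compound annual growth rate of 2**(1/T) - 1.
for doubling_time_years in (2.0, 1.5, 1.0):
    cagr = 2 ** (1 / doubling_time_years) - 1
    print(f"Doubling every {doubling_time_years} years -> CAGR of roughly {cagr:.0%}")
# Doubling every 2 years gives ~41%, the figure mentioned above.
```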
People have made jokes about this kind of thing before, like The Economist sarcastically forecasting in 2006 based on then-recent trends that a 14-blade razor would be released by 2010.
I also think of David Deutsch's book The Beginning of Infinity, in which he rails against the practice of uncritically extrapolating past trends forward, and his TED Talk where he does a bit of the same.
My impression is that ARC-AGI (1) is close to being solved, which is why they brought out ARC-AGI-2 a few weeks ago.
Benchmarks are often adversarially selected so they take longer to be saturated, so I don't think little progress on ARC-AGI-2 a few weeks after release (and iirc after any major model release) tells us much at all.
It depends what you want ARC-AGI-2 to tell you. For one, it tells you that current frontier models lack the general intelligence or "fluid intelligence" to solve simple puzzles that pretty much any person can solve. Why is that? Isn't that interesting?
Why should it matter whether new models have been released after the reveal of ARC-AGI-2? If models have to be specifically fine-tuned for these tasks, doesn't that show they are lacking in the capability to generalize to novel problems? If they don't have to be specifically fine-tuned, then the timing shouldn't matter. A model with good generalization capability should be able to do well whether it happens to be released before or after the reveal of the ARC-AGI-2 benchmark.
Another "benchmark" I mused about is the ability of AI systems to generate profit for their users by displacing human labour. It seems like improvement on that "benchmark" has been much, much slower than Moore's law, but, then again, I don't know if anyone's been able to accurately measure that.
The bigger picture is that LLMs have extremely meagre capabilities in many cognitive domains and I haven't seen signs of anything but modest improvement over the last ~2.5 years. I also don't see many people trying to quantify those things.
On one level, that makes sense because it takes time, money/labour, and expertise to create a good benchmark and there is no profit in it. You don't seem to get much acclaim, either. Also, you might feel like you wasted your time if you made a benchmark that frontier AI models got ~0% on and, a year later, they still got ~0%…
On another level, measuring AGI progress carefully and thoughtfully seems important and it's a bit surprising/disappointing that the status quo for benchmarks is so poor.
Why should it matter whether new models have been released after the reveal of ARC-AGI-2? If models have to be specifically fine-tuned for these tasks, doesn't that show they are lacking in the capability to generalize to novel problems?
The main reason is that the benchmark has been pretty adversarially selected, so it's not clear that it's pointing at a significant lack in LM capabilities. I agree that it's weak evidence that they can't generalise to novel problems, but basically all of the update is priced in from just interacting with systems and noticing that they are better in some domains than others.
For one, it tells you that current frontier models lack the general intelligence or "fluid intelligence" to solve simple puzzles that pretty much any person can solve. Why is that? Isn't that interesting?
I disagree that ARC-AGI is strong evidence that LMs lack "fluid intelligence" - I agree that was the intention of the benchmark, but I think it's only weak evidence.
Another "benchmark" I mused about is the ability of AI systems to generate profit for their users by displacing human labour. It seems like improvement on that "benchmark" has been much, much slower than Moore's law, but, then again, I don't know if anyone's been able to accurately measure that.
Has this been a lot slower than Moore's law? I think OpenAI revenue is, on average, more aggressive than Moore's law. I'd guess that LM ability to automate intellectual work is more aggressive than Moore's law, too, but it started from a very low baseline, so it's hard to see. Subjectively, LMs feel like they should be having a larger impact on the economy than they currently are. I think this is more related to horizon length than fluid intelligence, but 🤷.
The bigger picture is that LLMs have extremely meagre capabilities in many cognitive domains and I haven't seen signs of anything but modest improvement over the last ~2.5 years. I also don't see many people trying to quantify those things.
I'm curious for examples here - particularly if they are the kinds of things that LMs have affordances for, are intellectual tasks, and are at least moderately economically valuable (so that someone has actually tried to solve them).
I disagree with what you said about ARC-AGI and ARC-AGI-2, but it doesn't seem worth getting into.
I think OpenAI revenue is, on average, more aggressive than Moore's law.
I tried to frame the question to avoid counting the revenue or profit of AI companies that sell AI as a product or service. I said:
the ability of AI systems to generate profit for their users by displacing human labour.
Generating profit for users is different from generating profit for vendors. Generating profit for users would mean, for example, that OpenAI's customers are generating more profit for themselves by using OpenAI's models than they were before using LLMs.
I'd guess that LM ability to automate intellectual work is more aggressive than Moore's law, too, but it started from a very low baseline, so it's hard to see.
I realized in some other comments on this post (here and here) that trying to compare these kinds of things to Moore's law is a mistake. As you mentioned, if you start from a low enough baseline, all kinds of things are faster than Moore's law, at least for a while. Also, if you measure all kinds of normal trends within a selective window of time (e.g. number of sandwiches eaten per day from Monday to Tuesday increased from 1 to 2, indicating an upward trajectory many orders of magnitude faster than Moore's law), then you can get a false picture of astronomically fast growth.
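To make the selective-window point concrete, here is a minimal sketch (using only the toy sandwich numbers above, not real data) of what naively compounding a one-day doubling for a year implies, compared with Moore's law over the same year:

```python
# Toy numbers from the sandwich example above: 1 sandwich on Monday, 2 on Tuesday.
daily_growth_factor = 2 / 1

# Naively compounding that one-day trend for a full year:
sandwiches_per_day_in_a_year = daily_growth_factor ** 365     # ~7.5e109

# Moore's law (~1 doubling every 2 years) compounded over the same year:
moores_law_factor_one_year = 2 ** (1 / 2)                     # ~1.41x

print(f"{sandwiches_per_day_in_a_year:.2e} sandwiches per day")
print(f"{moores_law_factor_one_year:.2f}x from Moore's law")
```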
Back to the topic of profit… In an interview from sometime in the past few years, Demis Hassabis said that LLMs are mainly being used for "entertainment". I was so surprised by this because you wouldn't expect a statement that sounds so dismissive from someone in his position.
And yet, when I thought about it, that does accurately characterize a lot of what people have used LLMs for, especially initially in 2022 and 2023.
So, to try to measure the usefulness of LLMs, we have to exclude entertainment use cases. To me, one simple, clean way to do that is to measure the profit that people generate by using LLMs. If a corporation, a small business, or a self-employed person pays to use (for example) OpenAI's models, can they increase their profits? And, if so, how much has that increase in profitability changed (if at all) over time, e.g., from 2023 to 2025?
(We would still have to close some loopholes. For example, if a company pays to use OpenAI's API and then just re-packages OpenAI's models for entertainment purposes, then that shouldn't count, since that's the same function I wanted to exclude from the beginning and the only thing that's different is an intermediary has been added.)
I haven't seen much hard data on changes in firm-level profitability or firm-level productivity among companies that adopt LLMs. One of the few sources of data I can find is this study about customer support agents: https://academic.oup.com/qje/article/140/2/889/7990658 The paper is open access.
Here's an interesting quote:
In Figure III, Panels B-E we show that less skilled agents consistently see the largest gains across our other outcomes as well. For the highest-skilled workers, we find mixed results: a zero effect on AHT [Average Handle Time] (Panel B); a small but positive effect for CPH [Chats Per Hour] (Panel C); and, interestingly, small but statistically significant decreases in RRs [Resolution Rates] and customer satisfaction (Panels D and E).
These results are consistent with the idea that generative AI tools may function by exposing lower-skill workers to the best practices of higher-skill workers. Lower-skill workers benefit because AI assistance provides new solutions, whereas the best performers may see little benefit from being exposed to their own best practices. Indeed, the negative effects along measures of chat quality (RR and customer satisfaction) suggest that AI recommendations may distract top performers or lead them to choose the faster or less cognitively taxing option (following suggestions) rather than taking the time to come up with their own responses. Addressing this outcome is potentially important because the conversations of top agents are used for ongoing AI training.
My main takeaway from this study is that this seems really underwhelming. Maybe worse than underwhelming.
This is somewhat disingenuous. o3-mini (high) is actually on 1.5%, and none of the other models are reasoning (CoT / RL / long inference time) models (oh, and GPT-4.5 is actually on 0.8%). The actual leaderboard looks like this:
Yes, the scores are still very low, but it could just be a case of the models not yet "grokking" such puzzles. In a generation or two they might just grok them and then jump up to very high scores (many benchmarks have gone like this in the past few years).
I was not being disingenuous and I find your use of the word "disingenuous" here to be unnecessarily hostile.
I was going off of the numbers in the recent blog post from March 24, 2025. The numbers I stated were accurate as of the blog post.
So that we don't miss the bigger point, I want to reiterate that ARC-AGI-2 is designed to be solved by near-term, sub-AGI AI models with some innovation on the status quo, not to stump them forever. This is François Chollet describing the previous version of the benchmark, ARC-AGI, in a post on Bluesky from January 6, 2025:
I don't think people really appreciate how simple ARC-AGI-1 was, and what solving it really means.
It was designed as the simplest, most basic assessment of fluid intelligence possible. Failure to pass signifies a near-total inability to adapt or problem-solve in unfamiliar situations.
Passing it means your system exhibits non-zero fluid intelligence - you're finally looking at something that isn't pure memorized skill. But it says rather little about how intelligent your system is, or how close to human intelligence it is.
To reiterate, ARC-AGI and ARC-AGI-2 are not tests of AGI. They are tests of whether a small, incremental amount of progress toward AGI has occurred. The idea is for ARC-AGI-2 to be solved, hopefully within the next few years and not, like, ten years from now, and then to move on to ARC-AGI-3 or whatever the next benchmark will be called.
Also, ARC-AGI was not a perfectly designed benchmark (for example, Chollet said about half the tasks turned out to be flawed in a way that made them susceptible to "brute-force program search") and ARC-AGI-2 is not a perfectly designed benchmark, either.
ARC-AGI-2 is worth talking about because most, if not all, of the commonly used AI benchmarks have very little usefulness for quantifying general intelligence or quantifying AGI progress. It's the problem of bad operationalization leading to distorted conclusions, as I discussed in my previous comment.
I don't know of other attempts to benchmark general intelligence (or "fluid intelligence") or AGI progress with the same level of carefulness and thoughtfulness as ARC-AGI-2. I would love to hear if there are more benchmarks like this.
One suggestion I've read is that a benchmark should be created with a greater diversity of tasks, since all of the ARC-AGI-2 tasks are part of the same "puzzle game" (my words).
There's a connection between frontier AI models' failures on a relatively simple "puzzle game" like ARC-AGI-2 and why we don't see AI models showing up in productivity statistics, real per capita GDP growth, or taking over jobs. When people try to use AI models for practical tasks in the real world, their usefulness is quite constrained.
I understand the theory that AI will have a super fast takeoff, so that even though it isn't very capable now, it will match and surpass human capabilities within 5 years. But this kind of theory is consistent with pretty much any level of AI performance in the present. People can and did make this argument before ChatGPT, before AlphaGo, even before AlexNet. Ray Kurzweil has been saying this since at least the 1990s.
It's important to have good, constrained, scientific benchmarks like ARC-AGI-2 and hopefully some people will develop another one, maybe with more task diversity. Other good "benchmarks" are economic and financial data around employment, productivity, and economic growth. Can AI actually do useful things that generate profit for users and that displace human labour?
This is a nuanced question, since there are models like AlphaFold (and AlphaFold 2 and 3) that can, at least in theory, improve scientific productivity, but which are narrow in scope and do not exhibit general intelligence or fluid intelligence. You have to frame the question carefully, in a way that actually tests what you want to test.
For example, using LLMs as online support chatbots, where humans are already usually following scripts and flow charts, and for which conventional "Software 1.0" was largely already adequate, is somewhat cool and impressive, but doesn't feel like a good test of general intelligence. A much better sign of AGI progress would be if LLM-based models were able to replace human labour in multiple sorts of jobs where it is impossible to provide precise, step-by-step written instructions.
To frame the question properly would require thought, time, and research.
I think Chollet has shifted the goal posts a bit from when he first developed ARC [ARC-AGI 1]. In his original paper from 2019, Chollet says:
"We argue that ARC [ARC-AGI 1] can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans."
And the original announcement (from June 2024) says:
A solution to ARC-AGI [1], at a minimum, opens up a completely new programming paradigm where programs can perfectly and reliably generalize from an arbitrary set of priors. We also believe a solution is on the critical path towards AGI…
(And ARC-AGI 1 has now basically been solved). You say:
I understand the theory that AI will have a super fast takeoff, so that even though it isn't very capable now, it will match and surpass human capabilities within 5 years. But this kind of theory is consistent with pretty much any level of AI performance in the present.
But we are seeing a continued rapid improvement in A(G)I capabilities, not least along the trajectory to automating AGI development, as per the METR report Ben West mentions.
In his interview with Dwarkesh Patel in June 2024 to talk about the launch of the ARC Prize, Chollet emphasized how easy the ARC-AGI tasks were for humans, saying that even children could do them. This is not something he's saying only now, in retrospect, after the ARC-AGI tasks have been mostly solved.
That first quote, from the 2019 paper, is consistent with Chollet's January 2025 Bluesky post. That second quote is not from Chollet, but from Mike Knoop. I don't know what the first sentence is supposed to mean, but the second sentence is also consistent with the Bluesky post.
In response to the graph… Just showing a graph go up does not amount to a "trajectory to automating AGI development". The kinds of tasks AI systems can do today are very limited in their applicability to AGI research and development. That has only changed modestly between ChatGPT's release in November 2022 and today.
In 2018, you could have shown a graph of go performance increasing from 2015 to 2017 and that also would not have been evidence of a trajectory toward automating AGI development. Nor would AlphaZero's tripling of the games a single AI system can master from go to go, chess, and shogi. Measuring improved performance on tasks only provides evidence for AGI progress if the tasks you are measuring test for general intelligence.
I was not being disingenuous and I find your use of the word "disingenuous" here to be unnecessarily hostile.
I was going off of the numbers in the recent blog post from March 24, 2025. The numbers I stated were accurate as of the blog post.
GPT-2 is not mentioned in the blog post. Nor is GPT-3. Or GPT-3.5. Or GPT-4. Or even GPT-4o! You are writing 0.0% a lot for effect. In the actual blog post, there are only two 0.0% entries, for "gpt-4.5 (Pure LLM)" and "o3-mini-high (Single CoT)"; and note the limitations in parentheses, which you also neglect to include in your list (presumably for effect, given their non-zero scores when not limited in such ways?).
It seems like you are really zeroing in on nitpicky details that make barely any difference to the substance of what I said in order to accuse me of being intentionally deceptive. This is not a cool behaviour.
I am curious to see what will happen in 5 years when there is no AGI. How will people react? Will they just kick their timelines 5 years down the road and repeat the cycle? Will some people attempt to resolve the discomfort by defining AGI as whatever exists in 5 years? Will some people be disillusioned and furious?
I hope that some people engage in soul searching about why they believed AGI was imminent when it wasn't. And near the top of the list of reasons why will be (I believe) intolerance of disagreement about AGI and hostility to criticism of short AGI timelines.
I don't think it's nitpicky at all. A trend showing small, increasing numbers, just above 0, is very different (qualitatively) to a trend that is all flat 0s, as Ben West points out.
I am curious to see what will happen in 5 years when there is no AGI.
If this happens, we will at least know a lot more about how AGI works (or doesn't). I'll be happy to admit I'm wrong (I mean, I'll be happy to still be around, for a start[1]).
I think the most likely reason we won't have AGI in 5 years is that there will be a global moratorium on further development. This is what I'm pushing for.
A trend showing small, increasing numbers, just above 0, is very different (qualitatively) to a trend that is all flat 0s
Then it's a good thing I didn't claim there was "a trend that is all flat 0s" in the comment you called "disingenuous". I said:
It's only with the o3-low and o1-pro models we see scores above 0% - but still below 5%. Getting above 0% on ARC-AGI-2 is an interesting result and getting much higher scores on the previous version of the benchmark, ARC-AGI, is an interesting result. There's a nuanced discussion to be had about that topic.
This feels like such a small detail to focus on. It feels ridiculous.
Moore's law is ~1 doubling every 2 years. Barnes' law is ~4 doublings every 2 years.
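A minimal sketch of how differently those two rates compound, using a 6-year horizon purely as an illustration:

```python
# ~1 doubling every 2 years (Moore's law) vs. ~4 doublings every 2 years,
# compounded over an illustrative 6-year horizon.
years = 6
moore_factor = 2 ** (years / 2)        # 3 doublings  -> 8x
faster_factor = 2 ** (4 * years / 2)   # 12 doublings -> 4096x

print(f"Moore's law over {years} years: {moore_factor:.0f}x")
print(f"Four doublings per two years over {years} years: {faster_factor:.0f}x")
```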