It depends what you want ARC-AGI-2 to tell you. For one, it tells you that current frontier models lack the general intelligence or "fluid intelligence" to solve simple puzzles that pretty much any person can solve. Why is that? Isn't that interesting?

Why should it matter whether new models have been released after the reveal of ARC-AGI-2? If models have to be specifically fine-tuned for these tasks, doesn't that show they are lacking in the capability to generalize to novel problems? If they don't have to be specifically fine-tuned, then the timing shouldn't matter. A model with good generalization capability should be able to do well whether it happens to be released before or after the reveal of the ARC-AGI-2 benchmark.

Another "benchmark" I mused about is the ability of AI systems to generate profit for their users by displacing human labour. It seems like improvement on that "benchmark" has been much, much slower than Moore's law, but, then again, I don't know if anyone's been able to accurately measure that.

The bigger picture is that LLMs have extremely meagre capabilities in many cognitive domains and I haven't seen signs of anything but modest improvement over the last ~2.5 years. I also don't see many people trying to quantify those things.

On one level, that makes sense because it takes time, money/labour, and expertise to create a good benchmark and there is no profit in it. You don't seem to get much acclaim, either. Also, you might feel like you wasted your time if you made a benchmark that frontier AI models got ~0% on and, a year later, they still got ~0%…

On another level, measuring AGI progress carefully and thoughtfully seems important and it's a bit surprising/disappointing that the status quo for benchmarks is so poor.
Why should it matter whether new models have been released after the reveal of ARC-AGI-2? If models have to be specifically fine-tuned for these tasks, doesn't that show they are lacking in the capability to generalize to novel problems?

The main reason is that the benchmark has been pretty adversarially selected, so it's not clear that it's pointing at a significant lack in LM capabilities. I agree that it's weak evidence that they can't generalise to novel problems, but basically all of the update is priced in from just interacting with systems and noticing that they are better in some domains than others.

For one, it tells you that current frontier models lack the general intelligence or "fluid intelligence" to solve simple puzzles that pretty much any person can solve. Why is that? Isn't that interesting?
I disagree that ARC-AGI is strong evidence that LMs lack "fluid intelligence". I agree that was the intention of the benchmark, but I think it's only weak evidence.
Another "benchmark" I mused about is the ability of AI systems to generate profit for their users by displacing human labour. It seems like improvement on that "benchmark" has been much, much slower than Moore's law, but, then again, I don't know if anyone's been able to accurately measure that.

Has this been a lot slower than Moore's law? I think OpenAI revenue is, on average, more aggressive than Moore's law. I'd guess that LM ability to automate intellectual work is more aggressive than Moore's law, too, but it started from a very low baseline, so it's hard to see. Subjectively, LMs feel like they should be having a larger impact on the economy than they currently are. I think this is more related to horizon length than fluid intelligence, but 🤷‍♂️.

The bigger picture is that LLMs have extremely meagre capabilities in many cognitive domains and I haven't seen signs of anything but modest improvement over the last ~2.5 years. I also don't see many people trying to quantify those things.
I'm curious for examples here, particularly if they are the kinds of things that LMs have affordances for, are intellectual tasks, and are at least moderately economically valuable (so that someone has actually tried to solve them).
I disagree with what you said about ARC-AGI and ARC-AGI-2, but it doesn't seem worth getting into.

I think OpenAI revenue is, on average, more aggressive than Moore's law.
I tried to frame the question to avoid counting the revenue or profit of AI companies that sell AI as a product or service. I said:
the ability of AI systems to generate profit for their users by displacing human labour.
Generating profit for users is different from generating profit for vendors. Generating profit for users would mean, for example, that OpenAI's customers are generating more profit for themselves by using OpenAI's models than they were before using LLMs.

I'd guess that LM ability to automate intellectual work is more aggressive than Moore's law, too, but it started from a very low baseline, so it's hard to see.

I realized in some other comments on this post (here and here) that trying to compare these kinds of things to Moore's law is a mistake. As you mentioned, if you start from a low enough baseline, all kinds of things are faster than Moore's law, at least for a while. Also, if you measure all kinds of normal trends within a selective window of time (e.g. number of sandwiches eaten per day from Monday to Tuesday increased from 1 to 2, indicating an upward trajectory many orders of magnitude faster than Moore's law), then you can get a false picture of astronomically fast growth.
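To make the window-selection point concrete, here is a minimal sketch using the sandwich numbers above. It assumes the common formulation of Moore's law as a doubling roughly every two years (my assumption, not something from the discussion):

```python
# Minimal sketch: why measuring a trend over a tiny window can make it look
# many orders of magnitude faster than Moore's law.
# Assumption (mine): Moore's law = doubling roughly every 2 years.

MOORES_LAW_DOUBLING_YEARS = 2
DAYS_PER_YEAR = 365

# Moore's law as an annual growth factor: 2^(1/2) ≈ 1.41x per year.
moore_annual_factor = 2 ** (1 / MOORES_LAW_DOUBLING_YEARS)

# The sandwich example: 1 sandwich eaten on Monday, 2 on Tuesday,
# i.e. a 2x growth factor measured over a one-day window.
sandwich_daily_factor = 2 / 1

# Naively extrapolate that one-day trend to a full year.
sandwich_annualized_factor = sandwich_daily_factor ** DAYS_PER_YEAR

print(f"Moore's law, annualized:    {moore_annual_factor:.2f}x per year")
print(f"Sandwich trend, annualized: {sandwich_annualized_factor:.2e}x per year")
# ~1.41x vs ~7.5e109x: the absurd number comes from the window, not the trend.
```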
Back to the topic of profit… In an interview from sometime in the past few years, Demis Hassabis said that LLMs are mainly being used for "entertainment". I was so surprised by this because you wouldn't expect a statement that sounds so dismissive from someone in his position.
And yet, when I thought about it, that does accurately characterize a lot of what people have used LLMs for, especially initially in 2022 and 2023.
So, to try to measure the usefulness of LLMs, we have to exclude entertainment use cases. To me, one simple, clean way to do that is to measure the profit that people generate by using LLMs. If a corporation, a small business, or a self-employed person pays to use (for example) OpenAI's models, can they increase their profits? And, if so, how much has that increase in profitability changed (if at all) over time, e.g., from 2023 to 2025?

(We would still have to close some loopholes. For example, if a company pays to use OpenAI's API and then just re-packages OpenAI's models for entertainment purposes, then that shouldn't count, since that's the same function I wanted to exclude from the beginning and the only thing that's different is that an intermediary has been added.)
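As a purely illustrative sketch of the metric I have in mind, the idea is to compare each LLM-using business's profit uplift, net of what it pays the AI vendor, and then track how that uplift changes from year to year. Every firm name and number below is a hypothetical placeholder, not real data:

```python
# Hypothetical sketch of the proposed "profit generated for users" metric.
# All firms and figures are made-up placeholders, purely for illustration.

from dataclasses import dataclass

@dataclass
class FirmYear:
    name: str
    year: int
    profit_with_llm: float   # observed profit while using LLMs
    profit_baseline: float   # estimated profit without LLMs (counterfactual)
    llm_spend: float         # what the firm paid the AI vendor
    entertainment_use: bool  # excluded per the loophole note above

def user_side_uplift(records: list[FirmYear], year: int) -> float:
    """Average net profit uplift per firm attributable to LLM use in a given year."""
    eligible = [r for r in records if r.year == year and not r.entertainment_use]
    if not eligible:
        return 0.0
    gains = [r.profit_with_llm - r.profit_baseline - r.llm_spend for r in eligible]
    return sum(gains) / len(gains)

# Placeholder records (hypothetical):
records = [
    FirmYear("Firm A", 2023, 1_050_000, 1_000_000, 20_000, False),
    FirmYear("Firm B", 2023,   480_000,   500_000,  5_000, False),
    FirmYear("Firm A", 2025, 1_200_000, 1_000_000, 40_000, False),
    FirmYear("Firm C", 2025,   300_000,   290_000,  2_000, True),  # re-packaged entertainment: excluded
]

print("Avg uplift 2023:", user_side_uplift(records, 2023))
print("Avg uplift 2025:", user_side_uplift(records, 2025))
# The interesting quantity is how this uplift changes between 2023 and 2025,
# not the revenue of the AI vendors themselves.
```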
I haven't seen much hard data on changes in firm-level profitability or firm-level productivity among companies that adopt LLMs. One of the few sources of data I can find is this study about customer support agents: https://academic.oup.com/qje/article/140/2/889/7990658 The paper is open access.

Here's an interesting quote:

In Figure III, Panels B–E we show that less skilled agents consistently see the largest gains across our other outcomes as well. For the highest-skilled workers, we find mixed results: a zero effect on AHT [Average Handle Time] (Panel B); a small but positive effect for CPH [Chats Per Hour] (Panel C); and, interestingly, small but statistically significant decreases in RRs [Resolution Rates] and customer satisfaction (Panels D and E).

These results are consistent with the idea that generative AI tools may function by exposing lower-skill workers to the best practices of higher-skill workers. Lower-skill workers benefit because AI assistance provides new solutions, whereas the best performers may see little benefit from being exposed to their own best practices. Indeed, the negative effects along measures of chat quality (RR and customer satisfaction) suggest that AI recommendations may distract top performers or lead them to choose the faster or less cognitively taxing option (following suggestions) rather than taking the time to come up with their own responses. Addressing this outcome is potentially important because the conversations of top agents are used for ongoing AI training.
My main takeaway from this study is that this seems really underwhelming. Maybe worse than underwhelming.