Why should it matter whether new models have been released after the reveal of ARC-AGI-2? If models have to be specifically fine-tuned for these tasks, doesn’t that show they are lacking in the capability to generalize to novel problems?
The main reason is that the benchmark has been pretty adversarially selected, so it’s not clear that it’s pointing at a significant lack in LM capabilities. I agree that it’s weak evidence that they can’t generalise to novel problems, but basically all of the update is priced in from just interacting with systems and noticing that they are better in some domains than others.
For one, it tells you that current frontier models lack the general intelligence or “fluid intelligence” to solve simple puzzles that pretty much any person can solve. Why is that? Isn’t that interesting?
I disagree that ARC-AGI is strong evidence against LMs having “fluid intelligence”. I agree that was the intention of the benchmark, but I think it’s only weak evidence.
Another “benchmark” I mused about is the ability of AI systems to generate profit for their users by displacing human labour. It seems like improvement on that “benchmark” has been much, much slower than Moore’s law, but, then again, I don’t know if anyone’s been able to accurately measure that.
Has this been a lot slower than Moore’s law? I think OpenAI revenue is, on average, more aggressive than Moore’s law. I’d guess that LM ability to automate intellectual work is more aggressive than Moore’s law, too, but it started from a very low baseline, so it’s hard to see. Subjectively, LMs feel like they should be having a larger impact on the economy than they currently are. I think this is more related to horizon length than fluid intelligence, but 🤷‍♂️.
The bigger picture is that LLMs have extremely meagre capabilities in many cognitive domains and I haven’t seen signs of anything but modest improvement over the last ~2.5 years. I also don’t see many people trying to quantify those things.
I’m curious for examples here, particularly ones that LMs have affordances for, that are intellectual tasks, and that are at least moderately economically valuable (so that someone has actually tried to solve them).
I disagree with what you said about ARC-AGI and ARC-AGI-2, but it doesn’t seem worth getting into.
I think OpenAI revenue is, on average, more aggressive than Moore’s law.
I tried to frame the question to avoid counting the revenue or profit of AI companies that sell AI as a product or service. I said:
the ability of AI systems to generate profit for their users by displacing human labour.
Generating profit for users is different from generating profit for vendors. Generating profit for users would mean, for example, that OpenAI’s customers are generating more profit for themselves by using OpenAI’s models than they were before using LLMs.
I’d guess that LM ability to automate intellectual work is more aggressive than Moore’s law, too, but it started from a very low baseline, so it’s hard to see.
I realized in some other comments on this post (here and here) that trying to compare these kinds of things to Moore’s law is a mistake. As you mentioned, if you start from a low enough baseline, all kinds of things are faster than Moore’s law, at least for a while. Also, if you measure all kinds of normal trends within a selective window of time (e.g. number of sandwiches eaten per day from Monday to Tuesday increased from 1 to 2, indicating an upward trajectory many orders of magnitude faster than Moore’s law), then you can get a false picture of astronomically fast growth.
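To put rough numbers on the sandwich example, here is a minimal sketch. It assumes Moore's law means a doubling roughly every two years (the exact period is an assumption, and the helper names are made up for illustration), and it naively extrapolates each trend over a full year.

```python
import math

# Assumption: Moore's law taken as a doubling roughly every two years.
MOORE_DOUBLING_DAYS = 2 * 365


def doubling_time_days(start, end, elapsed_days):
    """Doubling time implied by growing from start to end over elapsed_days."""
    return elapsed_days * math.log(2) / math.log(end / start)


def annual_growth_log10(doubling_days):
    """Orders of magnitude gained in 365 days if the doubling time held."""
    return (365 / doubling_days) * math.log10(2)


# Sandwiches per day going from 1 to 2 between Monday and Tuesday.
sandwich_doubling = doubling_time_days(1, 2, 1)

print("sandwich doubling time (days):", sandwich_doubling)
print("sandwich trend, orders of magnitude per year:",
      annual_growth_log10(sandwich_doubling))
print("Moore's law, orders of magnitude per year:",
      annual_growth_log10(MOORE_DOUBLING_DAYS))
```

Extrapolated this way, the one-day sandwich trend implies roughly 110 orders of magnitude of growth per year, against about 0.15 for a two-year doubling. The window of measurement is doing all of the work.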
Back to the topic of profit… In an interview from sometime in the past few years, Demis Hassabis said that LLMs are mainly being used for “entertainment”. I was so surprised by this because you wouldn’t expect a statement that sounds so dismissive from someone in his position.
And yet, when I thought about it, that does accurately characterize a lot of what people have used LLMs for, especially initially in 2022 and 2023.
So, to try to measure the usefulness of LLMs, we have to exclude entertainment use cases. To me, one simple, clean way to do that is to measure the profit that people generate by using LLMs. If a corporation, a small business, or a self-employed person pays to use (for example) OpenAI’s models, can they increase their profits? And, if so, how much has that increase in profitability changed (if at all) over time, e.g., from 2023 to 2025?
(We would still have to close some loopholes. For example, if a company pays to use OpenAI’s API and then just re-packages OpenAI’s models for entertainment purposes, then that shouldn’t count, since that’s the same function I wanted to exclude from the beginning and the only difference is that an intermediary has been added.)
I haven’t seen much hard data on changes in firm-level profitability or firm-level productivity among companies that adopt LLMs. One of the few sources of data I can find is this study about customer support agents: https://academic.oup.com/qje/article/140/2/889/7990658 The paper is open access.
Here’s an interesting quote:
In Figure III, Panels B–E we show that less skilled agents consistently see the largest gains across our other outcomes as well. For the highest-skilled workers, we find mixed results: a zero effect on AHT [Average Handle Time] (Panel B); a small but positive effect for CPH [Chats Per Hour] (Panel C); and, interestingly, small but statistically significant decreases in RRs [Resolution Rates] and customer satisfaction (Panels D and E).
These results are consistent with the idea that generative AI tools may function by exposing lower-skill workers to the best practices of higher-skill workers. Lower-skill workers benefit because AI assistance provides new solutions, whereas the best performers may see little benefit from being exposed to their own best practices. Indeed, the negative effects along measures of chat quality—RR and customer satisfaction—suggest that AI recommendations may distract top performers or lead them to choose the faster or less cognitively taxing option (following suggestions) rather than taking the time to come up with their own responses. Addressing this outcome is potentially important because the conversations of top agents are used for ongoing AI training.
My main takeaway from this study is that the results seem really underwhelming. Maybe worse than underwhelming.