If you look at benchmarks like ARC-AGI and ARC-AGI-2, which are easy for humans to solve and intentionally designed to be a low bar for AI to clear, the weaknesses of frontier AI models are starkly revealed.
I don’t think they are designed to be a low bar to clear. They seem very adversarially selected, though I agree that LMs do poorly on them relative to subjectively more difficult tasks like coding. It seems pretty hard to make a timelines update from ARC-AGI unless you are very confident in the importance of abstract shape rotation problems for much more concrete problems, or you care about some notion of “intelligence” much more than automating intellectual labour.
I don’t think they are designed to be a low bar to clear.
Based on what?
This is what François Chollet said about ARC-AGI in a post on Bluesky from January 6, 2025:
I don’t think people really appreciate how simple ARC-AGI-1 was, and what solving it really means.
It was designed as the simplest, most basic assessment of fluid intelligence possible.
Failure to pass signifies a near-total inability to adapt or problem-solve in unfamiliar situations. Passing it means your system exhibits non-zero fluid intelligence—you’re finally looking at something that isn’t pure memorized skill. But it says rather little about how intelligent your system is, or how close to human intelligence it is.
On Dwarkesh Patel’s podcast, Chollet emphasized that pretty much anybody can solve ARC-AGI puzzles, even children.
It seems pretty hard to make a timelines update from ARC-AGI unless you are very confident in the importance of abstract shape rotation problems for much more concrete problems, or you care about some notion of “intelligence” much more than automating intellectual labour.
You’ve got to measure something, and the most commonly cited benchmarks for LLMs mostly seem to measure memorization of large quantities of text with very limited generalization to novel chunks of text. That’s cool, but I don’t think it’s measuring general intelligence.
ARC-AGI and the new and improved ARC-AGI-2 are specifically designed to measure progress toward AGI by focusing on capabilities that humans have and AI doesn’t. I don’t know if it succeeds in measuring general intelligence, but I find it a lot more interesting than the benchmarks that reward memorizing text.
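To make concrete what these puzzles look like, here is a minimal Python sketch assuming the public ARC-AGI JSON task format (a "train" list of demonstration input/output grid pairs and a "test" list, where each grid is a list of rows of integers 0–9 standing for colours). The file name and the echo "predictor" are purely illustrative, not anyone's actual solver:

```python
import json

# Load one task, assuming the JSON layout used by the public ARC-AGI task files:
# {"train": [{"input": grid, "output": grid}, ...], "test": [...]}
# where a grid is a list of rows of integers 0-9. File name is hypothetical.
with open("arc_task_example.json") as f:
    task = json.load(f)

# The "train" pairs demonstrate a hidden transformation rule; a solver must
# infer that rule from a handful of examples and apply it to the "test" inputs.
for i, pair in enumerate(task["train"]):
    in_rows, in_cols = len(pair["input"]), len(pair["input"][0])
    out_rows, out_cols = len(pair["output"]), len(pair["output"][0])
    print(f"demonstration {i}: {in_rows}x{in_cols} input -> {out_rows}x{out_cols} output")

# A trivial "predictor" that just echoes the input grid back. Real solvers have
# to generalize from the demonstrations, which is the part current models find hard.
def predict(input_grid):
    return [row[:] for row in input_grid]

# Public evaluation files include the test outputs, so a sketch like this can
# check itself; the hidden test sets obviously do not.
solved = all(predict(p["input"]) == p["output"] for p in task["test"])
print("solved:", solved)
```

The point of the sketch is just that each task is tiny and self-contained: a few demonstration grids, one rule to infer, no reliance on memorized text.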
I think it would be a good idea for others to take inspiration from ARC-AGI-2 and design new benchmarks that specifically focus on what humans can do ~100% of the time and what AI can do ~0% of the time. If you don’t try to measure this, or if you aren’t really careful and thoughtful in how you measure it, you risk ending up with distorted conclusions about AGI progress.