At the moment I think ARC-AGI does a good job of showing the limitations of transformer models on simple tasks that they don’t come across in their training set. I think if a high score were claimed, we’d want to see how it came about. It might be through frontier models demonstrating true understanding, but it might be through shortcut learning, data leakage, or an impressive but overly specific and intuitively unsatisfying solution.
If ARC-AGI were to be broken (within the constraints Chollet and Knoop place on it) I’d definitely change my opinions, but what they’d change to depends on how ARC-AGI was solved. That’s all I’m trying to say in that section (perhaps poorly).
It sounds like you agree with my claims that ARC-AGI isn’t that likely to track progress and that other benchmarks could work better?
(The rest of your response seemed to imply something different.)