Can anyone point me to a good analysis of the ARC test’s legitimacy/value? I was a bit surprised when I listened to the podcast, as they made it seem like a high-quality, general-purpose test, but then I was very disappointed to see it’s just a glorified visual pattern abstraction test. Maybe I missed some discussion of it in the podcasts I listened to, but it just doesn’t seem like people pushed back hard enough on the legitimacy of comparing “language model that is trying to identify abstract geometric patterns through a JSON file” vs. “humans that are just visually observing/predicting the patterns.”
Like, is it wrong to demand that humans should have to do this test purely by interpreting the JSON (with no visual aid)?
Language models have no problem interpreting the image correctly. You can ask them for a description of the input grid and they’ll get it right; they just don’t get the pattern.
I wouldn’t be surprised if that’s correct (though I haven’t seen the tests), but that wasn’t my complaint. A moderately smart/trained human can also probably convert from JSON to a description of the grid, but there’s a substantial difference between seeing even a list of grid square-color labels and actually visualizing it and identifying the patterns. I would hazard a guess that humans who are only given a list of square-color labels (not just the raw JSON) would perform significantly worse if they are not allowed to then draw out the grids.
And I would guess that even if some people do it well, they are doing it well because they convert from text to visualization.
I might be misunderstanding you here. You can easily get ChatGPT to convert the image to a grid representation/visualization, e.g. in Python, not just a list of square-color labels. It can formally draw out the grid any way you want and work with that, but still doesn’t make progress.
Also, to answer your initial question about ARC’s usefulness, the idea is just that these are simple problems where relevant solution strategies don’t exist on the internet. A non-visual ARC analog might be, as Chollet mentioned, Caesar ciphers with non-standard offsets.
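To make the Caesar cipher analog concrete, here is a minimal Python sketch of what such a task might look like: the solver sees a plaintext/ciphertext example pair and has to recover an offset other than the familiar 3 or 13. The function names `caesar_shift` and `infer_offset` are my own for illustration, not anything from Chollet or the ARC materials.

```python
import string
from typing import Optional


def caesar_shift(text: str, offset: int) -> str:
    """Shift ASCII letters by `offset` positions with wraparound; leave other characters alone."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + offset) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + offset) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


def infer_offset(plain: str, cipher: str) -> Optional[int]:
    """Return the smallest offset in 0..25 that maps `plain` to `cipher`, or None if none does."""
    for offset in range(26):
        if caesar_shift(plain, offset) == cipher:
            return offset
    return None


# A "non-standard" offset (assumed here to mean anything other than the usual 3 or 13).
example_cipher = caesar_shift("hello world", 11)
recovered = infer_offset("hello world", example_cipher)
```

The point of the analog is that the rule itself is trivial once spotted; the open question is whether a model generalizes to offsets it has rarely seen written out.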
Just because an LLM can convert something to a grid representation/visualization does not mean it can itself actually “visualize” the thing. A pure-text model will lack the ability to observe anything visually. Just because a blind human can write out some mathematical function that they can input into a graphing calculator, that does not mean that the human necessarily can visualize what shape the function will take, even if the resulting graph is shown to everyone else.
I used GPT-4o, which is multimodal (and in fact was even trained on these images in particular, as I took the examples from the ARC website, not the GitHub). I did test more grid inputs and it wasn’t perfect at ‘visualizing’ them.
I almost clarified that I know some models technically are multi-modal, but my impression is that the visual reasoning abilities of the current models are very limited, so I’m not at all surprised by that result. Among other illustrations of this impression, I’ve occasionally found they struggle to properly describe what is happening in an image beyond a relatively general level.
Looking forward to seeing the ARC performance of future multimodal models. I’m also going to try to think of a text-based ARC analog that is perhaps more general. There are only so many unique simple 2D-grid transformation rules, so it can be brute-forced to some extent.
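As a rough illustration of what brute-forcing simple grid rules could mean, here is a toy Python sketch that tries a small, hand-picked set of whole-grid transformations against the example pairs and returns the first one that explains all of them. The candidate list and the name `find_rule` are assumptions for illustration only; real ARC tasks involve far richer rules than these, and this is not a description of how any actual solver works.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]


def identity(g: Grid) -> Grid:
    return [list(row) for row in g]


def flip_horizontal(g: Grid) -> Grid:
    # Mirror each row left-to-right.
    return [list(row[::-1]) for row in g]


def flip_vertical(g: Grid) -> Grid:
    # Reverse the order of the rows.
    return [list(row) for row in g[::-1]]


def transpose(g: Grid) -> Grid:
    return [list(row) for row in zip(*g)]


def rotate_cw(g: Grid) -> Grid:
    # Rotate 90 degrees clockwise: reverse the rows, then transpose.
    return [list(row) for row in zip(*g[::-1])]


# Hypothetical candidate set, tried in insertion order.
CANDIDATES: Dict[str, Callable[[Grid], Grid]] = {
    "identity": identity,
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
    "rotate_cw": rotate_cw,
}


def find_rule(pairs: List[Tuple[Grid, Grid]]) -> Optional[str]:
    """Return the name of the first candidate consistent with every (input, output) pair."""
    for name, fn in CANDIDATES.items():
        if all(fn(inp) == out for inp, out in pairs):
            return name
    return None


pairs = [([[1, 2], [3, 4]], [[3, 1], [4, 2]])]
rule = find_rule(pairs)
```

A search like this only works while the rule space stays small enough to enumerate, which is the sense in which it can be brute-forced "to some extent".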
The paper that introduces the test is probably what you’re looking for. Based on a skim, it seems to me that it spends a lot of words laying out the conceptual background that would make this test valuable. Obviously it’s heavily selected for making the overall argument that the test is good.
“humans that are just visually observing/predicting the patterns.”
I don’t think that’s actually any simpler than doing it as JSON; it’s just that our brains are tuned for (and we’re more accustomed to) doing it visually. Depending on the specifics of the JSON format, there may be a bit of advantage to being able to have adjacency be natively two-dimensional, but I wouldn’t expect that to make a huge difference.
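For anyone who wants to compare the two presentations directly, here is a small Python sketch that takes one grid and shows it three ways: as raw JSON, as a flat list of cell labels, and as a 2D text rendering. It assumes a grid is a JSON list of rows of small integers (which is how I understand the public task files to be laid out); the sample grid and the helper names `flat_labels` and `render` are made up for illustration.

```python
import json
from typing import List

Grid = List[List[int]]

# Made-up sample in the assumed format: a list of rows, each a list of color codes 0-9.
raw = '{"input": [[0, 0, 3], [0, 3, 0], [3, 0, 0]]}'


def flat_labels(grid: Grid) -> List[str]:
    """One label per cell, losing the native two-dimensional adjacency."""
    return [f"({r},{c})={v}" for r, row in enumerate(grid) for c, v in enumerate(row)]


def render(grid: Grid, symbols: str = ".123456789") -> str:
    """Render each row as a line of characters so neighbors line up vertically too."""
    return "\n".join("".join(symbols[v] for v in row) for row in grid)


grid = json.loads(raw)["input"]
labels = flat_labels(grid)
picture = render(grid)
```

In the rendered version the diagonal is visible at a glance, whereas in the flat label list it has to be reconstructed from the coordinates, which is the adjacency advantage being discussed.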
Again, I’d be interested to actually see humans attempt the test by viewing the raw JSON, without being allowed to see/generate any kind of visualization of the JSON. I suspect that most people will solve it by visualizing and manipulating it in their head, as one typically does with these kinds of problems.
Perhaps you (a person with syntax in their username) would find this challenge quite easy! Personally, I don’t think I could reliably do it without substantial practice, especially if I’m prohibited from visualizing it.