Just because an LLM can convert something to a grid representation/visualization does not mean it can itself actually “visualize” the thing. A pure-text model will lack the ability to observe anything visually. Just because a blind human can write out some mathematical function that they can input into a graphing calculator, that does not mean that the human necessarily can visualize what the function’s shape will take, even if the resulting graph is shown to everyone else.
I used GPT-4o which is multimodal (and in fact was even trained on these images in particular as I took the examples from the ARC website, not the Github). I did test more grid inputs and it wasn’t perfect at ‘visualizing’ them.
I almost clarified that I know some models technically are multi-modal, but my impression is that the visual reasoning abilities of the current models are very limited, so I’m not at all surprised they’re limited. Among other illustrations of this impression, occasionally I’ve found they struggle to properly describe what is happening in an image beyond a relatively general level.
Looking forward to seeing the ARC performance of future multimodal models. I’m also going to try to think of a text-based ARC analog, that is perhaps more general. There are only so many unique simple 2D-grid transformation rules so it can be brute forced to some extent.
Just because an LLM can convert something to a grid representation/visualization does not mean it can itself actually “visualize” the thing. A pure-text model will lack the ability to observe anything visually. Just because a blind human can write out some mathematical function that they can input into a graphing calculator, that does not mean that the human necessarily can visualize what the function’s shape will take, even if the resulting graph is shown to everyone else.
I used GPT-4o which is multimodal (and in fact was even trained on these images in particular as I took the examples from the ARC website, not the Github). I did test more grid inputs and it wasn’t perfect at ‘visualizing’ them.
I almost clarified that I know some models technically are multi-modal, but my impression is that the visual reasoning abilities of the current models are very limited, so I’m not at all surprised they’re limited. Among other illustrations of this impression, occasionally I’ve found they struggle to properly describe what is happening in an image beyond a relatively general level.
Looking forward to seeing the ARC performance of future multimodal models. I’m also going to try to think of a text-based ARC analog, that is perhaps more general. There are only so many unique simple 2D-grid transformation rules so it can be brute forced to some extent.