I used GPT-4o, which is multimodal (and in fact was likely trained on these particular images, since I took the examples from the ARC website, not the GitHub repo). I did test more grid inputs, and it wasn’t perfect at ‘visualizing’ them.
I almost clarified that I know some models are technically multimodal, but my impression is that the visual reasoning abilities of current models are very limited, so I’m not at all surprised they struggle here. Among other illustrations of this impression, I’ve occasionally found they fail to properly describe what is happening in an image beyond a fairly general level.
Looking forward to seeing the ARC performance of future multimodal models. I’m also going to try to think of a text-based ARC analog that is perhaps more general. There are only so many unique simple 2D-grid transformation rules, so ARC can be brute-forced to some extent.
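The brute-force idea can be sketched roughly like this: represent the grid as nested lists (a text-friendly encoding), define a small hypothetical library of candidate transformations, and search for one consistent with every example pair. This is a minimal illustration, not how actual ARC solvers work; the transform names and the tiny rule set are my own assumptions.

```python
# Hypothetical sketch: brute-forcing a text-based ARC-style task by
# searching a small library of 2D-grid transformations for one that
# explains all the given (input, output) example pairs.

def rotate(g):
    # Rotate 90 degrees clockwise.
    return [list(row) for row in zip(*g[::-1])]

def flip_h(g):
    # Mirror left-right.
    return [row[::-1] for row in g]

def flip_v(g):
    # Mirror top-bottom.
    return g[::-1]

def transpose(g):
    # Swap rows and columns.
    return [list(row) for row in zip(*g)]

# A toy rule library; a real solver would compose many more primitives.
TRANSFORMS = {
    "rotate": rotate,
    "flip_h": flip_h,
    "flip_v": flip_v,
    "transpose": transpose,
}

def find_rule(examples):
    """Return the name of a transform consistent with all example pairs, or None."""
    for name, fn in TRANSFORMS.items():
        if all(fn(inp) == out for inp, out in examples):
            return name
    return None

examples = [([[1, 2], [3, 4]], [[3, 1], [4, 2]])]
print(find_rule(examples))  # rotate
```

With only a handful of primitives (plus their compositions), the hypothesis space for simple tasks stays small enough to enumerate, which is what makes this kind of brute force at least partially viable.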