It’s an unfortunate naming clash; there are two different ARC challenges:
ARC-AGI (Chollet et al) - https://github.com/fchollet/ARC-AGI
ARC (AI2 Reasoning Challenge) - https://allenai.org/data/arc
These benchmarks are reporting the second of the two.
LLMs (at least without scaffolding) still do badly on ARC-AGI, and I’d wager Llama 405B doesn’t do well on it either. It’s telling that all the big labs release the 95%+ number they get on AI2-ARC, and not whatever default result they get with ARC-AGI...
(Or, more generally, they report benchmarks where they can go OMG SOTA!!!! rather than ones that helpfully advance the general understanding of what models can do and how far they generalise. Basically, traditional benchmark cards should be seen as the AI equivalent of “IN MICE”.)
Thanks!