Meta has just released Llama 3.1 405B. It’s open-source and in many benchmarks it beats GPT-4o and Claude 3.5 Sonnet:
Zuck’s letter “Open Source AI Is the Path Forward”.
Wait, all the LLMs get 90+ on ARC? I thought LLMs were supposed to do badly on ARC.
It’s an unfortunate naming clash; there are two different ARC Challenges:
ARC-AGI (Chollet et al) - https://github.com/fchollet/ARC-AGI
ARC (AI2 Reasoning Challenge) - https://allenai.org/data/arc
These benchmarks are reporting the second of the two.
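To make the difference concrete, here is a rough sketch of what an item from each benchmark looks like. The examples are made up for illustration, and the field layouts follow the public releases (allenai/ai2_arc on Hugging Face and fchollet/ARC-AGI on GitHub) as I understand them, so check the actual schemas before relying on them:

```python
# Sketch of why the two "ARC" benchmarks are so different in kind.
# Both items below are illustrative, not actual dataset entries.

# AI2-ARC: multiple-choice grade-school science questions,
# scored by answer accuracy over a fixed set of choices.
ai2_arc_item = {
    "question": "Which property of a mineral can be determined just by looking at it?",
    "choices": {
        "text": ["luster", "mass", "weight", "hardness"],
        "label": ["A", "B", "C", "D"],
    },
    "answerKey": "A",
}

# ARC-AGI: few-shot visual-reasoning puzzles. Each task gives a handful of
# input/output grid pairs (integers 0-9 encode colours); the solver must
# infer the transformation and produce the exact output grid for new inputs.
arc_agi_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},  # e.g. swap the two colours
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]]},  # expected output would be [[0, 3], [3, 0]]
    ],
}
```

Getting 95% on the first is mostly a knowledge-and-reading-comprehension result; the second requires inferring a novel rule from a couple of examples, which is what the headline numbers quietly skip over.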
LLMs (at least without scaffolding) still do badly on ARC-AGI, and I’d wager Llama 405B still doesn’t do well on it either. It’s telling that all the big labs report the 95%+ number they get on AI2-ARC, and not whatever default result they get on ARC-AGI...
(Or, in general, reporting benchmarks where they can go OMG SOTA!!!! rather than ones that helpfully advance the general understanding of what models can do and how far they generalise. Basically, traditional benchmark cards should be seen as the AI equivalent of “IN MICE”.)
Thanks!