JWS 🔸 comments on defun’s Quick takes

JWS 🔸 23 Jul 2024 16:55 UTC
16 points
3 ∶ 0
It’s an unfortunate naming clash. There are two different ARC challenges:

ARC-AGI (Chollet et al) - https://github.com/fchollet/ARC-AGI
ARC (AI2 Reasoning Challenge) - https://allenai.org/data/arc
These benchmarks are reporting the second of the two.
LLMs (at least without scaffolding) still do badly on ARC-AGI, and I’d wager Llama 405B doesn’t do well on it either. It’s telling that all the big labs release the 95%+ number they get on AI2-ARC, and not whatever default result they get on ARC-AGI...
(Or, more generally, they report benchmarks where they can go OMG SOTA!!!! rather than ones that helpfully advance the general understanding of what models can do and how far they generalise. Basically, traditional benchmark cards should be seen as the AI equivalent of “IN MICE”.)
- Elliott Thornley 23 Jul 2024 18:15 UTC
  0 points
  0 ∶ 0
  Thanks!