benchmarking model behavior seems increasingly hard to keep track of.
I think there are a bunch of separate layers being analyzed, and it's increasingly unclear how separate they actually are.
e.g.
level 1 → pre-trained only
level 2 → post-trained
level 3 → with ____ amount of inference (pro, high, low, thinking, etc.)
level 4 → with agentic scaffolding (Claude Code, SWE-agent, Codex)
level 5 → context engineering setups inside your agentic repo (ACE, GEPA, AI scientists)
level 6 → the built digital environment (arguably could be partially included in level 4: APIs crafted to be better for LLMs, workflows rewritten to accomplish the same goal in a more verifiable way, UIs that are more readable by LLMs).
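to make the layering concrete, here's a rough sketch of what a benchmark result would have to record before two scores are really comparable (all class and field names here are mine, purely illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BenchRun:
    """hypothetical record of everything a benchmark score actually depends on"""
    model: str                            # level 1/2: which checkpoint
    post_trained: bool                    # level 2: been through post-training?
    inference_budget: str                 # level 3: "low" / "high" / "thinking" / a token budget
    scaffold: Optional[str] = None        # level 4: e.g. "claude-code", "swe-agent"; None for a raw model
    context_setup: Optional[str] = None   # level 5: context engineering / repo-level setup
    environment: Optional[str] = None     # level 6: LLM-friendly APIs, rewritten workflows, readable UIs
    score: float = 0.0
    cost_usd_per_task: float = 0.0        # the ARC-style single number you can always collapse to
```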
In some sense you can boil all of this down to cost per run, the way ARC does, but you will miss important differences in model behavior between systems sitting at the same point on that cost/score frontier.
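for instance (reusing the illustrative BenchRun sketch above, with made-up numbers), two systems can land on exactly the same point of that frontier while being very different objects:

```python
# same score, same cost per task...
run_a = BenchRun(model="raw-frontier-model", post_trained=True,
                 inference_budget="high", score=0.62, cost_usd_per_task=4.0)
run_b = BenchRun(model="older-model", post_trained=True,
                 inference_budget="low", scaffold="evolutionary-search",
                 context_setup="hand-tuned prompts", score=0.62, cost_usd_per_task=4.0)

# ...so the cost-per-run view collapses them onto the same point,
# even though one is a raw model and the other is a heavily scaffolded system
assert (run_a.score, run_a.cost_usd_per_task) == (run_b.score, run_b.cost_usd_per_task)
```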
https://jeremyberman.substack.com/p/how-i-got-the-highest-score-on-arc-agi-again
if you read J. Berman's substack, you will see he uses existing LLMs with an evolutionary scaffolding to get his scores (hard to place this as level 4 or 5). While I'm decently bitter-lesson-pilled, it seems plausible we will see proto-AGIs popping up that are heavily scaffolded before raw models reach that level (though it's also plausible the really useful, generalizable scaffolds just get consumed by the model soon thereafter). The behavior of J. Berman's system is going to be different from that of the first raw model to hit the same score with no scaffolding, and it will pose different threats at the same level of intelligence.
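"evolutionary scaffolding" here means roughly this kind of loop (this is a generic sketch, not Berman's actual pipeline; llm_propose and score_fn are placeholders for a real model call and a task-specific grader):

```python
import random

def llm_propose(prompt: str) -> str:
    """stand-in for an LLM call; in practice this hits a model API"""
    raise NotImplementedError

def evolve(task: str, score_fn, generations: int = 10, pop_size: int = 8) -> str:
    """generic evolutionary scaffold: sample candidates with an LLM,
    keep the best scorers, and use the LLM as the mutation operator"""
    population = [llm_propose(f"propose a solution to: {task}") for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=score_fn, reverse=True)
        parents = ranked[: pop_size // 2]  # selection: keep the top half
        children = [
            llm_propose(f"improve this solution to {task}:\n{random.choice(parents)}")
            for _ in range(pop_size - len(parents))
        ]  # mutation: ask the LLM to vary the survivors
        population = parents + children
    return max(population, key=score_fn)
```

the point being: a system like this sits somewhere in levels 4/5, and its failure modes and behaviors come partly from the outer loop, not just the underlying model.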