Ryan Greenblatt comments on On the Dwarkesh/Chollet Podcast, and the cruxes of scaling to AGI

Ryan Greenblatt 28 Jun 2024 18:20 UTC
2 points
Perhaps in this case ARC-AGI is best used as a suite of benchmarks, where the same model and scaffolding should be used for each?
Yes, it seems reasonable to try out general purpose scaffolds (like what METR does) and include ARC-AGI in general purpose task benchmarks.
I expect substantial performance reductions from general purpose scaffolding, though some fraction will be due to not having prefix compute and to allocating test-time compute less effectively.