Perhaps in this case ARC-AGI is best used as a suite of benchmarks, where the same model and scaffolding should be used for each?
Yes, it seems reasonable to try out general purpose scaffolds (like what METR does) and include ARC-AGI in general purpose task benchmarks.
I expect substantial performance reductions from general purpose scaffolding, though some fraction will be due to not having prefix compute allocating test time compute less effectively.
Yes, it seems reasonable to try out general purpose scaffolds (like what METR does) and include ARC-AGI in general purpose task benchmarks.
I expect substantial performance reductions from general purpose scaffolding, though some fraction will be due to not having prefix compute allocating test time compute less effectively.