I do think this set of benchmarks is neat.
My best guess, however, is that we’re within spitting distance[1] of scaffolded LLMs being able to solve these. (Unscaffolded LLMs would, I think, be way off.)
What I really mean by this is something like “gee, it really seems like you should be able to do this already with good enough scaffolding”. My actual timeline for someone to build a real system that has done it is uncertain; plausible values range from “it’s already happened” to “it takes two or three more years”.