In-depth critiques are super time- and labor-intensive to write, so I sincerely appreciate your effort here! I am pessimistic, but I hope this post gets wider coverage.
While I don’t understand some of the modeling-based critiques here from a cursory read, it was illuminating to learn about the basic model setup, the lack of error bars for parameters the model is especially sensitive to, and the assumptions that so tightly constrain the forecast’s probability space. I am least sympathetic to the “they made guesstimates here and there” line of critique; forecasting seems inherently squishy, so I do not think it is fair to compare it to physics.
Another critique, and one that I am quite sympathetic to, is that the METR trend specifically shows “there’s an exponential trend with doubling time between ~2–12 months on automatically-scoreable, relatively clean + green-field software tasks from a few distributions” (source). METR is especially clear about the drawbacks of their task suite in their RE-bench paper.
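To make concrete why that doubling-time range matters so much, here is a minimal back-of-the-envelope sketch (mine, not METR’s or AI 2027’s; the starting and target horizons are made-up placeholders) of how far apart the two ends of the ~2–12 month range land:

```python
import math

# Illustrative only: placeholder horizons, not numbers from METR or AI 2027.
start_horizon_hours = 1.0      # assumed current 50%-success task horizon
target_horizon_hours = 160.0   # assumed target horizon (~one work-month of tasks)

# Number of horizon doublings needed to get from the current to the target horizon.
doublings_needed = math.log2(target_horizon_hours / start_horizon_hours)

for doubling_time_months in (2, 4, 8, 12):
    years_to_target = doublings_needed * doubling_time_months / 12
    print(f"doubling time {doubling_time_months:>2} mo -> "
          f"~{years_to_target:.1f} years to reach the target horizon")
```

Under these made-up numbers, the endpoint lands anywhere from roughly one year to roughly seven years out depending only on the doubling time, which is part of why the missing error bars mentioned above seem like a big deal.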
I know this is somewhat of a meme in the Safety community at this point (and annoyingly intertwined with the stochastic parrots critique), but I think “are models generalizing?” still remains an important and unresolved question. If LLMs are adopting poor learning heuristics and not generalizing, AI 2027 is predicting a weaker kind of “superhuman” coder — one that can reliably solve software tasks with clean feedback loops but will struggle on open-ended tasks!
Anyway, thanks again for checking the models so thoroughly and for the write-up!
If LLMs are adopting poor learning heuristics and not generalizing, AI 2027 is predicting a weaker kind of “superhuman” coder — one that can reliably solve software tasks with clean feedback loops but will struggle on open-ended tasks!
No, AI 2027 is predicting a kind of superhuman coder that can automate even messy, open-ended research engineering tasks. The forecast attempts to account for gaps between automatically-scoreable, relatively clean + green-field software tasks and all tasks. (Though the adjustment might be too small in practice.)
If LLMs can’t automate such tasks (and nothing else can), then this wouldn’t count as superhuman coder having happened.
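As a rough illustration of how much that clean-to-messy adjustment can matter (my sketch, with made-up numbers, not the actual AI 2027 parameters), one can treat the gap as a few extra doublings on top of the same kind of horizon extrapolation:

```python
import math

# Illustrative only: placeholder values, not AI 2027's actual gap adjustment.
start_horizon_hours = 1.0        # assumed current 50%-success task horizon
target_horizon_hours = 160.0     # assumed target horizon (~one work-month of tasks)
doubling_time_months = 4         # assumed doubling time

base_doublings = math.log2(target_horizon_hours / start_horizon_hours)

# Extra doublings needed to cross the clean/green-field -> messy/open-ended gap.
for gap_doublings in (0, 1, 3, 5):
    years = (base_doublings + gap_doublings) * doubling_time_months / 12
    print(f"gap of {gap_doublings} extra doublings -> ~{years:.1f} years")
```

If the true gap is larger than whatever adjustment the forecast uses, the timeline stretches accordingly rather than the milestone quietly being redefined downward.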