I hope to write about this at length once school ends, but in short, here are the two core reasons I feel AGI in three years is quite implausible:
The models aren't generalizing. LLMs are not stochastic parrots; they are able to learn, but the learning heuristics they adopt seem to be random or imperfect. And no, I don't think METR's newest result is evidence against this.[1]
It is unclear if models are situationally aware, and currently it seems more likely than not that they are not. Laine et al. (2024) shows that current models fall far below human baselines of situational awareness when tested on MCQ-like questions. I am unsure how models would be able to perform long-term planning (a capability I consider crucial for AGI) without being sufficiently situationally aware.
As Beth Barnes put it, their latest benchmark specifically shows that "there's an exponential trend with doubling time between ~2-12 months on automatically-scoreable, relatively clean + green-field software tasks from a few distributions." Real-world tasks rarely have such clean feedback loops; see Section 6 of METR's RE-bench paper for a thorough list of drawbacks and limitations.