Yes, I basically agree that’s the biggest limiting factor at this point.
However, a better base model can improve agency via e.g. better perception (which is still weak).
And although reasoning models are good at science and math, they still make dumb mistakes reasoning about other domains, and very high reliability is needed for agents. So I expect better reasoning models will also help with agency quite a bit.
I feel subtweeted :p As far as I can tell, most of the wider world isn’t aware of the arguments for shorter timelines, and my pieces are aimed at them, rather than people already in the bubble.
That said, I do think there was a significant shortening of timelines from 2022 to 2024, and many people in EA should reassess whether their plans still make sense in light of that (e.g. general EA movement building looks less attractive relative to direct AI work compared to before).
Beyond that, I agree people shouldn’t be making month-to-month adjustments to their plans based on timelines, and should try to look for robust interventions.
I also agree many people should be on paths that build their leverage into the 2030s, even if there’s a chance it’s ‘too late’. It’s possible to get ~10x more leverage by investing in career capital / org building / movement building, and that can easily offset. I’ll try to get this message across in the new 80k AI guide.
Also agree for strategy it’s usually better to discuss specific capabilities and specific transformative effects you’re concerned about, rather than ‘AGI’ in general. (I wrote about AGI because it’s the most commonly used term outside of EA and was aiming to reach new people.)
Apparently there’s a preprint showing Gemini 2.5 gets 20% on the Olympiad questions, which would be in line with the o3 result.
What can we learn from expert AGI forecasts?
I wouldn’t totally defer to them, but I wouldn’t totally ignore them either. (And this is mostly beside the point, since overall I’m critical of using their forecasts and my argument doesn’t rest on this.)
I only came across this paper in the last few days! (The post you link to is from 5th April; my article was first published 21st March.)
I want to see more commentary on the paper before deciding what to do about it. My current understanding:
o3-mini seems to be a lot worse than o3 – it only got ~10% on Frontier Math, similar to o1. (Claude Sonnet 3.7 only gets ~3%.)
So the results actually seem consistent with Frontier Math, except they didn’t test o3, which is significantly ahead of other models.
The other factor seems to be that they evaluated the quality of the proofs rather than the ability to get a correct numerical answer.
I’m not sure data leakage is a big part of the difference.
Here we’re also talking about capabilities rather than harm. If you want to find out how fast cars will be in 5 years, asking the auto industry seems like a reasonable move.
So, OpenAI is telling the truth when it says AGI will come soon and lying when it says AGI will not come soon?
I don’t especially trust OpenAI’s statements on either front.
The framing of the piece is “the companies are making these claims, let’s dig into the evidence for ourselves” not “let’s believe the companies”.
(I think the companies are most worth listening to when it comes to specific capabilities that will arrive in the next 2-3 years.)
I agree those two statements don’t obviously seem inconsistent, though independently it seems to me Dario probably has been too optimistic historically.
I discuss expert views here. I don’t put much weight on the superforecaster estimates you mention at this point because they were made in 2022, before the dramatic shortening in timelines due to ChatGPT (let alone reasoning models).
They also (i) made compute forecasts that were very wrong, (ii) don’t seem to know that much about AI, and (iii) were selected for expertise in forecasting near-term political events, which might not generalise very well to longer-term forecasting of a new technology.
I agree we should consider the forecast, but I think it’s ultimately pretty weak evidence.
The AI experts survey also found a 25% chance of AI that "can do all tasks better than a human" by 2032. I don’t know why they think it’ll take so much longer to "automate all jobs" – it seems likely they’re just not thinking about it very carefully (especially since they estimate a ~50% chance of an intelligence explosion starting after AI can do "all tasks"); or it could be because they think there will be a bunch of jobs where people have a strong preference for a human to be in them (e.g. priest, artist), even if AI is technically better at everything.
The AI experts have also been generally too pessimistic: e.g. in 2023 they predicted that AI couldn’t do simple Python programming until 2025, though it could probably already do that at the time. I expect their answers in the next survey will be shorter again. And they’re also not experts in forecasting.
Thank you!
I am roughly in agreement with this post by an AI expert responding to the other (less good) short-timeline article going around.
This post just points out that the AI 2027 article is an attempt to flesh out a particular scenario, rather than an argument for short timelines, which the authors of AI 2027 would agree with.
I thought that instead of critiquing the parts I’m not an expert in, I might take a look at the part of this post that intersects with my field, where you mention materials science discovery, and pour just a little bit of cold water on it.
So, an important thing to note is that this was not an LLM (neither was AlphaFold), but a specially designed deep learning model for generating candidate material structures.
Yes, I explicitly wanted to point out that AI can be useful to science beyond LLMs.
I covered a bit about them in my last article; this is a nice bit of evidence for their usefulness. The possibility space for new materials is ginormous and humans are not that good at generating new ones: the paper showed that this tool boosted productivity by making that process significantly easier. I don’t like how the paper described this as "idea generation": it evokes the idea that the AI is making its own Newtonian flashes of scientific insight, but actually it’s just mass-generating candidate materials that an experienced professional can sift through.
I agree it’s not having flashes of insight, but I also think people underestimate how useful brute-force problem solving could be. I expect AI to become useful to science well before it has ‘novel insights’ in the way we imagine genius humans to have them.
I think your quoted statement is technically true, but it’s worth mentioning that the 80% faster figure was just for the people previously in the top decile of performance (i.e. the best researchers); for people who were not performing well, there was no evidence of a real difference.
I do say it increased the productivity of ‘top’ researchers, and it’s also clarified through the link. (To my mind, it makes it more impressive, since it was adding value even to the best researchers.)
In practice the effect of the tool on progress was less than this: it was plausibly credited with increasing the number of new patents at a firm by roughly 40%, and the number of actual prototypes by 20%. You can also see that productivity is not continuing to increase: they got their boost from the improved generation pipeline, and now the bottleneck is somewhere else.
20% more prototypes and 40% more patents sounds pretty meaningful.
I was just trying to illustrate that AI is already starting to contribute to scientific productivity in the near-term.
Productivity won’t continually increase until something more like a fully automated scientist is created (which we clearly don’t already have).
To be clear, this is still great, and a clear deep learning success story, but it’s not really in line with colonizing Mars in 2035 or whatever the ASI people are saying now.
I’m not sure I follow. No-one is claiming that AI can already do these things – the claim is that if progress continues, then you could reach a point where AI accelerates AI research, and from there you get to something like ASI, and from there space colonisation. To argue against that you need to show the rate of progress is insufficient to get there.
I think Ege is one of the best proponents of longer timelines, and link to that episode in the article.
I don’t put much stock in the forecast of AI researchers the graph is from. I see the skill of forecasting as very different from the skill of being a published AI researcher. A lot of their forecasts also seem inconsistent. A bit more discussion here: https://80000hours.org/2025/03/when-do-experts-expect-agi-to-arrive/
Financially, I’m already heavily exposed to short AI timelines via my investments.
The next few years, I expect AI revenues to continue to increase 2-4x per year, like they have recently, which gets you to those kinds of numbers in 2027.
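(Rough arithmetic to make the compounding explicit; the 2024 baseline in the sketch below is a hypothetical placeholder, not a figure from this thread.)

```python
# Sketch only: compounding 2-4x/year growth for three years from a
# hypothetical 2024 baseline (the $5bn figure is made up for illustration).
baseline_2024_bn = 5

for growth in (2, 3, 4):  # the 2-4x/year range mentioned above
    revenue_2027_bn = baseline_2024_bn * growth ** 3  # three years of growth
    print(f"{growth}x/year: ~${revenue_2027_bn}bn by 2027")
# 2x/year -> ~$40bn, 3x/year -> ~$135bn, 4x/year -> ~$320bn
```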
There won’t be widespread automation; rather, AI will make money from a few key areas with few barriers, especially programming.
You could then reach an inflection point where AI starts to help with AI research. AI inference gets mostly devoted to that task for a while. Major progress is made, perhaps reaching AGI, without further external deployment.
Revenues would then explode after that point, but OpenAI aren’t going to put that in their investor deck right now. You could also see an acceleration in revenues when agents start to work. And in general I expect revenues to strongly lag capabilities. (Revenue also depends on the gap between the leading model and the best free model.)
Overall I see the near-term revenue figures as consistent with an AGI-soon scenario. I agree $100bn in 2029 is harder to square, but I think that’s in part because OpenAI thinks investors won’t believe higher figures.
The case for AGI by 2030
Thanks, this is helpful.
This is my understanding too – some crucial questions going forward:
How useful are AIs that are mainly good at these verifiable tasks?
How much does getting better at reasoning on these verifiable tasks generalise to other domains? (It seems like at least a bit, e.g. o1 improved at law.)
How well will reinforcement learning work when applied at scale to areas with weaker reward signals?
Pretty sure o1 and Gemini have access to the internet.
The main way it’s potentially misleading is that it’s not a log plot (most benchmark results will look like exponentials on a linear scale) – however, I expect Deep Research would still seem above trend even if it was. I also think it’s helpful to new readers to see some of the charts on linear scales, since in some ways it’s more intuitive.
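(A toy illustration of the linear-vs-log point, with made-up numbers:)

```python
import math

# Hypothetical exponential trend in a benchmark score: dramatic-looking on a
# linear axis, but evenly spaced (a straight line) once you take the log.
scores = [2 ** t for t in range(6)]
log_scores = [round(math.log10(s), 3) for s in scores]

print(scores)      # [1, 2, 4, 8, 16, 32] -> looks like a sudden take-off on a linear axis
print(log_scores)  # [0.0, 0.301, 0.602, 0.903, 1.204, 1.505] -> a steady straight line on a log axis
```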
Glad it’s useful! I categorise RL on chain of thought as a type of post-training, rather than test time compute. (Sometimes people lump them together as both ‘inference scaling’, but I think that’s confusing.) I agree RL opens up novel capabilities you can’t get just from next token prediction on the internet.
For test-time compute, accuracy on the benchmark increases roughly linearly with the log of compute, i.e. you need multiplicative increases in compute to get each additional linear gain in accuracy. It’s similar to the pretraining scaling law.
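A toy sketch of that relationship, with made-up coefficients just to show the shape:

```python
import math

# Hypothetical log-linear fit: accuracy = a + b * log10(test-time compute).
# The coefficients are illustrative, not fitted to any real benchmark.
def accuracy(compute, a=0.30, b=0.05):
    return a + b * math.log10(compute)

for c in (1e3, 1e4, 1e5, 1e6):  # each step is 10x more test-time compute
    print(f"compute={c:.0e}  accuracy~{accuracy(c):.2f}")
# Every 10x increase in compute buys the same ~5 percentage points of accuracy.
```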
I agree test time compute isn’t especially explosive – it mainly serves to “pull forward” more advanced capabilities by 1-2 years.
More broadly, you can swap training for inference: https://epoch.ai/blog/trading-off-compute-in-training-and-inference
On brute force, I mainly took Toby’s thread to be saying we don’t clearly have enough information to know how effective test time compute is vs. brute force.
It’s the first chapter in a new guide about how to help make AI go well (aimed at new audiences).
I think it’s generally important for people who want to help to understand the strategic picture.
Plus in my experience the thing most likely to make people take AI risk more seriously is believing that powerful AI might happen soon.
I appreciate that talking about this could also wake more people up to AGI, but I expect the guide overall will boost the safety talent pool proportionally much more than the talent pool working on speeding up AI.
(And long term I think it’s also better to be open about my actual thinking rather than trying to control the messaging to that degree, and a big part of the case in favour, in my mind, is that it might happen soon.)