So if plastic bags are not a valid reason to stop, it sounds like the Waymo would be at fault for the rear-end accident.
Here's what a California law firm blog says about it:
In California, the driver who hits you from behind (and their insurance company) is almost always responsible for paying for your damages. This is based on a legal concept known as a "presumption of fault," which assumes the rear driver was following too closely or not paying attention. Simply put, every driver has a duty by law to leave enough space to stop safely if the car in front of them brakes.
A Texas law firm blog says that if the lead driver stops abruptly for no reason, they could at most be found partially at fault, but not wholly at fault.
I agree other company failures are evidence for your point. I think Waymo is trying to scale up, and they are limited by the supply of cars at this point.
Thank you. Waymo could certainly deploy a lot more cars if supply of cars fitted with its hardware were the primary limitation. In 2018, Waymo and Jaguar announced a deal where Jaguar would produce up to 20,000 I-Pace hatchbacks for Waymo. The same year, Waymo and Chrysler announced a similar deal for up to 62,000 Pacifica minivans. It's 7 years later and Waymo's fleet is still only 2,000 vehicles. I would bet Waymo winds down operations before deploying 82,000 total vehicles.
Alphabet deciding to open up Waymo to external investment is an interesting signal, especially given that Alphabet has $98 billion in cash and short-term investments. This started in 2020 and is ongoing. The more optimistic explanation I heard is that Alphabet felt some pressure from employees to give them their equity-based compensation, and that required getting a valuation for Waymo, which required external investors. A more common explanation is simply that this is cost discipline; Alphabet is seeking to reduce its cash burn. But then that also means reducing Alphabet's own equity ownership of Waymo, and therefore its share of the future opportunity.
Something I had completely forgotten is that Waymo shut down its self-driving truck program in 2023. This is possibly a bad sign. It's interesting given that Aurora Innovation pivoted from autonomous cars to autonomous trucks, which I believe was on the theory that trucks would be easier. Anthony Levandowski also pivoted from cars to semi-trucks when he founded Pronto AI because semi-trucks were an easier problem, but then pivoted again to off-road dump trucks that haul rubble at mines and quarries (usually driving back and forth in a straight line over and over, in an area with no humans nearby).
...then that means you would only need one remote operator for 120 cars. I think that's pretty scalable.
I couldn't find any reliable information for Waymo, but I found a 2023 article where a Cruise spokesperson said there was one human doing remote assistance for every 15 to 20 autonomous vehicles. That article cites a New York Times article that says a human intervention was needed every 2.5 to 5 miles.
I agree that 50% is not realistic for many tasks. But they do plot some data for higher percent success: … Roughly, I think going from 50% to 99.9% would take you from 2 hours to 4 seconds. Not quite 0, but very bad!
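To spell out that back-of-envelope calculation: METR fits a logistic curve in log2 of task duration, so a 50% horizon plus a fitted slope pins down the horizon at every other success rate. A minimal sketch (the slope value here is illustrative; real fitted slopes vary by model):

```python
import math

def horizon_at(p, h50, beta):
    """Task duration (seconds) at which predicted success equals p,
    assuming a METR-style logistic fit in log2 of task duration:
    P(success) = sigmoid(beta * (log2(h50) - log2(t)))."""
    logit = math.log(p / (1 - p))
    return h50 * 2 ** (-logit / beta)

h50 = 2 * 3600  # a 2-hour 50% time horizon
beta = 0.6      # illustrative slope of the logistic fit

print(horizon_at(0.8, h50, beta))    # ~1,450 s: a ~24-minute 80% horizon
print(horizon_at(0.999, h50, beta))  # ~2.5 s: a few seconds at 99.9%
```

With a slope in this range, the 99.9% horizon lands in the single-digit seconds, which is where a "2 hours to 4 seconds" figure comes from.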
That's interesting, thanks. My other criticism, which is maybe not a criticism of METR's work itself but rather a criticism of how other people interpret it, is just how narrow these tasks are. To infer that being able to do longer and longer tasks implies rapid AGI progress seems like it would require several logical inference steps between the graph and that conclusion, and I don't think I've ever seen anyone spell out the logic. Which tasks you look at and how they are graded is another crucial thing to consider.
I don't understand the details of the coding, math, or question and answer (Q&A or QA) benchmarks, but is the "time" dimension of these not just LLMs producing larger outputs, i.e., using more tokens? And, if so, why would LLMs performing tasks that use more tokens, although it may well be a sign of LLM improvement, indicate anything about AGI progress?
So, many other tasks would seem way more interesting to me if we're trying to assess AGI progress, such as:
Picking stocks and beating the market
Coming up with a new (and correct) idea in science, technology, engineering, math, medicine, economics, social science, etc.
Playing a new video game that just came out with no special instructions and no walkthroughs
If you only choose the kinds of tasks that are least challenging for LLMs, and you don't choose the kinds of tasks that tend to confound LLMs but whose successful completion would indicate the sort of capabilities required for AGI, then aren't you just measuring LLM progress and not AGI progress?
This bugs me because I've seen people post the METR time horizon graph as if it just obviously indicates rapid progress toward AGI, but, to me, it obviously doesn't indicate that at all. I imagine if you asked AI researchers or cognitive scientists, you would get a lot of people agreeing that the graph doesn't indicate rapid progress toward AGI.
I mean, this is an amazing graph and a huge achievement for DeepMind, but it isn't evidence of rapid progress toward AGI:
It's just the progress of one AI system, AlphaStar, on one task, StarCraft.
You could make a similar graph for MuZero showing performance on 60 different tasks, namely, chess, shogi, and go, plus 57 Atari games. What does that show?
If you make a graph with a few kinds of tasks on it that LLMs are good at, like coding, math, and question and answer on automatically gradable benchmarks, what does that show? How do you logically connect that to AGI?
Incidentally, how good is o3 (or any other LLM) at chess, shogi, go, and Atari games? Or at StarCraft? If we're making progress toward artificial general intelligence, shouldn't one system be able to do all of these things?
DeepMind announced Gato in 2022, which tried to combine as many things as possible. But (correct me if I'm wrong) Gato was worse at all those things than models trained to do just one or a few of them. So, that's anti-general artificial intelligence.
I just see so much unclear reasoning when it comes to this sort of thing. The sort of objections I'm bringing up are not crazy complicated or esoteric. To me they just seem straightforward and logical. I imagine you would hear these kinds of objections or better ones if you asked some impartial AI researchers whether they thought the METR graph was strong evidence for rapid progress toward AGI, indicating AGI is likely within a decade. I'm sure somebody somewhere has stated these kinds of objections before. So, what gives?
The whole point of my post above is that the majority of AI experts think a) LLMs probably won't scale to AGI and b) AGI is probably at least 20 years away, if not much longer. So, why don't people in EA engage more with experts who think these things and ask them why they think that? I'm sure they could do a much better job coming up with objections than me. Then you can either accept the objections are right and change your mind or come up with a convincing reply to the objections. But to just not anticipate the objections or not talk to people who are knowledgeable enough to raise objections is very strange. What kind of research is that? That's so unrigorous! Where's the due diligence?
I think EA is in this weird, crazy filter bubble/echo chamber when it comes to AGI where if there's any expert consensus on AGI, it's against the assertions many people in EA make about AGI.[1] And if you try to point out the fairly obvious objections or problems with the arguments or evidence put forward, sometimes people can be incredibly scornful.[2] I think if people do this, they should just write in a permanent marker on their forehead "I have bad epistemic practices". Then they'll at least save people some time from trying to talk to them. Because personally attacking or insulting someone just for disagreeing with them (as most experts do, I might add!!) means they don't want to hear the evidence or reasoning that might change their mind. Or maybe they do want to, on some level, but, for whatever reason, they're sure acting like they don't.
I am trying to nudge in my very small way people in EA to apply more rigour to their arguments and evidence and to stimulate curiosity about what the dissenting views are. E.g., if you learn most AI experts disagree with you and you didn't know that before, that should make you curious about why most experts disagree, and should introduce at least a little doubt about your own views.
If anyone has suggestions on how to stimulate more curiosity about dissenting views on near-term AGI among people in EA, please leave a reply or message me.
[1] As I mentioned in the post, 76% of AI experts think it's unlikely or very unlikely that current AI methods will scale to AGI, but a comment from Steven Byrnes a while ago said that when he talked to "a couple dozen AI safety / alignment researchers at EAG bay area", he spoke to "a number of people, all quite new to the fields of AI and AI safety / alignment, for whom it seems to have never crossed their mind until they talked to me that maybe foundation models won't scale to AGI". That's bananas. That's a huge problem. How can the three-quarters majority view in a field not even be known as a concept to people in EA who are trying to work in that field?
[2] Some recent examples here (the comment) and here (the downvotes), which are comparatively mild, because I honestly can't stomach looking at the worse examples right now.
I couldn't find any reliable information for Waymo, but I found a 2023 article where a Cruise spokesperson said there was one human doing remote assistance for every 15 to 20 autonomous vehicles. That article cites a New York Times article that says a human intervention was needed every 2.5 to 5 miles.
That's very helpful! I'm guessing that since 2023, a person could manage more vehicles than that, and it will continue to improve. But let's just work with that number. I don't think the pay for a person solving problems for Waymos would need to be that much higher than taxi drivers', but even if it is 1.5x or 2x, that would still be an order of magnitude reduction in labor cost. Of course there is additional hardware cost for autonomous vehicles, but I think that can be paid off with a reasonable duty cycle. So then if you grant that the geofencing is less than a 20x safety advantage, I think there is an economic case for the chimeras, as you say.
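For what it's worth, here is the arithmetic behind that labor-cost claim as a minimal sketch (the driver wage is a placeholder figure; the 1:15–1:20 ratios are the 2023 Cruise figures above):

```python
# Rough per-vehicle-hour labor cost: a paid driver in every car versus one
# remote operator shared across many cars. The wage is a placeholder.
driver_wage = 20.0  # $/hour, hypothetical taxi-driver wage

for pay_multiple in (1.5, 2.0):         # operator paid 1.5-2x a driver
    for cars_per_operator in (15, 20):  # 2023 Cruise staffing ratios
        remote = driver_wage * pay_multiple / cars_per_operator
        print(f"{pay_multiple}x pay, 1:{cars_per_operator} -> "
              f"${remote:.2f}/vehicle-hour vs ${driver_wage:.2f}, "
              f"a {driver_wage / remote:.1f}x reduction")
```

Every combination comes out between roughly 7x and 13x, i.e., on the order of a 10x reduction in labor cost per vehicle.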
I don't understand the details of the coding, math, or question and answer (Q&A or QA) benchmarks, but is the "time" dimension of these not just LLMs producing larger outputs, i.e., using more tokens? And, if so, why would LLMs performing tasks that use more tokens, although it may well be a sign of LLM improvement, indicate anything about AGI progress?
My understanding is that 50% success on a 1-hour task means the human expert takes ~1 hour; the computer time is not specified.
As for AGI, I do think that some people treat the 50%-success time horizon on coding tasks reaching ~one month as AGI, when it really only means superhuman coder (I think something like a 50% success rate is appropriate for long coding tasks, because humans rarely produce code that works on the first try). However, once you get to a superhuman coder, the logic goes that progress in AI research will dramatically accelerate. In AI 2027, I think there was some assumption like it would take 90 years at current progress to get something like AGI, but when things accelerate, it ends up happening in a year or two. Another worry is that you might not need AGI for something catastrophic, because an AI could become superhuman in a limited number of tasks, like hacking.
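A toy version of that compression logic, with loudly assumed numbers (the 90-year figure is from the comment above; the 10x-per-year speedup is a pure assumption, not a figure from AI 2027):

```python
# Toy model: N "years of progress at today's rate" are needed, but the
# rate itself compounds as AI R&D is progressively automated.
needed = 90.0            # years of progress at today's rate
rate = 1.0               # progress per calendar year, initially
speedup_per_year = 10.0  # assumed compounding from automated AI R&D

progress = years = 0.0
dt = 0.001
while progress < needed:
    progress += rate * dt
    rate *= speedup_per_year ** dt  # continuous compounding
    years += dt

print(f"{needed:.0f} rate-years of progress in ~{years:.1f} calendar years")
# -> roughly 2.3 calendar years under these assumptions
```

The point is just that once the rate compounds, almost all the progress happens near the end, so "90 years of progress" stops meaning 90 calendar years.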
Incidentally, how good is o3 (or any other LLM) at chess, shogi, go, and Atari games? Or at StarCraft? If we're making progress toward artificial general intelligence, shouldn't one system be able to do all of these things?
You're right it's not fully general, but AI systems are much more general than they were 7 years ago.
I think EA is in this weird, crazy filter bubble/echo chamber when it comes to AGI where if there's any expert consensus on AGI, it's against the assertions many people in EA make about AGI.
LessWrong is also known for having very short timelines, but I think the last survey indicated median time for AGI was 2040. So I do think that it is a vocal minority in EA and LW that have median timelines before 2030. However, I do agree with them that we should be spending significant effort on AI safety because it's plausible that AGI comes in the next five years.
Though it's true that the expert surveys you cite have much longer timelines, I would guess that if you took a poll of people in the AI labs (which I think would qualify as a group of experts; biased though they may be, they do have inside information), their median timelines would be before 2030.
I don't think the pay for a person solving problems for Waymos would need to be that much higher than taxi drivers', but even if it is 1.5x or 2x, that would still be an order of magnitude reduction in labor cost. Of course there is additional hardware cost for autonomous vehicles, but I think that can be paid off with a reasonable duty cycle. So then if you grant that the geofencing is less than a 20x safety advantage, I think there is an economic case for the chimeras, as you say.
This could be true, but you have to account for other elements of the cost structure. For example, can you improve the ratio of engineers to autonomous vehicles from the current ratio of around 1:1 or 1:2 to something like 1:1000 or 1:10,000?
It seems like Waymo is using methods that scale with engineer labour rather than learning methods that scale with data and compute. So, deploying more vehicles in more areas would require commensurately more engineers, and, in addition to being too expensive, there simply are not enough engineers on Earth.
As for AGI, I do think that some people treat the 50%-success time horizon on coding tasks reaching ~one month as AGI, when it really only means superhuman coder (I think something like a 50% success rate is appropriate for long coding tasks, because humans rarely produce code that works on the first try).
It doesn't mean that, either. METR has found that current frontier AI systems are worse in real-world, practical use cases than not using AI at all.
Automatically gradable benchmarks generally don't seem to have much to do with the ability to do tasks in the real world. So, predicting real-world performance based on benchmark performance seems to just be an invalid inference.
Anecdotally, what I hear from people who say they find AI coding assistants useful is that it saves them the time it would take to copy and paste code from Stack Exchange. I have never heard anything along the lines of "it came up with a new idea" or "it was creative". Yet this is what human-level coding would require.
However, once you get to a superhuman coder, the logic goes that progress in AI research will dramatically accelerate. In AI 2027, I think there was some assumption like it would take 90 years at current progress to get something like AGI, but when things accelerate, it ends up happening in a year or two.
The obvious objection to this as it pertains to the initial advent of AGI, rather than to superintelligence, is that this is a chicken-and-egg problem. If you need AGI to do AGI R&D, AGI can't help you develop AGI because you haven't invented it yet. You would need a sub-human AI that can do tasks that speed up AI research and AI engineering. And that seems dubious. Is automating this kind of work not an AGI-level problem?
You're right it's not fully general, but AI systems are much more general than they were 7 years ago.
I don't know if I believe this. LLMs are impressive, but their scope is fairly narrow. They have memorized most of the important digital/digitized text that's available. Their next-token prediction, plus everything layered on top of it (fine-tuning to follow instructions, reinforcement learning from human feedback (RLHF), and chain of thought), results in some impressive behaviours. But they are extremely brittle. They routinely make errors on very basic tasks.
I think of LLMs as just another type of AI system that is proficient in one area, comparable to game-playing RL agents. LLMs are good at many text-related tasks (including math and coding, which are also text), but they aren't able to generalize beyond the text-related tasks they have massive amounts of training data for. They don't do well outside of text-related tasks, they don't do well with novelty, they frequently fail to reason properly, etc.
So, I'm not sure LLMs are all that much more general than previous systems like MuZero and AlphaZero.
Part of generality or generalization is that you should see positive transfer learning, i.e., having skills in some domains should improve the AI system's skills in other domains. But it seems like what we see is the opposite. That is, it seems like we see negative transfer learning. Training an AI on many diverse, heterogeneous tasks from multiple domains seems to hurt performance. That's narrowness, not generality.
LessWrong is also known for having very short timelines, but I think the last survey indicated median time for AGI was 2040. So I do think that it is a vocal minority in EA and LW that have median timelines before 2030.
That's very interesting, if you remember correctly. I would be interested in seeing survey data both for LessWrong and for EA.
If you need AGI to do AGI R&D, AGI can't help you develop AGI because you haven't invented it yet. You would need a sub-human AI that can do tasks that speed up AI research and AI engineering. And that seems dubious. Is automating this kind of work not an AGI-level problem?
So, I'm not sure LLMs are all that much more general than previous systems like MuZero and AlphaZero.
I don't think you can have it both ways: a superhuman coder (that is actually competent, which you don't think AI assistants are now) is relatively narrow AI, but would accelerate AI progress. A superhuman AI researcher is more general (which would drastically speed up AI progress), but is not fully general. I would argue that LLMs are already more general than the set of AI-researcher tasks (though LLMs are currently not good at all of those tasks), because LLMs can competently discuss philosophy, economics, political science, art, history, engineering, science, etc.
For example, can you improve the ratio of engineers to autonomous vehicles from the current ratio of around 1:1 or 1:2 to something like 1:1000 or 1:10,000?
I'm claiming that they could approach an overall staff-to-vehicle ratio of 1:10 if the number of real-time helpers (who don't have to be engineers) and vehicles were dramatically scaled up, and that's enough for profitability.
I would be interested in seeing survey data both for LessWrong and for EA.
The 2023 LessWrong survey gave a median of 2040 for the singularity, and 2030 for "By what year do you think AI will be able to do intellectual tasks that expert humans currently do?". The second question was ambiguous, and some people put it in the past. I haven't seen a similar survey result for EAs, but I expect longer timelines than LW.
I don't think you can have it both ways: a superhuman coder (that is actually competent, which you don't think AI assistants are now) is relatively narrow AI, but would accelerate AI progress. A superhuman AI researcher is more general (which would drastically speed up AI progress), but is not fully general.
I definitely disagree with this. Hopefully what I say below will explain why.
I would argue that LLMs are already more general than the set of AI-researcher tasks (though LLMs are currently not good at all of those tasks), because LLMs can competently discuss philosophy, economics, political science, art, history, engineering, science, etc.
The "general" in artificial general intelligence doesn't just refer to having a large repertoire of skills. Generality is about the ability to learn and to generalize beyond what a system has seen in its training data. An artificial general intelligence doesn't just need to have new skills; it needs to be able to acquire new skills, including skills that have never existed in history before, by developing them itself, just as humans do.
If a new video game comes out today, I'm able to play that game and develop a new skill that has never existed before.[1] I will probably get the hang of it in a few minutes, with a few attempts. That's general intelligence.
AlphaStar was not able to figure out how to play StarCraft using pure reinforcement learning. It just got stuck using its builders to attack the enemy, rather than figuring out how to use its builders to make buildings that produce units that attack. To figure out the basics of the game, it needed to do imitation learning on a very large dataset of human play. Then, after imitation learning, to get as good as it did, it needed to do an astronomical amount of self-play, around 60,000 years of playing StarCraft. That's not general intelligence. If, to acquire a skill, you need to copy a large dataset of human examples and then do millennia of training on automatically gradable, relatively short time horizon tasks (which often don't exist in the real world), that's something, and it's even something impressive, but it's not general intelligence.
Let's say you wanted to apply this kind of machine learning to AI R&D. The necessary conditions don't apply. You don't have a large dataset of human examples to train on. You don't have automatically gradable, relatively short time horizon tasks with which to do reinforcement learning. And if the tasks require real-world feedback and can't be simulated, you certainly don't have 60,000 years.
I like what the AI researcher François Chollet has to say about this topic in this video from 11:45 to 20:00. He draws the distinction between crystallized behaviours and fluid intelligence, between skills and the ability to learn skills. I think this is important. This is really what the whole topic of AGI is about.
Why have LLMs absorbed practically all text on philosophy, economics, political science, art, history, engineering, science, and so on and not come up with a single novel and correct idea of any note in any of these domains? They are not able to generalize enough to do so. They can generalize or interpolate a little bit beyond their training data, but not very much. It's that generalization ability (which is mostly missing in LLMs) that's the holy grail in AI research.
I'm claiming that they could approach an overall staff-to-vehicle ratio of 1:10 if the number of real-time helpers (who don't have to be engineers) and vehicles were dramatically scaled up, and that's enough for profitability.
There are two concepts here. One is remote human assistance, which Waymo calls fleet response. The other is Waymo's approach to the engineering problem. I was saying that I suspect Waymo's approach to the engineering problem doesn't scale. I think it probably relies on engineers doing too much special casing that doesn't generalize well when a modest amount of novelty is introduced. So, Waymo currently has something like 1,500 engineers to operate in the comparatively small geofenced areas where it currently operates. If it wanted to expand to a 10x larger driving area, would its techniques generalize to that larger area, or would it need to hire commensurately more engineers?
I suspect that Waymo faces the problem of trying to do far too much essentially by hand, just adding incremental fix after fix as problems arise. The ideal would be, instead, to apply machine learning techniques that can learn from data and generalize to new scenarios and new driving conditions. Unfortunately, current machine learning techniques do not seem to be up to that task.
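To make the stakes of that scaling question concrete, a minimal sketch contrasting the two hypotheses (the 1,500-engineer figure is the one above; everything else is assumed):

```python
# If Waymo's techniques generalize, engineering headcount stays roughly
# flat as the service area grows; if they rely on per-area special casing,
# headcount grows roughly linearly with area.
current_engineers = 1_500

for expansion in (10, 100, 1_000):  # multiples of today's geofenced area
    flat = current_engineers                # methods generalize
    linear = current_engineers * expansion  # effort scales with area
    print(f"{expansion:>5}x area: ~{flat:,} engineers if methods generalize, "
          f"~{linear:,} if effort scales with area")
```

Under the special-casing hypothesis, a 1,000x expansion implies on the order of 1.5 million engineers, which is the kind of headcount behind the "not enough engineers" worry upthread.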
The 2023 LessWrong survey gave a median of 2040 for the singularity, and 2030 for "By what year do you think AI will be able to do intellectual tasks that expert humans currently do?". The second question was ambiguous, and some people put it in the past.
Thank you. Well, that isn't surprising at all.

[1] Okay, well maybe the play testers and the game developers have developed the skill before me, but then at some point one of them had to be the first person in history to ever acquire the skill of playing that game.
Quoting myself:

So I do think that it is a vocal minority in EA and LW that have median timelines before 2030.
Now we have some data on AGI timelines for EA (though there were only 34 responses, so of course there could be a large sampling bias): about 15% expect it by 2030 or sooner.
But 47% (16 out of 34) put their median year no later than 2032 and 68% (23 out of 34) put their median year no later than 2035, so how significant a finding this is depends on how much you care about those extra 2-5 years, I guess.
Only 12% (4 out of 34) of respondents to the poll put their median year after 2050. So, overall, respondents overwhelmingly see relatively near-term AGI (within 25 years) as at least 50% likely.