So if plastic bags are not a valid reason to stop, it sounds like the Waymo would be at fault for the rear-end accident.
Here's what a California law firm blog says about it:
In California, the driver who hits you from behind (and their insurance company) is almost always responsible for paying for your damages. This is based on a legal concept known as a "presumption of fault," which assumes the rear driver was following too closely or not paying attention. Simply put, every driver has a duty by law to leave enough space to stop safely if the car in front of them brakes.
A Texas law firm blog says that if the lead driver stops abruptly for no reason, they could at most be found partially at fault, but not wholly at fault.
I agree other company failures are evidence for your point. I think Waymo is trying to scale up, and they are limited by the supply of cars at this point.
Thank you. Waymo could certainly deploy a lot more cars if supply of cars fitted with its hardware were the primary limitation. In 2018, Waymo and Jaguar announced a deal where Jaguar would produce up to 20,000 I-Pace hatchbacks for Waymo. The same year, Waymo and Chrysler announced a similar deal for up to 62,000 Pacifica minivans. It's 7 years later and Waymo's fleet is still only 2,000 vehicles. I would bet Waymo winds down operations before deploying 82,000 total vehicles.
Alphabet deciding to open up Waymo to external investment is an interesting signal, especially given that Alphabet has $98 billion in cash and short-term investments. This started in 2020 and is ongoing. The more optimistic explanation I heard is that Alphabet felt some pressure from employees to give them their equity-based compensation, and that required getting a valuation for Waymo, which required external investors. A more common explanation is simply that this is cost discipline; Alphabet is seeking to reduce its cash burn. But then that also means reducing Alphabet's own equity ownership of Waymo, and therefore its share of the future opportunity.
Something I had completely forgotten is that Waymo shut down its self-driving truck program in 2023. This is possibly a bad sign. It's interesting given that Aurora Innovation pivoted from autonomous cars to autonomous trucks, which I believe was on the theory that trucks would be easier. Anthony Levandowski also pivoted from cars to semi-trucks when he founded Pronto AI because semi-trucks were an easier problem, but then pivoted again to off-road dump trucks that haul rubble at mines and quarries (usually driving back and forth in a straight line over and over, in an area with no humans nearby).
...then that means you would only need one remote operator for 120 cars. I think that's pretty scalable.
I couldn't find any reliable information for Waymo, but I found a 2023 article where a Cruise spokesperson said there was one human doing remote assistance for every 15 to 20 autonomous vehicles. That article cites a New York Times article that says a human intervention was needed every 2.5 to 5 miles.
I agree that 50% is not realistic for many tasks. But they do plot some data for higher percent success: … Roughly, I think going from 50% to 99.9% would take you from 2 hours to 4 seconds. Not quite 0, but very bad!
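To spell out that back-of-envelope calculation: METR fits a logistic curve in log2 of task duration, so a 50% horizon plus a fitted slope pins down the horizon at every other success rate. A minimal sketch (the slope value here is illustrative; real fitted slopes vary by model):

```python
import math

def horizon_at(p, h50, beta):
    """Task duration (seconds) at which predicted success equals p,
    assuming a METR-style logistic fit in log2 of task duration:
    P(success) = sigmoid(beta * (log2(h50) - log2(t)))."""
    logit = math.log(p / (1 - p))
    return h50 * 2 ** (-logit / beta)

h50 = 2 * 3600  # a 2-hour 50% time horizon
beta = 0.6      # illustrative slope of the logistic fit

print(horizon_at(0.8, h50, beta))    # ~1,450 s: a ~24-minute 80% horizon
print(horizon_at(0.999, h50, beta))  # ~2.5 s: a few seconds at 99.9%
```

With a slope in this range, the 99.9% horizon lands in the single-digit seconds, which is where a "2 hours to 4 seconds" figure comes from.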
That's interesting, thanks. My other criticism, which is maybe not a criticism of METR's work itself but rather a criticism of how other people interpret it, is just how narrow these tasks are. To infer that being able to do longer and longer tasks implies rapid AGI progress seems like it would require several logical inference steps between the graph and that conclusion, and I don't think I've ever seen anyone spell out the logic. Which tasks you look at and how they are graded is another crucial thing to consider.
I don't understand the details of the coding, math, or question and answer (Q&A or QA) benchmarks, but is the "time" dimension of these not just LLMs producing larger outputs, i.e., using more tokens? And, if so, why would LLMs performing tasks that use more tokens, although it may well be a sign of LLM improvement, indicate anything about AGI progress?
So, many other tasks would seem way more interesting to me if we're trying to assess AGI progress, such as:
Picking stocks and beating the market
Coming up with a new (and correct) idea in science, technology, engineering, math, medicine, economics, social science, etc.
Playing a new video game that just came out with no special instructions and no walkthroughs
If you only choose the kinds of tasks that are least challenging for LLMs, and you don't choose the kinds of tasks that tend to confound LLMs but whose successful completion would indicate the sort of capabilities required for AGI, then aren't you just measuring LLM progress and not AGI progress?
This bugs me because I've seen people post the METR time horizon graph as if it just obviously indicates rapid progress toward AGI, but, to me, it obviously doesn't indicate that at all. I imagine if you asked AI researchers or cognitive scientists, you would get a lot of people agreeing that the graph doesn't indicate rapid progress toward AGI.
I mean, this is an amazing graph and a huge achievement for DeepMind, but it isn't evidence of rapid progress toward AGI:
It's just the progress of one AI system, AlphaStar, on one task, StarCraft.
You could make a similar graph for MuZero showing performance on 60 different tasks, namely, chess, shogi, and go, plus 57 Atari games. What does that show?
If you make a graph with a few kinds of tasks on it that LLMs are good at, like coding, math, and question and answer on automatically gradable benchmarks, what does that show? How do you logically connect that to AGI?
Incidentally, how good is o3 (or any other LLM) at chess, shogi, go, and Atari games? Or at StarCraft? If we're making progress toward artificial general intelligence, shouldn't one system be able to do all of these things?
DeepMind announced Gato in 2022, which tried to combine as many things as possible. But (correct me if I'm wrong) Gato was worse at all those things than models trained to do just one or a few of them. So, that's anti-general artificial intelligence.
I just see so much unclear reasoning when it comes to this sort of thing. The sort of objections I'm bringing up are not crazy complicated or esoteric. To me they just seem straightforward and logical. I imagine you would hear these kinds of objections or better ones if you asked some impartial AI researchers whether they thought the METR graph was strong evidence for rapid progress toward AGI, indicating AGI is likely within a decade. I'm sure somebody somewhere has stated these kinds of objections before. So, what gives?
The whole point of my post above is that the majority of AI experts think a) LLMs probably won't scale to AGI and b) AGI is probably at least 20 years away, if not much longer. So, why don't people in EA engage more with experts who think these things and ask them why they think that? I'm sure they could do a much better job coming up with objections than me. Then you can either accept the objections are right and change your mind or come up with a convincing reply to the objections. But to just not anticipate the objections or not talk to people who are knowledgeable enough to raise objections is very strange. What kind of research is that? That's so unrigorous! Where's the due diligence?
I think EA is in this weird, crazy filter bubble/echo chamber when it comes to AGI where if there's any expert consensus on AGI, it's against the assertions many people in EA make about AGI.[1] And if you try to point out the fairly obvious objections or problems with the arguments or evidence put forward, sometimes people can be incredibly scornful.[2] I think if people do this, they should just write in a permanent marker on their forehead "I have bad epistemic practices". Then they'll at least save people some time from trying to talk to them. Because personally attacking or insulting someone just for disagreeing with them (as most experts do, I might add!!) means they don't want to hear the evidence or reasoning that might change their mind. Or maybe they do want to, on some level, but, for whatever reason, they're sure acting like they don't.
I am trying to nudge in my very small way people in EA to apply more rigour to their arguments and evidence and to stimulate curiosity about what the dissenting views are. E.g., if you learn most AI experts disagree with you and you didn't know that before, that should make you curious about why most experts disagree, and should introduce at least a little doubt about your own views.
If anyone has suggestions on how to stimulate more curiosity about dissenting views on near-term AGI among people in EA, please leave a reply or message me.
[1] As I mentioned in the post, 76% of AI experts think it's unlikely or very unlikely that current AI methods will scale to AGI, but a comment from Steven Byrnes a while ago said that when he talked to "a couple dozen AI safety / alignment researchers at EAG bay area", he spoke to "a number of people, all quite new to the fields of AI and AI safety / alignment, for whom it seems to have never crossed their mind until they talked to me that maybe foundation models won't scale to AGI". That's bananas. That's a huge problem. How can the three-quarters majority view in a field not even be known as a concept to people in EA who are trying to work in that field?
[2] Some recent examples here (the comment) and here (the downvotes), which are comparatively mild, because I honestly can't stomach looking at the worse examples right now.
I couldn't find any reliable information for Waymo, but I found a 2023 article where a Cruise spokesperson said there was one human doing remote assistance for every 15 to 20 autonomous vehicles. That article cites a New York Times article that says a human intervention was needed every 2.5 to 5 miles.
That's very helpful! I'm guessing that since 2023, a person could manage more vehicles than that, and it will continue to improve. But let's just work with that number. I don't think the pay for a person solving problems for Waymos would need to be that much higher than taxi drivers', but even if it is 1.5x or 2x, that would still be an order of magnitude reduction in labor cost. Of course there is additional hardware cost for autonomous vehicles, but I think that can be paid off with a reasonable duty cycle. So then if you grant that the geofencing is less than a 20x safety advantage, I think there is an economic case for the chimeras, as you say.
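For what it's worth, here is the arithmetic behind that labor-cost claim as a minimal sketch (the driver wage is a placeholder figure; the 1:15–1:20 ratios are the 2023 Cruise figures above):

```python
# Rough per-vehicle-hour labor cost: a paid driver in every car versus one
# remote operator shared across many cars. The wage is a placeholder.
driver_wage = 20.0  # $/hour, hypothetical taxi-driver wage

for pay_multiple in (1.5, 2.0):         # operator paid 1.5-2x a driver
    for cars_per_operator in (15, 20):  # 2023 Cruise staffing ratios
        remote = driver_wage * pay_multiple / cars_per_operator
        print(f"{pay_multiple}x pay, 1:{cars_per_operator} -> "
              f"${remote:.2f}/vehicle-hour vs ${driver_wage:.2f}, "
              f"a {driver_wage / remote:.1f}x reduction")
```

Every combination comes out between roughly 7x and 13x, i.e., on the order of a 10x reduction in labor cost per vehicle.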
I don't understand the details of the coding, math, or question and answer (Q&A or QA) benchmarks, but is the "time" dimension of these not just LLMs producing larger outputs, i.e., using more tokens? And, if so, why would LLMs performing tasks that use more tokens, although it may well be a sign of LLM improvement, indicate anything about AGI progress?
My understanding is that 50% success on a 1-hour task means the human expert takes ~1 hour; the computer time is not specified.
As for AGI, I do think that some people treat the 50%-success time horizon on coding tasks reaching ~one month as AGI, when it really only means superhuman coder (I think something like a 50% success rate is appropriate for long coding tasks, because humans rarely produce code that works on the first try). However, once you get to a superhuman coder, the logic goes that progress in AI research will dramatically accelerate. In AI 2027, I think there was some assumption like it would take 90 years at current progress to get something like AGI, but when things accelerate, it ends up happening in a year or two. Another worry is that you might not need AGI for something catastrophic, because an AI could become superhuman in a limited number of tasks, like hacking.
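A toy version of that compression logic, with loudly assumed numbers (the 90-year figure is from the comment above; the 10x-per-year speedup is a pure assumption, not a figure from AI 2027):

```python
# Toy model: N "years of progress at today's rate" are needed, but the
# rate itself compounds as AI R&D is progressively automated.
needed = 90.0            # years of progress at today's rate
rate = 1.0               # progress per calendar year, initially
speedup_per_year = 10.0  # assumed compounding from automated AI R&D

progress = years = 0.0
dt = 0.001
while progress < needed:
    progress += rate * dt
    rate *= speedup_per_year ** dt  # continuous compounding
    years += dt

print(f"{needed:.0f} rate-years of progress in ~{years:.1f} calendar years")
# -> roughly 2.3 calendar years under these assumptions
```

The point is just that once the rate compounds, almost all the progress happens near the end, so "90 years of progress" stops meaning 90 calendar years.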
Incidentally, how good is o3 (or any other LLM) at chess, shogi, go, and Atari games? Or at StarCraft? If we're making progress toward artificial general intelligence, shouldn't one system be able to do all of these things?
You're right it's not fully general, but AI systems are much more general than they were 7 years ago.
I think EA is in this weird, crazy filter bubble/echo chamber when it comes to AGI where if there's any expert consensus on AGI, it's against the assertions many people in EA make about AGI.
LessWrong is also known for having very short timelines, but I think the last survey indicated median time for AGI was 2040. So I do think that it is a vocal minority in EA and LW that have median timelines before 2030. However, I do agree with them that we should be spending significant effort on AI safety because it's plausible that AGI comes in the next five years.
Though it's true that the expert surveys you cite have much longer timelines, I would guess that if you took a poll of people in the AI labs (which I think would qualify as a group of experts; biased though they may be, they do have inside information), their median timelines would be before 2030.
I don't think the pay for a person solving problems for Waymos would need to be that much higher than taxi drivers', but even if it is 1.5x or 2x, that would still be an order of magnitude reduction in labor cost. Of course there is additional hardware cost for autonomous vehicles, but I think that can be paid off with a reasonable duty cycle. So then if you grant that the geofencing is less than a 20x safety advantage, I think there is an economic case for the chimeras, as you say.
This could be true, but you have to account for other elements of the cost structure. For example, can you improve the ratio of engineers to autonomous vehicles from the current ratio of around 1:1 or 1:2 to something like 1:1000 or 1:10,000?
It seems like Waymo is using methods that scale with engineer labour rather than learning methods that scale with data and compute. So, deploying more vehicles in more areas would require commensurately more engineers, and, in addition to being too expensive, there simply are not enough engineers on Earth.
As for AGI, I do think that some people treat the 50%-success time horizon on coding tasks reaching ~one month as AGI, when it really only means superhuman coder (I think something like a 50% success rate is appropriate for long coding tasks, because humans rarely produce code that works on the first try).
It doesn't mean that, either. METR has found that current frontier AI systems are worse in real-world, practical use cases than not using AI at all.
Automatically gradable benchmarks generally don't seem to have much to do with the ability to do tasks in the real world. So, predicting real-world performance based on benchmark performance seems to just be an invalid inference.
Anecdotally, what I hear from people who say they find AI coding assistants useful is that it saves them the time it would take to copy and paste code from Stack Exchange. I have never heard anything along the lines of "it came up with a new idea" or "it was creative". Yet this is what human-level coding would require.
However, once you get to a superhuman coder, the logic goes that progress in AI research will dramatically accelerate. In AI 2027, I think there was some assumption like it would take 90 years at current progress to get something like AGI, but when things accelerate, it ends up happening in a year or two.
The obvious objection to this as it pertains to the initial advent of AGI, rather than to superintelligence, is that this is a chicken-and-egg problem. If you need AGI to do AGI R&D, AGI can't help you develop AGI because you haven't invented it yet. You would need a sub-human AI that can do tasks that speed up AI research and AI engineering. And that seems dubious. Is automating this kind of work not an AGI-level problem?
You're right it's not fully general, but AI systems are much more general than they were 7 years ago.
I don't know if I believe this. LLMs are impressive, but their scope is fairly narrow. They have memorized most of the important digital/digitized text that's available. Their next-token prediction, plus everything layered on top of it (fine-tuning to follow instructions, reinforcement learning from human feedback (RLHF), and chain of thought), results in some impressive behaviours. But they are extremely brittle. They routinely make errors on very basic tasks.
I think of LLMs as just another type of AI system that is proficient in one area, comparable to game-playing RL agents. LLMs are good at many text-related tasks (including math and coding, which are also text), but they aren't able to generalize beyond the text-related tasks they have massive amounts of training data for. They don't do well outside of text-related tasks, they don't do well with novelty, they frequently fail to reason properly, etc.
So, I'm not sure LLMs are all that much more general than previous systems like MuZero and AlphaZero.
Part of generality or generalization is that you should see positive transfer learning, i.e., having skills in some domains should improve the AI system's skills in other domains. But it seems like what we see is the opposite. That is, it seems like we see negative transfer learning. Training an AI on many diverse, heterogeneous tasks from multiple domains seems to hurt performance. That's narrowness, not generality.
LessWrong is also known for having very short timelines, but I think the last survey indicated median time for AGI was 2040. So I do think that it is a vocal minority in EA and LW that have median timelines before 2030.
That's very interesting, if you remember correctly. I would be interested in seeing survey data both for LessWrong and for EA.
If you need AGI to do AGI R&D, AGI can't help you develop AGI because you haven't invented it yet. You would need a sub-human AI that can do tasks that speed up AI research and AI engineering. And that seems dubious. Is automating this kind of work not an AGI-level problem?
So, I'm not sure LLMs are all that much more general than previous systems like MuZero and AlphaZero.
I don't think you can have it both ways: a superhuman coder (that is actually competent, which you don't think AI assistants are now) is relatively narrow AI, but would accelerate AI progress. A superhuman AI researcher is more general (which would drastically speed up AI progress), but is not fully general. I would argue that LLMs are already more general than the set of AI-researcher tasks (though LLMs are currently not good at all of those tasks), because LLMs can competently discuss philosophy, economics, political science, art, history, engineering, science, etc.
For example, can you improve the ratio of engineers to autonomous vehicles from the current ratio of around 1:1 or 1:2 to something like 1:1000 or 1:10,000?
I'm claiming that they could approach an overall staff-to-vehicle ratio of 1:10 if the number of real-time helpers (who don't have to be engineers) and vehicles were dramatically scaled up, and that's enough for profitability.
I would be interested in seeing survey data both for LessWrong and for EA.
The 2023 LessWrong survey gave a median of 2040 for the singularity, and 2030 for "By what year do you think AI will be able to do intellectual tasks that expert humans currently do?". The second question was ambiguous, and some people put it in the past. I haven't seen a similar survey result for EAs, but I expect longer timelines than LW.
I don't think you can have it both ways: a superhuman coder (that is actually competent, which you don't think AI assistants are now) is relatively narrow AI, but would accelerate AI progress. A superhuman AI researcher is more general (which would drastically speed up AI progress), but is not fully general.
I definitely disagree with this. Hopefully what I say below will explain why.
I would argue that LLMs are already more general than the set of AI-researcher tasks (though LLMs are currently not good at all of those tasks), because LLMs can competently discuss philosophy, economics, political science, art, history, engineering, science, etc.
The "general" in artificial general intelligence doesn't just refer to having a large repertoire of skills. Generality is about the ability to learn and to generalize beyond what a system has seen in its training data. An artificial general intelligence doesn't just need to have new skills; it needs to be able to acquire new skills, including skills that have never existed in history before, by developing them itself, just as humans do.
If a new video game comes out today, I'm able to play that game and develop a new skill that has never existed before.[1] I will probably get the hang of it in a few minutes, with a few attempts. That's general intelligence.
AlphaStar was not able to figure out how to play StarCraft using pure reinforcement learning. It just got stuck using its builders to attack the enemy, rather than figuring out how to use its builders to make buildings that produce units that attack. To figure out the basics of the game, it needed to do imitation learning on a very large dataset of human play. Then, after imitation learning, to get as good as it did, it needed to do an astronomical amount of self-play, around 60,000 years of playing StarCraft. That's not general intelligence. If, to acquire a skill, you need to copy a large dataset of human examples and then do millennia of training on automatically gradable, relatively short time horizon tasks (which often don't exist in the real world), that's something, and it's even something impressive, but it's not general intelligence.
Let's say you wanted to apply this kind of machine learning to AI R&D. The necessary conditions don't apply. You don't have a large dataset of human examples to train on. You don't have automatically gradable, relatively short time horizon tasks with which to do reinforcement learning. And if the tasks require real-world feedback and can't be simulated, you certainly don't have 60,000 years.
I like what the AI researcher François Chollet has to say about this topic in this video from 11:45 to 20:00. He draws the distinction between crystallized behaviours and fluid intelligence, between skills and the ability to learn skills. I think this is important. This is really what the whole topic of AGI is about.
Why have LLMs absorbed practically all text on philosophy, economics, political science, art, history, engineering, science, and so on and not come up with a single novel and correct idea of any note in any of these domains? They are not able to generalize enough to do so. They can generalize or interpolate a little bit beyond their training data, but not very much. It's that generalization ability (which is mostly missing in LLMs) that's the holy grail in AI research.
I'm claiming that they could approach an overall staff-to-vehicle ratio of 1:10 if the number of real-time helpers (who don't have to be engineers) and vehicles were dramatically scaled up, and that's enough for profitability.
There are two concepts here. One is remote human assistance, which Waymo calls fleet response. The other is Waymo's approach to the engineering problem. I was saying that I suspect Waymo's approach to the engineering problem doesn't scale. I think it probably relies on engineers doing too much special casing that doesn't generalize well when a modest amount of novelty is introduced. So, Waymo currently has something like 1,500 engineers to operate in the comparatively small geofenced areas where it currently operates. If it wanted to expand to a 10x larger driving area, would its techniques generalize to that larger area, or would it need to hire commensurately more engineers?
I suspect that Waymo faces the problem of trying to do far too much essentially by hand, just adding incremental fix after fix as problems arise. The ideal would be, instead, to apply machine learning techniques that can learn from data and generalize to new scenarios and new driving conditions. Unfortunately, current machine learning techniques do not seem to be up to that task.
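To make the stakes of that scaling question concrete, a minimal sketch contrasting the two hypotheses (the 1,500-engineer figure is the one above; everything else is assumed):

```python
# If Waymo's techniques generalize, engineering headcount stays roughly
# flat as the service area grows; if they rely on per-area special casing,
# headcount grows roughly linearly with area.
current_engineers = 1_500

for expansion in (10, 100, 1_000):  # multiples of today's geofenced area
    flat = current_engineers                # methods generalize
    linear = current_engineers * expansion  # effort scales with area
    print(f"{expansion:>5}x area: ~{flat:,} engineers if methods generalize, "
          f"~{linear:,} if effort scales with area")
```

Under the special-casing hypothesis, a 1,000x expansion implies on the order of 1.5 million engineers, which is the kind of headcount behind the "not enough engineers" worry upthread.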
The 2023 LessWrong survey gave a median of 2040 for the singularity, and 2030 for "By what year do you think AI will be able to do intellectual tasks that expert humans currently do?". The second question was ambiguous, and some people put it in the past.
Thank you. Well, that isn't surprising at all.

[1] Okay, well maybe the play testers and the game developers have developed the skill before me, but then at some point one of them had to be the first person in history to ever acquire the skill of playing that game.
Quoting myself:

So I do think that it is a vocal minority in EA and LW that have median timelines before 2030.
Now we have some data on AGI timelines for EA (though there were only 34 responses, so of course there could be a large sampling bias): about 15% expect it by 2030 or sooner.
But 47% (16 out of 34) put their median year no later than 2032 and 68% (23 out of 34) put their median year no later than 2035, so how significant a finding this is depends on how much you care about those extra 2-5 years, I guess.
Only 12% (4 out of 34) of respondents to the poll put their median year after 2050. So, overall, respondents overwhelmingly see relatively near-term AGI (within 25 years) as at least 50% likely.