On Deference and Yudkowsky’s AI Risk Estimates
Note: I mostly wrote this post after Eliezer Yudkowsky’s “Death with Dignity” essay appeared on LessWrong. Since then, Jotto has written a post that overlaps a bit with this one, which sparked an extended discussion in the comments. You may want to look at that discussion as well. See also here for another relevant discussion thread.
EDIT: See here for some post-discussion reflections on what I think this post got right and wrong.
Most people, when forming their own views on risks from misaligned AI, have some inclination to defer to others who they respect or think of as experts.
This is a reasonable thing to do, especially if you don’t yet know much about AI or haven’t yet spent much time scrutinizing the arguments. If someone you respect has spent years thinking about the subject, and believes the risk of catastrophe is very high, then you probably should take that information into account when forming your own views.
It’s understandable, then, if Eliezer Yudkowsky’s recent writing on AI risk helps to really freak some people out. Yudkowsky has probably spent more time thinking about AI risk than anyone else. Along with Nick Bostrom, he is the person most responsible for developing and popularizing these concerns. Yudkowsky has now begun to publicly express the view that misaligned AI has a virtually 100% chance of killing everyone on Earth—such that all we can hope to do is “die with dignity.”
The purpose of this post is, simply, to argue that people should be wary of deferring too much to Eliezer Yudkowsky when it comes to estimating AI risk. In particular, I think, they shouldn’t defer to him more than they would defer to anyone else who seems smart and has spent a reasonable amount of time thinking about AI risk.
The post highlights what I regard as some negative aspects of Yudkowsky’s track record, when it comes to technological risk forecasting. I think these examples suggest that (a) his track record is at best fairly mixed and (b) he has some tendency toward expressing dramatic views with excessive confidence. As a result, I don’t personally see a strong justification for giving his current confident and dramatic views about AI risk a great deal of weight.
I agree it’s highly worthwhile to read and reflect on Yudkowsky’s arguments. I also agree that potential risks from misaligned AI deserve serious attention—and are even, plausibly, more deserving of attention than any other existential risk. I just don’t think people should make too much of the fact that Yudkowsky believes we’re doomed.
Why write this post?
Before diving in, it may be worth saying a little more about why I hope this post might be useful. (Feel free to skip ahead if you’re not interested in this section.)
In brief, it matters what the existential risk community believes about the risk from misaligned AI. I think that excessively high credences in doom can lead to:
poor prioritization decisions (underprioritizing other risks, including other potential existential risks from AI)
poor community health (anxiety and alienation)
poor reputation (seeming irrational, cultish, or potentially even threatening), which in turn can lead to poor recruitment or retention of people working on important problems
My own impression is that, although it’s sensible to take potential risks from misaligned AI very seriously, a decent number of people are now more freaked out than they need to be. And I think that excessive deference to some highly visible intellectuals in this space, like Yudkowsky, may be playing an important role—either directly or through deference cascades. I’m especially concerned about new community members, who may be particularly inclined to defer to well-known figures and who may have particularly limited awareness of the diversity of views in this space. I’ve recently encountered some anecdotes I found worrisome.
Nothing I write in this post implies that people shouldn’t freak out, of course, since I’m mostly not engaging with the substance of the relevant arguments (although I have done this elsewhere, for instance here, here, and here). If people are going to freak out about AI risk, then I at least want to help make sure that they’re freaking out for sufficiently good reasons.
Yudkowsky’s track record: some cherry-picked examples
Here, I’ve collected a number of examples of Yudkowsky making (in my view) dramatic and overconfident predictions concerning risks from technology.
Note that this isn’t an attempt to provide a balanced overview of Yudkowsky’s technological predictions over the years. I’m specifically highlighting a number of predictions that I think are underappreciated and suggest a particular kind of bias.
Doing a more comprehensive overview, which doesn’t involve specifically selecting poor predictions, would surely give a more positive impression. Hopefully this biased sample is meaningful enough, however, to support the claim that Yudkowsky’s track record is at least pretty mixed.
Also, a quick caveat: Unfortunately, but understandably, Yudkowsky didn’t have time to review this post and correct any inaccuracies. In various places, I’m summarizing or giving impressions of lengthy pieces I haven’t fully read, or haven’t read in well more than a year, so there’s a decent chance that I’ve accidentally mischaracterized some of his views or arguments. Concretely: I think there’s something on the order of a 50% chance I’ll ultimately feel I should correct something below.
Fairly clearcut examples
1. Predicting near-term extinction from nanotech
At least up until 1999, admittedly when he was still only about 20 years old, Yudkowsky argued that transformative nanotechnology would probably emerge suddenly and soon (“no later than 2010”) and result in human extinction by default. My understanding is that this viewpoint was a substantial part of the justification for founding the institute that would become MIRI; the institute was initially focused on building AGI, since developing aligned superintelligence quickly enough was understood to be the only way to manage nanotech risk:
On the nanotechnology side, we possess machines capable of producing arbitrary DNA sequences, and we know how to turn arbitrary DNA sequences into arbitrary proteins (6). We have machines—Atomic Force Probes—that can put single atoms anywhere we like, and which have recently been demonstrated to be capable of forming atomic bonds. Hundredth-nanometer precision positioning, atomic-scale tweezers… the news just keeps on piling up…. If we had a time machine, 100K of information from the future could specify a protein that built a device that would give us nanotechnology overnight….
If you project on a graph the minimum size of the materials we can manipulate, it reaches the atomic level—nanotechnology—in I forget how many years (the page vanished), but I think around 2035. This, of course, was before the time of the Scanning Tunnelling Microscope and “IBM” spelled out in xenon atoms. For that matter, we now have the artificial atom (“You can make any kind of artificial atom—long, thin atoms and big, round atoms.”), which has in a sense obsoleted merely molecular nanotechnology—the surest sign that nanotech is just around the corner. I believe Drexler is now giving the ballpark figure of 2013. My own guess would be no later than 2010…
Above all, I would really, really like the Singularity to arrive before nanotechnology, given the virtual certainty of deliberate misuse—misuse of a purely material (and thus, amoral) ultratechnology, one powerful enough to destroy the planet. We cannot just sit back and wait….
Mitchell Porter calls it “The race between superweapons and superintelligence.” Human civilization will continue to change until we either create superintelligence, or wipe ourselves out. Those are the two stable states, the two “attractors”. It doesn’t matter how long it takes, or how many cycles of nanowar-and-regrowth occur before Transcendence or final extinction. If the system keeps changing, over a thousand years, or a million years, or a billion years, it will eventually wind up in one attractor or the other. But my best guess is that the issue will be settled now.
I should, once again, emphasize that Yudkowsky was around twenty when he did the final updates on this essay. In that sense, it might be unfair to bring this very old example up.
Nonetheless, I do think this case can be treated as informative, since: (a) the belief was closely analogous to his current belief about AI (a high outlier credence in near-term doom from an emerging technology); (b) he had already thought a lot about the subject and was highly engaged in the relevant intellectual community; (c) it’s not clear when he dropped the belief; and (d) twenty isn’t (in my view) actually all that young. I know a lot of people in their early twenties; I think their current work and styles of thought are likely to be predictive of their future work and styles of thought, even though I do of course expect the quality to go up over time.
2. Predicting that his team had a substantial chance of building AGI before 2010
In 2001, and possibly later, Yudkowsky apparently believed that his small team would be able to develop a “final stage AI” that would “reach transhumanity sometime between 2005 and 2020, probably around 2008 or 2010.”
In the first half of the 2000s, he produced a fair amount of technical and conceptual work related to this goal. This work hasn’t ultimately had much clear usefulness for AI development, and, partly on that basis, my impression is that it has not held up well—but that he was very confident in its value at the time.
The key points here are that:
Yudkowsky has previously held short AI timeline views that turned out to be wrong
Yudkowsky has previously held really confident inside views about the path to AGI that (at least seemingly) turned out to be wrong
More generally, Yudkowsky may have a track record of overestimating or overstating the quality of his insights into AI
Although I haven’t evaluated the work, my impression is that Yudkowsky was a key part of a Singularity Institute effort to develop a new programming language to use to create “seed AI.” He (or whoever was writing the description of the project) seems to have been substantially overconfident about its usefulness. From the section of the documentation titled “Foreword: Earth Needs Flare” (2001):
A new programming language has to be really good to survive. A new language needs to represent a quantum leap just to be in the game. Well, we’re going to be up-front about this: Flare is really good. There are concepts in Flare that have never been seen before. We expect to be able to solve problems in Flare that cannot realistically be solved in any other language. We expect that people who learn to read Flare will think about programming differently and solve problems in new ways, even if they never write a single line of Flare…. Flare was created under the auspices of the Singularity Institute for Artificial Intelligence, an organization created with the mission of building a computer program far before its time—a true Artificial Intelligence. Flare, the programming language they asked for to help achieve that goal, is not that far out of time, but it’s still a special language.
Coding a Transhuman AI
I haven’t read it, to my discredit, but “Coding a Transhuman AI 2.2” is another piece of technical writing by Yudkowsky that one could look at. The document is described as “the first serious attempt to design an AI which has the potential to become smarter than human,” and aims to “describe the principles, paradigms, cognitive architecture, and cognitive components needed to build a complete mind possessed of general intelligence.”
From a skim, I suspect there’s a good chance it hasn’t held up well—since I’m not aware of any promising later work that builds on it and since it doesn’t seem to have been written with the ML paradigm in mind—but can’t currently give an informed take.
Levels of Organization in General Intelligence
A later piece of work which I also haven’t properly read is “Levels of Organization in General Intelligence.” At least by 2005, going off of Yudkowsky’s post “So You Want to be a Seed AI Programmer,” it seems like he thought a variation of the framework in this paper would make it possible for a very small team at the Singularity Institute to create AGI:
There’s a tradeoff between the depth of AI theory, the amount of time it takes to implement the project, the number of people required, and how smart those people need to be. The AI theory we’re planning to use—not LOGI, LOGI’s successor—will save time and it means that the project may be able to get by with fewer people. But those few people will have to be brilliant…. The theory of AI is a lot easier than the practice, so if you can learn the practice at all, you should be able to pick up the theory on pretty much the first try. The current theory of AI I’m using is considerably deeper than what’s currently online in Levels of Organization in General Intelligence—so if you’ll be able to master the new theory at all, you shouldn’t have had trouble with LOGI. I know people who did comprehend LOGI on the first try; who can complete patterns and jump ahead in explanations and get everything right, who can rapidly fill in gaps from just a few hints, who still don’t have the level of ability needed to work on an AI project.
Somewhat disputable examples
I think of the previous two examples as predictions that resolved negatively. I’ll now cover a few predictions that we don’t yet know are wrong (e.g. predictions about the role of compute in developing AGI), but that I think we now have reason to regard as significantly overconfident.
3. Having high confidence that AI progress would be extremely discontinuous and localized and not require much compute
In his 2008 “FOOM debate” with Robin Hanson, Yudkowsky confidently staked out very extreme positions about what future AI progress would look like—without (in my view) offering strong justifications. The past decade of AI progress has also provided further evidence against the correctness of his core predictions.
A quote from the debate, describing the median development scenario he was imagining at the time:
When we try to visualize how all this is likely to go down, we tend to visualize a scenario that someone else once termed “a brain in a box in a basement.” I love that phrase, so I stole it. In other words, we tend to visualize that there’s this AI programming team, a lot like the sort of wannabe AI programming teams you see nowadays, trying to create artificial general intelligence, like the artificial general intelligence projects you see nowadays. They manage to acquire some new deep insights which, combined with published insights in the general scientific community, let them go down into their basement and work on it for a while and create an AI which is smart enough to reprogram itself, and then you get an intelligence explosion…. (p. 436)
The idea (as I understand it) was that AI progress would have very little impact on the world, then a small team of people with a very small amount of computing power would have some key insight, then they’d write some code for an AI system, then that system would rewrite its own code, and then it would shortly after take over the world.
When pressed by his debate partner, regarding the magnitude of the technological jump he was forecasting, Yudkowsky suggested that economic output could at least plausibly rise by twenty orders-of-magnitude within not much more than a week—once the AI system has developed relevant nanotechnologies (p. 400). To give a sense of how extreme that is: If you extrapolate twenty-orders-of-magnitude-per-week over the course of a year—although, of course, no one expected this rate to be maintained for anywhere close to a year—it is equivalent to an annual economic growth rate of (10^1000)%.
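For concreteness, here is the compounding arithmetic behind that extrapolation, as a minimal back-of-the-envelope sketch (my own calculation, not from the debate; the exact constants don't matter at this scale):

```python
# If output multiplied by 10^20 each week, sustained for a year,
# how many orders of magnitude of growth would that imply?

orders_per_week = 20
weeks_per_year = 52

# Total orders of magnitude over a year of sustained growth.
orders_per_year = orders_per_week * weeks_per_year  # 1040

# As an annual growth *rate* in percent: (10^1040 - 1) * 100%,
# which is roughly 10^1042 percent.
print(f"Annual growth factor: 10^{orders_per_year}")
print(f"Annual growth rate: roughly 10^{orders_per_year + 2} percent")
```

The precise result is a bit larger than the round (10^1000)% figure quoted in the text, but the point is the same: any number in this vicinity is unimaginably far outside historical growth rates, which is why the prediction demands unusually strong justification.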
I think it’s pretty clear that this viewpoint was heavily influenced by the reigning AI paradigm at the time, which was closer to traditional programming than machine learning. The emphasis on “coding” (as opposed to training) as the means of improvement, the assumption that large amounts of compute are unnecessary, etc. seem to follow from this. A large part of the debate was Yudkowsky arguing against Hanson, who thought that Yudkowsky was underrating the importance of compute and “content” (i.e. data) as drivers of AI progress. Although Hanson very clearly wasn’t envisioning something like deep learning either, his side of the argument seems to fit better with what AI progress has looked like over the past decade. In particular, huge amounts of compute and data have clearly been central to recent AI progress and are currently commonly thought to be central—or, at least, necessary—for future progress.
In my view, the pro-FOOM essays in the debate also just offered very weak justifications for thinking that a small number of insights could allow a small programming team, with a small amount of computing power, to abruptly jump the economic growth rate up by several orders of magnitude. The main reasons that stood out to me, from the debate, are these:
It requires less than a gigabyte to store someone’s genetic information on a computer (p. 444).
The brain “just doesn’t look all that complicated” in comparison to human-made pieces of technology such as computer operating systems (p. 444), on the basis of the principles that have been worked out by neuroscientists and cognitive scientists.
There is a large gap between the accomplishments of humans and chimpanzees, which Yudkowsky attributes to a small architectural improvement: “If we look at the world today, we find that taking a little bit out of the architecture produces something that is just not in the running as an ally or a competitor when it comes to doing cognitive labor….[T]here are no branches of science where chimpanzees do better because they have mostly the same architecture and more relevant content” (p. 448).
Although natural selection can be conceptualized as implementing a simple algorithm, it was nonetheless capable of creating the human mind.
I think that Yudkowsky’s prediction—that a small amount of code, run using only a small amount of computing power, was likely to abruptly jump economic output upward by more than a dozen orders-of-magnitude—was extreme enough to require very strong justifications. My view is that his justifications simply weren’t that strong. Given the way AI progress has looked over the past decade, his prediction also seems very likely to resolve negatively.
4. Treating early AI risk arguments as close to decisive
In my view, the arguments for AI risk that Yudkowsky had developed by the early 2010s had a lot of very important gaps. They were suggestive of a real risk, but were still far from worked out enough to justify very high credences in extinction from misaligned AI. Nonetheless, Yudkowsky recalls that his credence in doom was “around the 50% range” at the time, and his public writing tended to suggest that he saw the arguments as very tight and decisive.
These slides summarize what I see as gaps in the AI risk argument that appear in Yudkowsky’s essays/papers and in Superintelligence, which presents somewhat fleshed out and tweaked versions of Yudkowsky’s arguments. This podcast episode covers most of the same points. (Note that almost none of these objections I walk through are entirely original to me.)
You can judge for yourself whether these criticisms of his arguments are fair. If they seem unfair to you, then, of course, you should disregard this as an illustration of an overconfident prediction. One additional piece of evidence, though, is that his arguments focused on a fairly specific catastrophe scenario that most researchers now assign less weight to than they did when they first entered the field.
For instance, the classic arguments treated an extremely sudden “AI takeoff” as a central premise. Arguably, fast takeoff was the central premise, since presentations of the risk often began by establishing that there is likely to be a fast take-off (and thus an opportunity for a decisive strategic advantage) and then built the remainder of the argument on top of this foundation. However, many people in the field have now moved away from finding sudden take-off arguments compelling (e.g. for the kinds of reasons discussed here and here).
My point, here, is not necessarily that Yudkowsky was wrong, but rather that he held a much higher credence in existential risk from AI than his arguments justified at the time. The arguments had pretty crucial gaps that still needed to be resolved, but, I believe, his public writing tended to suggest that these arguments were tight and sufficient to justify very high credences in doom.
5. Treating “coherence arguments” as forceful
In the mid-2010s, some arguments for AI risk began to lean heavily on “coherence arguments” (i.e. arguments that draw implications from the von Neumann-Morgenstern utility theorem) to support the case for AI risk. See, for instance, this introduction to AI risk from 2016, by Yudkowsky, which places a coherence argument front and center as a foundation for the rest of the presentation. I think it’s probably fair to guess that the introduction-to-AI-risk talk that Yudkowsky was giving in 2016 contained what he regarded as the strongest concise arguments available.
However, later analysis has suggested that coherence arguments have either no or very limited implications for how we should expect future AI systems to behave. See Rohin Shah’s (I think correct) objection to the use of “coherence arguments” to support AI risk concerns. See also similar objections by Richard Ngo and Eric Drexler (Section 6.4).
Unfortunately, this is another case where the significance of this example depends on how much validity you assign to a given critique. In my view, the critique is strong. However, I’m unsure what portion of alignment researchers currently agree with me. I do know of at least one prominent researcher who was convinced by it; people also don’t seem to make coherence arguments very often anymore, which perhaps suggests that the critiques have gotten traction. However, if you have the time and energy, you should reflect on the critiques for yourself.
If the critique is valid, then this would be another example of Yudkowsky significantly overestimating the strength of an argument for AI risk.
[[EDIT: See here for a useful clarification by Rohin.]]
A somewhat meta example
6. Not acknowledging his mixed track record
So far as I know, although I certainly haven’t read all of his writing, Yudkowsky has never (at least publicly) seemed to take into account the mixed track record outlined above—including the relatively unambiguous misses.
He has written about mistakes from early on in his intellectual life (particularly pre-2003) and has, on this basis, even made a blanket statement disavowing his pre-2003 work. However, based on my memory and a quick re-read/re-skim, this writing is an exploration of why it took him a long time to become extremely concerned about existential risks from misaligned AI. For instance, the main issue it discusses with his plans to build AGI is that these plans didn’t take into account the difficulty and importance of ensuring alignment. This writing isn’t, I think, an exploration or acknowledgement of the kinds of mistakes I’ve listed in this post.
The fact he seemingly hasn’t taken these mistakes into account—and, if anything, tends to write in a way that suggests he holds a very high opinion of his technological forecasting track record—leads me to trust his current judgments less than I otherwise would.
To be clear, Yudkowsky isn’t asking other people to defer to him. He’s spent a huge amount of time outlining his views (allowing people to evaluate them on their merits) and has often expressed concerns about excessive epistemic deference.
A better, but still far-from-optimal approach to deference might be to give a lot of weight to the “average” view within the pool of smart people who have spent a reasonable amount of time thinking about AI risk. This still isn’t great, though, since different people do deserve different amounts of weight, and since there’s at least some reason to think that selection effects might bias this pool toward overestimating the level of risk.
It might be worth emphasizing that I’m not making any claim about the relative quality of my own track record.
To say something concrete about my current views on misalignment risk: I’m currently inclined to assign a low-to-mid-single-digits probability to existential risk from misaligned AI this century, with a lot of volatility in my views. This is, of course, in some sense still extremely high!
I think that expressing extremely high credences in existential risk (without sufficiently strong and clear justification) can also lead some people to simply dismiss the concerns. It is often easier to be taken seriously, when talking about strange and extreme things, if you express significant uncertainty. Importantly, I don’t think this means that people should ever misrepresent their levels of concern about existential risks; dishonesty seems like a really bad and corrosive policy. Still, this is one extra reason to think that it can be important to avoid overestimating risks.
Yudkowsky is obviously a pretty polarizing figure. I’d also say that some people are probably too dismissive of him, for example because they assign too much significance to his lack of traditional credentials. But it also seems clear that many people are inclined to give Yudkowsky’s views a great deal of weight. I’ve even encountered the idea that Yudkowsky is virtually the only person capable of thinking about alignment risk clearly.
I think that cherry-picking examples from someone’s forecasting track record is normally bad to do, even if you flag that you’re engaged in cherry-picking. However, I do think (or at least hope) that it’s fair in cases where someone already has a very high level of respect and frequently draws attention to their own successful predictions.
I don’t mean to suggest that the specific twenty orders-of-magnitude of growth figure was the result of deep reflection or was Yudkowsky’s median estimate. Here is the specific quote, in response to Hanson raising the twenty orders-of-magnitude-in-a-week number: “Twenty orders of magnitude in a week doesn’t sound right, unless you’re talking about the tail end after the AI gets nanotechnology. Figure more like some number of years to push the AI up to a critical point, two to six orders of magnitude improvement from there to nanotech, then some more orders of magnitude after that.” I think that my general point, that this is a very extreme prediction, stays the same even if we lower the number to ten orders-of-magnitude and assume that there will be a bit of a lag between the ‘critical point’ and the development of the relevant nanotechnology.
As an example of a failed prediction or piece of analysis on the other side of the FOOM debate, Hanson praised the CYC project—which lies far afield of the current deep learning paradigm and now looks like a clear dead end.
Yudkowsky also provides a number of arguments in favor of the view that the human mind can be massively improved upon. I think these arguments are mostly right. However, I think, they don’t have any very strong implications for the question of whether AI progress will be compute-intensive, sudden, or localized.
To probe just the relevance of this one piece of evidence, specifically, let’s suppose that it’s appropriate to use the length of a person’s genome in bits of information as an upper bound on the minimum amount of code required to produce a system that shares their cognitive abilities (excluding code associated with digital environments). This would imply that it is in principle possible to train an ML model that can do anything a given person can do, using something on the order of 10 million lines of code. But even if we accept this hypothesis—which seems quite plausible to me—it doesn’t seem to me like this implies much about the relative contributions of architecture and compute to AI progress or the extent to which progress in architecture design is driven by “deep insights.” For example, why couldn’t it be true that it is possible to develop a human-equivalent system using fewer than 10 million lines of code and also true that computing power (rather than insight) is the main bottleneck to developing such a system?
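The figures in the paragraph above can be reconstructed with standard ballpark numbers (my own arithmetic; the 75-bytes-per-line assumption, in particular, is just an illustrative round figure, not something from the debate):

```python
# The human genome has roughly 3 billion base pairs, each encoding
# 2 bits of information — this is where the "less than a gigabyte"
# bound comes from.
base_pairs = 3_000_000_000
bits = base_pairs * 2          # 6 billion bits
genome_bytes = bits // 8       # 750 million bytes, i.e. ~750 MB

assert genome_bytes < 10**9    # comfortably under a gigabyte

# If a line of source code carries very roughly 75 bytes (600 bits),
# the same information budget corresponds to roughly 10 million lines.
bits_per_line = 600            # illustrative assumption
lines_of_code = bits // bits_per_line

print(f"~{genome_bytes / 1e6:.0f} MB, ~{lines_of_code:,} lines of code")
```

Note that this only bounds the *description length* of a human-equivalent system; as argued above, it says nothing by itself about how much compute, data, or insight is needed to find that description.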
Two caveats regarding my discussion of the FOOM debate:
First, I should emphasize that, although I think Yudkowsky’s arguments were weak when it came to the central hypothesis being debated, his views were in some other regards more reasonable than his debate partner’s. See here for comments by Paul Christiano on how well various views Yudkowsky expressed in the FOOM debate have held up.
Second, it’s been a few years since I’ve read the FOOM debate—and there’s a lot in there (the book version of it is 741 pages long)—so I wouldn’t be surprised if my high-level characterization of Yudkowsky’s arguments is importantly misleading. My characterization here is based on some rough notes I took the last time I read it.
For example, it may be possible to construct very strong arguments for AI risk that don’t rely on the fast take-off assumption. However, in practice, I think it’s fair to say that the classic arguments did rely on this assumption. If the assumption wasn’t actually very justified, then, I think, it seems to follow that having a very high credence in AI risk also wasn’t justified at the time.
Here’s another example of an argument that’s risen to prominence in the past few years, and plays an important role in some presentations of AI risk, that I now suspect simply might not work. This argument shows up, for example, in Yudkowsky’s recent post “AGI Ruin: A List of Lethalities,” at the top of the section outlining “central difficulties.”