Disentangling arguments for the importance of AI safety

richard_ngo 23 Jan 2019 14:58 UTC

63 points

AI alignment Disentanglement research AI safety

I recently attended the 2019 Beneficial AGI conference organised by the Future of Life Institute. I’ll publish a more complete write-up later, but I was particularly struck by how varied attendees’ reasons for considering AI safety important were. Before this, I’d observed a few different lines of thought, but interpreted them as different facets of the same idea. Now, though, I’ve identified at least 6 distinct serious arguments for why AI safety is a priority. By distinct I mean that you can believe any one of them without believing any of the others—although of course the particular categorisation I use is rather subjective, and there’s a significant amount of overlap. In this post I give a brief overview of my own interpretation of each argument (note that I don’t necessarily endorse them myself). They are listed roughly from most specific and actionable to most general. I finish with some thoughts on what to make of this unexpected proliferation of arguments. Primarily, I think it increases the importance of clarifying and debating the core ideas in AI safety.

1. Maximisers are dangerous. Superintelligent AGI will behave as if it’s maximising the expectation of some utility function, since doing otherwise can be shown to be irrational. Yet we can’t write down a utility function which precisely describes human values, and optimising very hard for any other function will lead to that AI rapidly seizing control (as a convergent instrumental subgoal) and building a future which contains very little of what we value (because of Goodhart’s law and the complexity and fragility of values). We won’t have a chance to notice and correct misalignment because an AI which has exceeded human level will improve its intelligence very quickly (either by recursive self-improvement or by scaling up its hardware), and then prevent us from modifying it or shutting it down.
   a. This was the main thesis advanced by Yudkowsky and Bostrom when founding the field of AI safety. Here I’ve tried to convey the original line of argument, although some parts of it have been strongly critiqued since then. In particular, Drexler and Shah have disputed the relevance of expected utility maximisation (the latter suggesting the concept of goal-directedness as a replacement), while Hanson and Christiano disagree that AI intelligence will increase in a very fast and discontinuous way.
   b. Most of the arguments in this post originate from or build on this one in some way. This is particularly true of the next two arguments; nevertheless, I think that there’s enough of a shift in focus in each to warrant separate listings.
2. The target loading problem. Even if we knew exactly what we wanted a superintelligent agent to do, we don’t currently know (even in theory) how to make an agent which actually tries to do that. In other words, if we were to create a superintelligent AGI before solving this problem, the goals we would ascribe to that AGI (by taking the intentional stance towards it) would not be the ones we had intended to give it. As a motivating example, evolution selected humans for their genetic fitness, yet humans have goals which are very different from just spreading their genes. In a machine learning context, while we can specify a finite number of data points and their rewards, neural networks may then extrapolate from these rewards in non-humanlike ways.
   a. This is a more general version of the “inner optimiser problem”, and I think it captures the main thrust of the latter while avoiding the difficulties of defining what actually counts as an “optimiser”. I’m grateful to Nate Soares for explaining the distinction, and arguing for the importance of this problem.
3. The prosaic alignment problem. It is plausible that we build “prosaic AGI”, which replicates human behaviour without requiring breakthroughs in our understanding of intelligence. Shortly after they reach human level (or possibly even before), such AIs will become the world’s dominant economic actors. They will quickly come to control the most important corporations, earn most of the money, and wield enough political influence that we will be unable to coordinate to place limits on their use. Due to economic pressures, corporations or nations that slow down AI development and deployment in order to focus on aligning their AI more closely with their values will be outcompeted. As AIs exceed human-level intelligence, their decisions will become too complex for humans to understand or provide feedback on (unless we develop new techniques for doing so), and eventually we will no longer be able to correct the divergences between their values and ours. Thus the majority of the resources in the far future will be controlled by AIs which don’t prioritise human values. This argument was explained in this blog post by Paul Christiano.
   a. More generally, aligning multiple agents with multiple humans is much harder than aligning one agent with one human, because value differences might lead to competition and conflict even between agents that are each fully aligned with some humans. (As my own speculation, it’s also possible that having multiple agents would increase the difficulty of single-agent alignment: for example, the question “what would humans want if I didn’t manipulate them” would no longer track our values if we would counterfactually be manipulated by a different agent.)
4. The human safety problem. This line of argument (which Wei Dai has recently highlighted) claims that no human is “safe” in the sense that giving them absolute power would produce good futures for humanity in the long term, and therefore that building AI which extrapolates and implements the values of even a very altruistic human is insufficient. A prosaic version of this argument emphasises the corrupting effect of power, and the fact that morality is deeply intertwined with social signalling. However, I think there’s a stronger and more subtle version. In everyday life it makes sense to model humans as mostly rational agents pursuing their goals and values. However, this abstraction breaks down badly in more extreme cases (e.g. addictive superstimuli, unusual moral predicaments), implying that human values are somewhat incoherent. One such extreme case is running my brain for a billion years, after which it seems very likely that my values will have shifted or distorted radically, in a way that my original self wouldn’t endorse. Yet if we want a good future, this is the process which we require to go well: a human (or a succession of humans) needs to maintain broadly acceptable and coherent values for astronomically long time periods.
   a. An obvious response is that we shouldn’t entrust the future to one human, but rather to some group of humans following a set of decision-making procedures. However, I don’t think any currently-known institution is actually much safer than individuals over the sort of timeframes we’re talking about. Presumably a committee of several individuals would have lower variance than just one, but as that committee grows you start running into well-known problems with democracy. And while democracy isn’t a bad system, it seems unlikely to be robust on the timeframe of millennia or longer. (Alex Zhu has made the interesting argument that the problem of an individual maintaining coherent values is roughly isomorphic to the problem of a civilisation doing so, since both are complex systems composed of individual “modules” which often want different things.)
   b. While AGI amplifies the human safety problem, it may also help solve it if we can use it to decrease the value drift that would otherwise occur. Also, while it’s possible that we need to solve this problem in conjunction with other AI safety problems, it might be postponable until after we’ve achieved civilisational stability.
   c. Note that I use “broadly acceptable values” rather than “our own values”, because it’s very unclear to me which types or extent of value evolution we should be okay with. Nevertheless, there are some values which we definitely find unacceptable (e.g. having a very narrow moral circle, or wanting your enemies to suffer as much as possible) and I’m not confident that we’ll avoid drifting into them by default.
5. Misuse and vulnerabilities. These might be catastrophic even if AGI always carries out our intentions to the best of its ability:
   a. AI which is superhuman at science and engineering R&D will be able to invent very destructive weapons much faster than humans can. Humans may well be irrational or malicious enough to use such weapons even when doing so would lead to our extinction, especially if they’re invented before we improve our global coordination mechanisms. It’s also possible that we invent some technology which destroys us unexpectedly, either through bad luck or carelessness. For more on the dangers from technological progress in general, see Bostrom’s paper on the vulnerable world hypothesis.
   b. AI could be used to disrupt political structures, for example via unprecedentedly effective psychological manipulation. In an extreme case, it could be used to establish very stable totalitarianism, with automated surveillance and enforcement mechanisms ensuring an unshakeable monopoly on power for leaders.
   c. AI could be used for large-scale projects (e.g. climate engineering to prevent global warming, or managing the colonisation of the galaxy) without sufficient oversight or verification of robustness. Software or hardware bugs might then induce the AI to make unintentional yet catastrophic mistakes.
   d. People could use AIs to hack critical infrastructure (including the other AIs which manage the aforementioned large-scale projects). In addition to exploiting standard security vulnerabilities, hackers might induce mistakes using adversarial examples or “data poisoning”.
6. Argument from large impacts. Even if we’re very uncertain about what AGI development and deployment will look like, it seems likely that AGI will have a very large impact on the world in general, and that further investigation into how to direct that impact could prove very valuable.
   a. Weak version: development of AGI will be at least as big an economic jump as the industrial revolution, and therefore affect the trajectory of the long-term future. See Ben Garfinkel’s talk at EA Global London 2018 (which I’ll link when it’s available online). Ben noted that to consider work on AI safety important, we also need to believe the additional claim that there are feasible ways to positively influence the long-term effects of AI development, something which may not have been true for the industrial revolution. (Personally my guess is that since AI development will happen more quickly than the industrial revolution, power will be more concentrated during the transition period, and so influencing its long-term effects will be more tractable.)
   b. Strong version: development of AGI will make humans the second most intelligent species on the planet. Given that it was our intelligence which allowed us to control the world to the large extent that we do, we should expect that entities which are much more intelligent than us will end up controlling our future, unless there are reliable and feasible ways to prevent it. So far we have not discovered any.
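The Goodhart’s-law step in the “Maximisers are dangerous” argument can be illustrated with a toy model. This is only a sketch under invented assumptions: the functions `true_value`, `proxy` and `optimise`, the fixed budget, and the choice of `min` as the thing we care about are all hypothetical and do not come from the arguments themselves. It shows one narrow point: when an optimiser is handed a measurable objective that omits part of what we value, more optimisation pressure can raise the measured score while lowering the true one.

```python
# Toy Goodhart's-law illustration. Every name and functional form here is a
# hypothetical assumption made for this sketch.

def true_value(x: float, y: float) -> float:
    """What we actually care about: we need both goods, so the scarcer one binds."""
    return min(x, y)


def proxy(x: float, y: float) -> float:
    """The objective given to the optimiser: it only measures x."""
    return x


def optimise(steps: int, budget: float = 10.0, step_size: float = 0.5):
    """Greedily shift a fixed budget from y to x whenever that raises the proxy."""
    x, y = budget / 2, budget / 2
    for _ in range(steps):
        can_shift = y >= step_size
        if can_shift and proxy(x + step_size, y - step_size) > proxy(x, y):
            x, y = x + step_size, y - step_size
    return x, y


if __name__ == "__main__":
    # More steps stand in for more optimisation pressure.
    for steps in (0, 2, 5, 10, 100):
        x, y = optimise(steps)
        print(f"steps={steps:3d}  proxy={proxy(x, y):4.1f}  true={true_value(x, y):4.1f}")
```

In this sketch the proxy score climbs with every step while the true value falls to zero once the whole budget has been moved into the measured quantity. The model says nothing about how likely such a mismatch is in real systems; it only makes the shape of the worry concrete.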

What should we think about the fact that there are so many arguments for the same conclusion? As a general rule, the more arguments support a statement, the more likely it is to be true. However, I’m inclined to believe that quality matters much more than quantity—it’s easy to make up weak arguments, but you only need one strong one to outweigh all of them. And this proliferation of arguments is (weak) evidence against their quality: if the conclusions of a field remain the same but the reasons given for holding those conclusions change, that’s a warning sign for motivated cognition (especially when those beliefs are considered socially important). This problem is exacerbated by a lack of clarity about which assumptions and conclusions are shared between arguments, and which aren’t.
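The quantity-versus-quality point can be made concrete with odds-form Bayesian updating. The sketch below is illustrative only: the prior of 0.10 and the likelihood ratios (1.2 for a “weak” argument, 10 for a “strong” one) are made-up numbers, and the helper `posterior` is a hypothetical name. It also assumes the arguments are independent, which is exactly what overlapping arguments violate, so the figure for the six weak arguments is, if anything, an overestimate.

```python
import math
from typing import Sequence


def posterior(prior: float, likelihood_ratios: Sequence[float]) -> float:
    """Update a prior probability by a sequence of likelihood ratios.

    Assumes the arguments are conditionally independent, so their
    likelihood ratios simply multiply.
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * math.prod(likelihood_ratios)
    return posterior_odds / (1 + posterior_odds)


PRIOR = 0.10            # hypothetical starting credence
SIX_WEAK = [1.2] * 6    # hypothetical: six weak arguments
ONE_STRONG = [10.0]     # hypothetical: one strong argument

p_weak = posterior(PRIOR, SIX_WEAK)
p_strong = posterior(PRIOR, ONE_STRONG)

if __name__ == "__main__":
    print(f"six weak arguments:  {p_weak:.2f}")
    print(f"one strong argument: {p_strong:.2f}")
```

With these invented numbers, the single strong argument moves the credence further than all six weak ones combined, and any overlap between the weak arguments would shrink their combined effect further.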

On the other hand, superintelligent AGI is a very complicated topic, and so perhaps it’s natural that there are many different lines of thought. One way to put this in perspective (which I credit to Beth Barnes) is to think about the arguments which might have been given for worrying about nuclear weapons, before they had been developed. Off the top of my head, there are at least four:

1. They might be used deliberately.
2. They might be set off accidentally.
3. They might cause a nuclear chain reaction much larger than anticipated.
4. They might destabilise politics, either domestically or internationally.

And there are probably more which would have been credible at the time, but which seem silly now due to hindsight bias. So if there’d been an active anti-nuclear movement in the 1930s or early 1940s, the motivations of its members might well have been as disparate as those of AI safety advocates today. Yet the overall concern would have been (and still is) totally valid and reasonable.

I think the main takeaway from this post is that the AI safety community as a whole is still confused about the very problem we are facing. The only way to dissolve this tangle is to have more communication and clarification of the fundamental ideas in AI safety, particularly in the form of writing which is made widely available. And while it would be great to have AI safety researchers explaining their perspectives more often, I think there is still a lot of explicatory work which can be done regardless of technical background. In addition to analysis of the arguments discussed in this post, I think it would be particularly useful to see more descriptions of deployment scenarios and corresponding threat models. It would also be valuable for research agendas to highlight which problem they are addressing, and the assumptions they require to succeed.

This post has benefited greatly from feedback from Rohin Shah, Alex Zhu, Beth Barnes, Adam Marblestone, Toby Ord, and the DeepMind safety team. Also see the discussion which has taken place on LessWrong. All opinions are my own.

What links here?

Magnus Vinding's comment on Forecasting Transformative AI: Are we “trending toward” transformative AI? (How would we know?) by Holden Karnofsky (30 Aug 2021 12:09 UTC; 11 points)


Aaron Gertler 🔸 29 Jan 2019 1:33 UTC
9 points
Strong upvote. This is exactly the kind of post I’d like to see more often on the Forum: It summarizes many different points of view without trying to persuade anyone, points out some core areas of agreement, and names people who seem to believe different things (perhaps opening lines for productive discussion in the process). Work like this will be critical for EA’s future intellectual progress.
David_Moss 23 Jan 2019 17:34 UTC
4 points

“And this proliferation of arguments is (weak) evidence against their quality: if the conclusions of a field remain the same but the reasons given for holding those conclusions change, that’s a warning sign for motivated cognition (especially when those beliefs are considered socially important).”

I’m not sure these considerations should be too concerning in this case for a couple of reasons.

I agree that it’s concerning where “conclusions… remain the same but the reasons given for holding those conclusions change” in cases where people originally (putatively) believe p because of x, then x is shown to be a weak consideration and so they switch to citing y as a reason to believe p. But from your post it doesn’t seem like that’s necessarily what has happened, rather than a conclusion being overdetermined by multiple lines of evidence. Of course, particular people in the field may have switched between some of these reasons, having decided that some of them are not so compelling. But in the case of many of the reasons cited above, the differences between the positions seem sufficiently subtle that we should expect cases of people clarifying their own understanding by shifting to closely related positions (e.g. it seems plausible someone might reasonably switch from thinking that the main problem is knowing how to precisely describe what we value to thinking that the main problem is not knowing how to make an agent try to do that).

It also seems like a proliferation of arguments in favour of a position is not too concerning where there are plausible reasons why we should expect several of the considerations to apply simultaneously. For example, you might think that any kind of powerful agent typically presents a threat in multiple different ways, in which case it wouldn’t be suspicious if people cited multiple distinct considerations as to why they were important.
- richard_ngo 24 Jan 2019 1:09 UTC
  2 points
  I agree that it’s not too concerning, which is why I consider it weak evidence. Nevertheless, there are some changes which don’t fit the patterns you described. For example, it seems to me that newer AI safety researchers tend to consider intelligence explosions less likely, despite them being a key component of argument 1. For more details along these lines, check out the exchange between me and Wei Dai in the comments on the version of this post on the alignment forum.
- cole_haus 23 Jan 2019 20:49 UTC
  2 points
  Agreed. I think these reasons seem to fit fairly easily into the following schema: Each of A, B, C, and D is necessary for a good outcome. Different people focus on failures of A, failures of B, etc. depending on which necessary criterion seems to them most difficult to satisfy and most salient.
DavidRooke 23 Jan 2019 16:38 UTC
2 points
Hi Richard, really interesting! However, I think all 6 of your reasons still treat AGI as an independent agent. What do you think of Drexler’s reframing of AGI as a comprehensive set of services (https://www.fhi.ox.ac.uk/reframing/)? To me this makes the problem much more tractable and better aligns with how we see things actually progressing.
- PeterMcCluskey 24 Jan 2019 5:58 UTC
  3 points
  Drexler would disagree with some of Richard’s phrasing, but he seems to agree that most (possibly all) of (somewhat modified versions of) those 6 reasons should cause us to be somewhat worried. In particular, he’s pretty clear that powerful utility maximisers are possible and would be dangerous.
  - DavidRooke 24 Jan 2019 8:40 UTC
    1 point
    Yes—we have increasingly powerful utility maximisers already and they are in many applications increasingly dangerous.
- Rohin Shah 23 Jan 2019 17:42 UTC
  3 points
  I think 4, 5 and 6 are all valid even if you take the CAIS view. Could you explain how you think those depend on the AGI being an independent agent?
  Plausibly 2 and 3 also apply to CAIS, though those are more ambiguous.
  - DavidRooke 23 Jan 2019 20:49 UTC
    1 point
    6 describes the AGI as a “species”: services are not a species, agents are a species. 4 and 5 as written describe the AGI as an agent; once the AGI is described as an “it” that is doing something, that certainly sounds like an independent agent to me. A service and an agent are fundamentally different in nature; they are not just a different view, as the outcome would depend on the objectives of the instructing agent.
    - richard_ngo 24 Jan 2019 1:19 UTC
      1 point
      I’ve actually spent a fair while thinking about CAIS, and written up my thoughts here. Overall I’m skeptical about the framework, but if it turns out to be accurate I think that would heavily mitigate arguments 1 and 2, somewhat mitigate 3, and not affect the others very much. Insofar as 4 and 5 describe AGI as an agent, that’s mostly because it’s linguistically natural to do so—I’ve now edited some of those phrases. 6b does describe AI as a species, but it’s unclear whether that conflicts with CAIS, insofar as the claim that AI will never be agentlike is a very strong one, and I’m not sure whether Drexler makes it explicitly (I discuss this point in the blog post I linked above).
      - DavidRooke 24 Jan 2019 9:55 UTC
        2 points
        I do not agree with being “skeptical about the framework”. Indeed, it seems a useful model for how we as humans are. We become expert to varying degrees at a range of tasks or services through training: as we get in a car we turn on our “driving services” module (and sub-modules), for example. And then, underlying and separate from these, we have our unconscious, which drives the majority of our motivations as a “free agent”: our mammalian brain, which drives our socialising and norming actions, and underneath that our limbic brain, which deals with emotions like fear and status, which in my experience are the things that “move the money” if they are encouraged.
        It does not seem to me we are particularly “generally intelligent”. Put in a completely unfamiliar setting without all the tools that now prop us up, we will struggle far more than a species already familiar in that environment.
        The intelligent agent approach, to me, takes the debate in the wrong direction and, most concerningly, dramatically understates the near and present danger of utility-maximising services (“this is not superintelligence”), such as this example discussed by Yuval Noah Harari and Tristan Harris.
        https://www.youtube.com/watch?v=v0sWeLZ8PXg
        - Ben Pace 24 Jan 2019 20:07 UTC
          2 points
          I think this is a good comment about how the brain works, but do remember that the human brain can both hunt in packs and do physics. Most systems you might build to hunt are not able to do physics, and vice versa. We’re not perfectly competent, but we’re still general.
        - richard_ngo 24 Jan 2019 11:24 UTC
          2 points
          I agree that the extent to which individual humans are rational agents is often overstated. Nevertheless, there are many examples of humans who spend decades striving towards distant and abstract goals, who learn whatever skills and perform whatever tasks are required to reach them, and who strategically plan around or manipulate the actions of other people. If AGI is anywhere near as agentlike as humans in the sense of possessing the long-term goal-directedness I just described, that’s cause for significant concern.
        - DavidRooke 24 Jan 2019 21:21 UTC
          1 point
          A lifetime learning to be a 9th Dan master at Go, perhaps? Building on the back of thousands of years of human knowledge and wisdom? Demolished in hours... I still look at the game and it looks incredibly abstract!
          Don’t get me wrong, I am really concerned; I just consider the danger much closer than others do, but also more soluble if we look at the right problem and ask the right questions.