Disentangling arguments for the importance of AI safety
I recently attended the 2019 Beneficial AGI conference organised by the Future of Life Institute. I’ll publish a more complete write-up later, but I was particularly struck by how varied attendees’ reasons for considering AI safety important were. Before this, I’d observed a few different lines of thought, but interpreted them as different facets of the same idea. Now, though, I’ve identified at least 6 distinct serious arguments for why AI safety is a priority. By distinct I mean that you can believe any one of them without believing any of the others—although of course the particular categorisation I use is rather subjective, and there’s a significant amount of overlap. In this post I give a brief overview of my own interpretation of each argument (note that I don’t necessarily endorse them myself). They are listed roughly from most specific and actionable to most general. I finish with some thoughts on what to make of this unexpected proliferation of arguments. Primarily, I think it increases the importance of clarifying and debating the core ideas in AI safety.
Maximisers are dangerous. Superintelligent AGI will behave as if it’s maximising the expectation of some utility function, since doing otherwise can be shown to be irrational. Yet we can’t write down a utility function which precisely describes human values, and optimising very hard for any other function will lead to that AI rapidly seizing control (as a convergent instrumental subgoal) and building a future which contains very little of what we value (because of Goodhart’s law and the complexity and fragility of values). We won’t have a chance to notice and correct misalignment because an AI which has exceeded human level will improve its intelligence very quickly (either by recursive self-improvement or by scaling up its hardware), and then prevent us from modifying it or shutting it down.
This was the main thesis advanced by Yudkowsky and Bostrom when founding the field of AI safety. Here I’ve tried to convey the original line of argument, although some parts of it have been strongly critiqued since then. In particular, Drexler and Shah have disputed the relevance of expected utility maximisation (the latter suggesting the concept of goal-directedness as a replacement), while Hanson and Christiano disagree that AI intelligence will increase in a very fast and discontinuous way.
Most of the arguments in this post originate from or build on this one in some way. This is particularly true of the next two arguments—nevertheless, I think that there’s enough of a shift in focus in each to warrant separate listings.
The target loading problem. Even if we knew exactly what we wanted a superintelligent agent to do, we don’t currently know (even in theory) how to make an agent which actually tries to do that. In other words, if we were to create a superintelligent AGI before solving this problem, the goals we would ascribe to that AGI (by taking the intentional stance towards it) would not be the ones we had intended to give it. As a motivating example, evolution selected humans for their genetic fitness, yet humans have goals which are very different from just spreading their genes. In a machine learning context, while we can specify a finite number of data points and their rewards, neural networks may then extrapolate from these rewards in non-humanlike ways.
This is a more general version of the “inner optimiser problem”, and I think it captures the main thrust of the latter while avoiding the difficulties of defining what actually counts as an “optimiser”. I’m grateful to Nate Soares for explaining the distinction, and arguing for the importance of this problem.
The prosaic alignment problem. It is plausible that we build “prosaic AGI”, which replicates human behaviour without requiring breakthroughs in our understanding of intelligence. Shortly after they reach human level (or possibly even before), such AIs will become the world’s dominant economic actors. They will quickly come to control the most important corporations, earn most of the money, and wield enough political influence that we will be unable to coordinate to place limits on their use. Due to economic pressures, corporations or nations who slow down AI development and deployment in order to focus on aligning their AI more closely with their values will be outcompeted. As AIs exceed human-level intelligence, their decisions will become too complex for humans to understand or provide feedback on (unless we develop new techniques for doing so), and eventually we will no longer be able to correct the divergences between their values and ours. Thus the majority of the resources in the far future will be controlled by AIs which don’t prioritise human values. This argument was explained in this blog post by Paul Christiano.
More generally, aligning multiple agents with multiple humans is much harder than aligning one agent with one human, because value differences might lead to competition and conflict even between agents that are each fully aligned with some humans. (As my own speculation, it’s also possible that having multiple agents would increase the difficulty of single-agent alignment—e.g. the question “what would humans want if I didn’t manipulate them” would no longer track our values if we would counterfactually be manipulated by a different agent).
The human safety problem. This line of argument (which Wei Dai has recentlyhighlighted) claims that no human is “safe” in the sense that giving them absolute power would produce good futures for humanity in the long term, and therefore that building AI which extrapolates and implements the values of even a very altruistic human is insufficient. A prosaic version of this argument emphasises the corrupting effect of power, and the fact that morality is deeply intertwined with social signalling—however, I think there’s a stronger and more subtle version. In everyday life it makes sense to model humans as mostly rational agents pursuing their goals and values. However, this abstraction breaks down badly in more extreme cases (e.g. addictive superstimuli, unusual moral predicaments), implying that human values are somewhat incoherent. One such extreme case is running my brain for a billion years, after which it seems very likely that my values will have shifted or distorted radically, in a way that my original self wouldn’t endorse. Yet if we want a good future, this is the process which we require to go well: a human (or a succession of humans) needs to maintain broadly acceptable and coherent values for astronomically long time periods.
An obvious response is that we shouldn’t entrust the future to one human, but rather to some group of humans following a set of decision-making procedures. However, I don’t think any currently-known institution is actually much safer than individuals over the sort of timeframes we’re talking about. Presumably a committee of several individuals would have lower variance than just one, but as that committee grows you start running into well-known problems with democracy. And while democracy isn’t a bad system, it seems unlikely to be robust on the timeframe of millennia or longer. (Alex Zhu has made the interesting argument that the problem of an individual maintaining coherent values is roughly isomorphic to the problem of a civilisation doing so, since both are complex systems composed of individual “modules” which often want different things.)
While AGI amplifies the human safety problem, it may also help solve it if we can use it to decrease the value drift that would otherwise occur. Also, while it’s possible that we need to solve this problem in conjunction with other AI safety problems, it might be postponable until after we’ve achieved civilisational stability.
Note that I use “broadly acceptable values” rather than “our own values”, because it’s very unclear to me which types or extent of value evolution we should be okay with. Nevertheless, there are some values which we definitely find unacceptable (e.g. having a very narrow moral circle, or wanting your enemies to suffer as much as possible) and I’m not confident that we’ll avoid drifting into them by default.
Misuse and vulnerabilities. These might be catastrophic even if AGI always carries out our intentions to the best of its ability:
AI which is superhuman at science and engineering R&D will be able to invent very destructive weapons much faster than humans can. Humans may well be irrational or malicious enough to use such weapons even when doing so would lead to our extinction, especially if they’re invented before we improve our global coordination mechanisms. It’s also possible that we invent some technology which destroys us unexpectedly, either through unluckiness or carelessness. For more on the dangers from technological progress in general, see Bostrom’s paper on the vulnerable world hypothesis.
AI could be used to disrupt political structures, for example via unprecedentedly effective psychological manipulation. In an extreme case, it could be used to establish very stable totalitarianism, with automated surveillance and enforcement mechanisms ensuring an unshakeable monopoly on power for leaders.
AI could be used for large-scale projects (e.g. climate engineering to prevent global warming, or managing the colonisation of the galaxy) without sufficient oversight or verification of robustness. Software or hardware bugs might then induce the AI to make unintentional yet catastrophic mistakes.
People could use AIs to hack critical infrastructure (include the other AIs which manage aforementioned large-scale projects). In addition to exploiting standard security vulnerabilities, hackers might induce mistakes using adversarial examples or ‘data poisoning’.
Argument from large impacts. Even if we’re very uncertain about what AGI development and deployment will look like, it seems likely that AGI will have a very large impact on the world in general, and that further investigation into how to direct that impact could prove very valuable.
Weak version: development of AGI will be at least as big an economic jump as the industrial revolution, and therefore affect the trajectory of the long-term future. See Ben Garfinkel’s talk at EA Global London 2018 (which I’ll link when it’s available online). Ben noted that to consider work on AI safety important, we also need to believe the additional claim that there are feasible ways to positively influence the long-term effects of AI development—something which may not have been true for the industrial revolution. (Personally my guess is that since AI development will happen more quickly than the industrial revolution, power will be more concentrated during the transition period, and so influencing its long-term effects will be more tractable.)
Strong version: development of AGI will make humans the second most intelligent species on the planet. Given that it was our intelligence which allowed us to control the world to the large extent that we do, we should expect that entities which are much more intelligent than us will end up controlling our future, unless there are reliable and feasible ways to prevent it. So far we have not discovered any.
What should we think about the fact that there are so many arguments for the same conclusion? As a general rule, the more arguments support a statement, the more likely it is to be true. However, I’m inclined to believe that quality matters much more than quantity—it’s easy to make up weak arguments, but you only need one strong one to outweigh all of them. And this proliferation of arguments is (weak) evidence against their quality: if the conclusions of a field remain the same but the reasons given for holding those conclusions change, that’s a warning sign for motivated cognition (especially when those beliefs are considered socially important). This problem is exacerbated by a lack of clarity about which assumptions and conclusions are shared between arguments, and which aren’t.
On the other hand, superintelligent AGI is a very complicated topic, and so perhaps it’s natural that there are many different lines of thought. One way to put this in perspective (which I credit to Beth Barnes) is to think about the arguments which might have been given for worrying about nuclear weapons, before they had been developed. Off the top of my head, there are at least four:
They might be used deliberately.
They might be set off accidentally.
They might cause a nuclear chain reaction much larger than anticipated.
They might destabilise politics, either domestically or internationally.
And there are probably more which would have been credible at the time, but which seem silly now due to hindsight bias. So if there’d been an active anti-nuclear movement in the 30’s or early 40’s, the motivations of its members might well have been as disparate as those of AI safety advocates today. Yet the overall concern would have been (and still is) totally valid and reasonable.
I think the main takeaway from this post is that the AI safety community as a whole is still confused about the very problem we are facing. The only way to dissolve this tangle is to have more communication and clarification of the fundamental ideas in AI safety, particularly in the form of writing which is made widely available. And while it would be great to have AI safety researchers explaining their perspectives more often, I think there is still a lot of explicatory work which can be done regardless of technical background. In addition to analysis of the arguments discussed in this post, I think it would be particularly useful to see more descriptions of deployment scenarios and corresponding threat models. It would also be valuable for research agendas to highlight which problem they are addressing, and the assumptions they require to succeed.
This post has benefited greatly from feedback from Rohin Shah, Alex Zhu, Beth Barnes, Adam Marblestone, Toby Ord, and the DeepMind safety team. Also see the discussion which has taken place on LessWrong. All opinions are my own.