Where are the red lines for AI?

This is a cross-post from LessWrong

Thanks to Daniel Kokotajlo, Jan Hendrik Kirchner, Remmelt Ellen, Berbank Green, Otto Barten and Olaf Voß for helpful suggestions and comments.

As AI alignment remains terribly difficult and timelines appear to be shortening, we must face the likely situation that, within the next 20 years, we will be able to build an AI that poses an existential threat before we know how to control it. In this case, our only chance to avert a catastrophe will be to collectively refrain from developing such a “dangerous” AI. But what exactly does that mean?

It seems obvious that an AI which pursues the wrong goal and is vastly more intelligent than any human would be “dangerous” in the sense that it would likely be unstoppable and probably lead to an existential catastrophe. But where, exactly, is the tipping point? Where is the line between harmless current AIs, like GPT-3 or MuZero, and a future AI that may pose an existential threat?

To put the question differently: If we were asked to draft a global law that prohibits creating “dangerous AI”, what should be written in it? What are the things that no actor should ever be allowed to do, the “red lines” no one should ever cross, at least until there is a feasible and safe solution to the alignment problem? How would we even recognize a “dangerous AI”, or plans to build one?

This question is critical, because if the only way to avert an existential risk is to refrain from building a dangerous AI, we need to be very sure about what exactly makes an AI “dangerous” in this sense.

It may seem impossible to prevent all of humanity from doing something that is technically feasible. But while it is often difficult to get people to agree on any kind of policy, there are already many things that are not explicitly forbidden but that most people don’t do anyway: letting their children play with radioactive toys, eating unidentifiable mushrooms they find in the woods, climbing under a truck to drink wine while it is driving at full speed on the highway, or drinking aquarium cleaner as a treatment for Covid. There is a common understanding that these are stupid things to do because the risk is much greater than the possible benefit. This common understanding of dangerousness is all that is needed to keep a very large proportion of humanity from doing those things.

If we could create a similar common understanding of the necessary and sufficient conditions that turn an AI into an existential threat, I think there might be a chance that it wouldn’t be built, at least not for some time, even without a global law prohibiting it. After all, no one (apart perhaps from some suicidal terrorists) would want to risk the destruction of the world they live in. There is no shareholder value to be gained from it. The expected net present value of such an investment would be hugely negative. There is no personal fame and fortune waiting for the first person to destroy the world.

Of course, it may not be so easy to define exact criteria for when an AI becomes “dangerous” in this sense. More likely, there will be gray areas in which the territory becomes increasingly dangerous. Still, I think it would be worthwhile to put significant effort into mapping that territory. It would help us with governing AI development and might lead to international treaties and more cautious development in some areas. In the best case, it could even help us define what “safe AI” really means, and how to use its full potential without risking our future. As an additional benefit, if a planned AI system can be identified as potentially “dangerous” beforehand, the burden of proving that the containment and control measures are fail-safe would lie with the people intending to create it.

In order to determine the “dangerousness” of an AI system, we should avoid the common mistake of using an anthropomorphic benchmark. When we currently talk about existential AI risks, we usually use terms like “artificial general intelligence” or “super-intelligent AI”. This seems to imply that AI becomes dangerous only at some point after it reaches “general problem-solving capabilities on at least human level”, which would make human-level generality a necessary condition. But this is misleading. First, it can lead people to underestimate the danger because they falsely equate “first arrival of dangerous AI” with “the time we fully understand the human brain”. Second, AI is already vastly super-intelligent in many narrow areas. A system that could destroy the world without being able to solve every problem at a human level is at least conceivable. For example, an AI that is superhuman at strategy and persuasion could manipulate humans in a way that leads to a global nuclear war, even though it may not be able to recognize images or control a robot body in the real world. Third, as soon as an AI gained general problem-solving capabilities at a human level, it would already be vastly superhuman in many other respects, like memory, speed of thought, access to data, ability to self-improve, etc., which might make it an invincible power. This has been illustrated in the following graphic (courtesy of AI Impacts, thanks to Daniel Kokotajlo for pointing it out to me):

The points above indicate that the line between “harmless” and “dangerous” must lie somewhere below the traditional threshold of “at least human problem-solving capabilities in most domains”. Even today’s narrow AIs often have significant negative, possibly even catastrophic, side effects (think, for example, of social media algorithms pushing extremist views, amplifying divisiveness and hatred, and increasing the likelihood of nationalist governments and dictatorships, which in turn increases the risk of wars). While there are many beneficial applications of advanced AI, the possibility of things going badly wrong also increases with the current speed of development. This makes it even more critical to determine how exactly an AI can become “dangerous”, even if it lacks some of the capabilities typically associated with AGI.

It is beyond the scope of this post to make specific recommendations about how “dangerousness” could be defined and measured. This will require a lot more research. But there are at least some properties of an AI that could be relevant in this context:

  • Broadness (of capabilities): Today’s narrow AIs are obviously not an existential threat yet. As the range of domains in which a system is capable grows, however, so does the risk of the system exhibiting unforeseen and unwanted behavior in some domain. This doesn’t mean that a narrow AI is necessarily safe (see the strategy-and-persuasion example above), but broadness of capabilities could be a factor in determining dangerousness.

  • Complexity: The more complex a system is, the more difficult it is to predict its behavior, which increases the likelihood that some of this behavior will be undesirable or even catastrophic. Therefore, all else being equal, the more complex a system is (measured, for example, by the number of parameters of a transformer neural network), the more dangerous it is.

  • Opaqueness: Some complex systems are easier to understand and predict than others. For example, symbolic AI tends to be less “opaque” than neural networks. Tools for explainability can help reduce an AI’s opaqueness. The more opaque a system is, the less predictable and the more dangerous it is.

  • World model: The more an AI knows about the world, the better it becomes at making plans about future world states and acting effectively to change these states, including in directions we don’t want. Therefore, the scope and precision of its knowledge about the real world may be a factor in its dangerousness.

  • Strategic awareness (as defined by Joseph Carlsmith, see section 2.1 of this document): This may be a critical factor in the dangerousness of an AI. A system with strategic awareness realizes to some extent that it is a part of its environment, a necessary element of its plan to achieve its goals, and a potential object of its own decisions. This leads to instrumental goals, like power-seeking, self-improvement, and preventing humans from turning it off or changing its main goal. The more strategically aware an AI becomes, the more dangerous it is.

  • Stability: A system that dynamically changes over time is less predictable, and therefore more dangerous, than a system that is stable. For example, an AI that learns in real time and is even able to self-improve should in general be considered more dangerous than a system that is trained once and then applied to a task without any further changes.

  • Computing power: The more computing power a system has, the more powerful, and therefore potentially dangerous, it becomes. This also applies to processing speed: The faster a system can decide and react, the more dangerous it is, because there is less time to understand its decisions and correct it if necessary.

One feature that I deliberately did not include in the list above is “connectivity to the outside world”, e.g. access to the internet, sensors, robots, or communication with humans. An AI that is connected to the internet and has access to many gadgets and points of contact can better manipulate the world and thus do dangerous things more easily. However, an AI that would be considered dangerous with access to some or all of these things should also be considered dangerous without them, because giving such a system access to the outside world, either accidentally or on purpose, could cause a catastrophe without further changing the system itself. Dynamite is considered dangerous even if there is no burning match held next to it. Restricting access to the outside world should instead be regarded as a possible measure to contain or control a potentially dangerous AI, and that measure should be seen as inherently insecure.

This list is by no means complete. There are likely other types of features, e.g. certain mathematical properties, which may be relevant but which I don’t know about or don’t understand well enough to mention. I only want to point out that there may be objective, measurable features of an AI that could be used to determine its “dangerousness”. It is still unclear, however, how relevant these features are, how they interact with each other, and whether there are some absolute thresholds that can serve as “red lines”. I believe that further research into these questions would be very valuable.