A Simple Model of AGI Deployment Risk

What risks should we be willing to take when deploying superintelligent systems? Such systems could prove quite dangerous, and so we may wish to delay their use in order to improve their safety. But superintelligent systems can also be used to defend ourselves against other existential threats, such as bioweapons, nuclear weapons, or indeed, other malicious AI systems. In deciding the optimal time to deploy superintelligent systems, we must therefore trade off the risks from deploying these systems with the gain in protection they afford us once deployed.

My aim here is to develop a simple model in which to explore this trade-off. While I will phrase the model around the question of when to deploy a superintelligent system, similar issues arise more generally whenever we must trade state risks, which are risks from existing in a vulnerable state, against transition risks, which are risks that arise from transitioning out of the vulnerable state.[1] For instance, developing certain biological capabilities might permanently reduce biorisk in the long term, but cause a short-lived increase in biorisk along the way; likewise, certain geoengineering projects might permanently reduce risks from climate change, but could cause a catastrophe of their own if the deployment is botched. In each of these cases, waiting longer increases the state risk we are exposed to, but it also gives us more time to find and mitigate potential transition risks.

The Model

Let us assume that

  1. At any given time $t$ we face a background extinction rate of $r(t)$.

  2. At any given time $t$ we can build a superintelligent system. Let $p(t)$ be the probability that the system we build at time $t$ is aligned. We assume that if the system is aligned, we end up in a “win-state”: with the help of its aligned superintelligent friend, humanity achieves its full potential. If the system is unaligned, however, humanity goes extinct.

  3. Our only goal is to maximize the probability we end up in a “win-state”.[2]

This model is very simplistic. In particular, we assume that we will always succeed at building a superintelligent system if we try. This assumption, however, is not as restrictive as it first appears. If instead we will only succeed at building a superintelligent system with some probability $s$ and otherwise fail, then so long as trying and failing costs little, the value of $s$ does not change whether trying to create a superintelligence now is net positive or negative, relative to doing nothing. We can therefore apply our model even to cases where success is not guaranteed, so long as failure is not too costly.
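To spell out this argument (the notation $V_{\text{try}}$ and $V_{\text{wait}}$ is mine, introduced only for this aside): suppose an attempt made now succeeds with probability $s$, in which case the win probability is $p(t)$, and otherwise fails at negligible cost, leaving us with whatever win probability $V_{\text{wait}}$ we would have had by not trying. Then

$$V_{\text{try}} = s\,p(t) + (1-s)\,V_{\text{wait}}, \qquad\text{so}\qquad V_{\text{try}} - V_{\text{wait}} = s\,\bigl(p(t) - V_{\text{wait}}\bigr).$$

The sign of this difference, and hence whether trying now beats doing nothing, does not depend on $s$.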

We also ignore the possibility of S-risks, either background or caused by our superintelligent system. It is straightforward to allow for this, but at the cost of making the model a bit more complicated.

Solving the Model

Let us begin by considering the probability $S(t)$ that humanity survives to time $t$, assuming that no attempt is made to build a superintelligent system. This quantity satisfies the differential equation

$$\frac{dS}{dt} = -r(t)\,S(t), \qquad S(0) = 1,$$

so that $S(t) = \exp\!\left(-\int_0^t r(s)\,ds\right)$. If we decide to build a superintelligent system at time $t$, the overall probability that humanity realizes its potential is then given by

$$W(t) = p(t)\,S(t).$$

Differentiating with respect to $t$, we find that

$$\frac{dW}{dt} = \left(\frac{p'(t)}{p(t)} - r(t)\right) W(t).$$

From this we conclude that if we are at some time $t$ for which $\frac{p'(t)}{p(t)} > r(t)$, then we can always improve our survival probability by waiting. Note in particular that because $p(t) \le 1$, it is always optimal to wait if $p'(t) > r(t)$. If instead we have $\frac{p'(t)}{p(t)} < r(t)$, delaying superintelligent deployment decreases our survival probability.[3] The optimal time to deploy the system occurs when

$$\frac{p'(t)}{p(t)} = r(t),$$

or, in other words, when the relative gain in $p(t)$ is precisely equal to the background extinction rate.
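As a sanity check, here is a minimal numerical sketch (not part of the original argument) that maximizes $W(t)$ for purely illustrative choices of $p(t)$ and $r(t)$ and confirms that the maximum lands where $p'(t)/p(t) \approx r(t)$:

```python
import numpy as np

# Illustrative (assumed) functional forms -- any increasing p(t) and positive r(t) will do.
t = np.linspace(0.0, 200.0, 200_001)
dt = t[1] - t[0]

p = 1.0 - 0.3 * np.exp(-0.1 * t)      # P(system built at t is aligned), rising over time
r = 0.002 / (1.0 + 0.01 * t)          # background extinction rate, slowly declining

# Survival probability S(t) = exp(-integral of r), via a cumulative trapezoid rule.
S = np.exp(-np.concatenate(([0.0], np.cumsum(0.5 * (r[1:] + r[:-1]) * dt))))
W = p * S                             # probability of reaching the win-state if we deploy at t

i = int(np.argmax(W))                 # numerically optimal deployment time
p_dot = np.gradient(p, dt)
print(f"t* ≈ {t[i]:.1f} yr,  p'/p ≈ {p_dot[i] / p[i]:.5f},  r(t*) ≈ {r[i]:.5f}")
```

With these particular choices the two printed rates agree closely, as the first-order condition requires.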

Even without further specifying $p(t)$ and $r(t)$, it is immediately clear that so long as the background risk is bounded from below, we should always eventually deploy our superintelligent system, no matter how dangerous the system is. After all, a background risk bounded away from zero guarantees that humanity will eventually go extinct if we never act, and so any gamble with a nonzero chance of reaching the win-state, no matter how slim, is preferable to certain extinction. We can make this argument more quantitative by assuming that the background extinction risk is bounded so that $r(t) \ge r_{\min} > 0$, in which case the optimal deployment time $t^*$ satisfies the inequality

$$t^* \;\le\; \frac{1}{r_{\min}} \ln \frac{1}{p(0)}.$$

This can be derived by directly integrating the condition for optimal deployment, or, alternatively, by noting that $W(t) \le e^{-r_{\min} t}$, which falls below $W(0) = p(0)$ once $t$ exceeds this bound.
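To spell out the first route (my reconstruction of the integration step): if waiting remains worthwhile at every time before $t^*$, then $p'(t)/p(t) \ge r(t)$ on $[0, t^*]$, and integrating gives

$$\ln\frac{p(t^*)}{p(0)} = \int_0^{t^*} \frac{p'(s)}{p(s)}\,ds \;\ge\; \int_0^{t^*} r(s)\,ds \;\ge\; r_{\min}\, t^*,$$

and since $p(t^*) \le 1$, the left-hand side is at most $\ln\frac{1}{p(0)}$, which yields the bound.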

Quantitative Estimates

To make further progress, let us assume that the background extinction rate is some constant $r(t) = r_0$. Let us furthermore assume that the transition risk decreases exponentially at rate $\lambda$, so that

$$1 - p(t) = \bigl(1 - p(0)\bigr)\, e^{-\lambda t}.$$

We then find that at the optimal time $t^*$ to deploy our superintelligent system,

$$1 - p(t^*) = \frac{r_0}{\lambda + r_0}.$$

Because the transition risk $1 - p(t)$ decreases monotonically in our model, $\frac{r_0}{\lambda + r_0}$ also represents the greatest possible transition risk we should be willing to take: if ever we find that $1 - p(t) \le \frac{r_0}{\lambda + r_0}$, then we should deploy the superintelligent system, and otherwise we should wait.
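Filling in the intermediate algebra (these steps are mine): writing $q(t) = 1 - p(t)$ for the transition risk, we have $p'(t) = \lambda\, q(t)$, so the optimality condition $p'(t)/p(t) = r_0$ reads

$$\frac{\lambda\, q(t^*)}{1 - q(t^*)} = r_0 \qquad\Longrightarrow\qquad 1 - p(t^*) = q(t^*) = \frac{r_0}{\lambda + r_0}.$$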

To get a feel for what kind of risk we should be willing to take, suppose that the AGI deployment risk decreases by 10% per year and that the baseline extinction rate is 0.1% per year (so that $\lambda = 0.1$ and $r_0 = 0.001$). We then find that $1 - p(t^*) \approx 1\%$: humanity should be willing to deploy once the chance of misalignment falls to roughly one percent.
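For concreteness, a two-line check of this figure (the variable names are my own):

```python
lam, r0 = 0.10, 0.001                  # 10%/yr fall in transition risk; 0.1%/yr background risk
q_star = r0 / (lam + r0)               # largest misalignment risk worth accepting, 1 - p(t*)
print(f"deploy once 1 - p(t) <= {q_star:.2%}")   # prints roughly 0.99%
```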

So far we have been thinking of $r_0$ as the background rate of extinction for all humanity. Most actors, however, are probably more interested in their own survival (or, at least, the survival of their ideals or descendants) than in the survival of humanity more broadly. For such actors, the value of $r_0$ they will work with is generally larger than the background extinction rate for humanity, and they will therefore be willing to take larger risks than is optimal for humanity. For instance, a government threatened by nuclear war might take $r_0$ to be closer to 1% per year even in relatively peaceful times, and so (again taking $\lambda = 0.1$) would in this model be willing to deploy a superintelligent system when $1 - p(t) \approx 10\%$. Because individual actors may be willing to take larger gambles than is optimal for humanity, our model exhibits a “risk compensation effect”, whereby publicly spreading a safety technique may actually increase existential risk, if it reduces the transition risk $1 - p$ to below the value at which a selfish actor would rationally deploy the system, but not below the value at which it is rational for humanity to deploy it.
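The same expression makes this risk-compensation window concrete (a sketch using the illustrative rates above):

```python
LAM = 0.10                                   # transition risk falls by 10% per year

def threshold(r0: float, lam: float = LAM) -> float:
    """Largest transition risk worth accepting for an actor facing background rate r0."""
    return r0 / (lam + r0)

q_humanity = threshold(0.001)                # humanity-wide extinction rate: 0.1% per year
q_selfish = threshold(0.010)                 # a threatened government's effective rate: 1% per year
print(f"humanity: {q_humanity:.1%}   selfish actor: {q_selfish:.1%}")
# Safety progress that leaves the transition risk between these two values (~1% to ~9%)
# triggers deployment by the selfish actor while humanity as a whole would prefer to wait.
```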

Conclusion

My aim here has been to develop a model to explore the optimal time to build a potentially dangerous superintelligent system. Act too quickly and the superintelligent system might be more dangerous than necessary; but if we wait too long we may instead expose ourselves to background risks that we could have used our superintelligent system to avoid. Hence, the optimal time to build a superintelligent system depends not only on the risk that such a system causes extinction, but also on the background extinction risk and on the rate at which we can improve the safety of our superintelligent systems. While any quantitative estimate of the optimal AI risk to accept is necessarily very speculative, I was surprised by how easily risks on the order of a few percent could turn out to be a rational gamble for humanity. Rather worryingly, selfish actors who value their own survival should be willing to take even riskier gambles, such that spreading AI safety techniques may not always reduce existential risk.

Acknowledgements

I’m grateful to Abigail Thomas, Owen Cotton-Barratt, and Fin Moorhouse, for encouragement, discussions, and feedback.


  1. ↩︎

    This categorization of risks is introduced by Nick Bostrom in Superintelligence, although he uses the term step risk rather than transition risk, the term used by Toby Ord in Chapter 7 of The Precipice.

  2. ↩︎

    In essence this means that we are ignoring the possibility that some “win-states” might be much better than others, and are assuming that the value of the win-state is so great that additional pleasure or suffering which occurs before the win-state can be ignored.

  3. ↩︎

    This is only true locally. In some cases it is possible that, after waiting a sufficiently long time, $W(t)$ will increase again, leading to an overall increase in the probability that humanity survives.
