AI Safety in a Vulnerable World: Requesting Feedback on Preliminary Thoughts

Cross-Posted to LessWrong

I would like feedback on a hypothesis that has been percolating in my brain for the past few months.

Epistemic Status: I have studied AI Safety for less than 100 hours, but have been thinking about x-risk for several years.

I am concerned that even in some cases where advanced AI is aligned, the environment in which it exists may still make it unsafe.

If I am not mistaken, “AI Alignment” seems to mean getting AI to do what we want without harmful side effects, but “AI Safety” seems to imply keeping AI from harming or destroying humanity.

These two may come apart in a “Vulnerable World Scenario” in which some future technologies destroy civilization by default. This may be the case because certain technologies have an intrinsic offense bias, meaning if even a small number of humans want to kill everyone, or competing groups are willing to kill each other, those attacking will succeed and those defending will fail by default.

Offense Bias

If there is an offense bias in advanced AI, or in any other technology advanced AI leads to, it is not clear that aligning AI, i.e. “getting AI to do what we want,” would keep us safe. If multiple world powers have advanced AI and each orders its AI to destroy its enemies and protect its own citizens, then under an offense bias, where it is easier to attack than to defend, each AI may succeed in destroying the enemy but fail to defend its own citizens, meaning everyone dies.
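
To make that scenario concrete, here is a minimal toy sketch, not a real model of conflict. It assumes simultaneous strikes, deterministic outcomes, and a single hypothetical "capability" number per power; the function name `survivors` and the `offense_bias` multiplier are my own illustrative inventions.

```python
def survivors(powers, offense_bias):
    """Return the names of powers whose defense holds against every attacker.

    powers: dict mapping a power's name to its capability (hypothetical units).
    offense_bias: multiplier on attack strength; a value above 1 means
        attacking is easier than defending (assumption of this toy model).
    """
    alive = []
    for name, capability in powers.items():
        # Capabilities of everyone who is attacking this power.
        attackers = [c for other, c in powers.items() if other != name]
        strongest_attack = max(attackers, default=0.0) * offense_bias
        # A power survives only if its defense matches the strongest attack.
        if capability >= strongest_attack:
            alive.append(name)
    return alive


# Two evenly matched powers under an offense bias: both attacks succeed.
print(survivors({"A": 1.0, "B": 1.0}, offense_bias=2.0))
# The same two powers under a defense bias: both defenses hold.
print(survivors({"A": 1.0, "B": 1.0}, offense_bias=0.5))
```

In this sketch, evenly matched powers all perish whenever the multiplier exceeds 1, even though each AI did exactly what it was told.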

Due to entropy (the universe’s in-built destruction bias[1]), the fragility of humans, and the incredible flexibility of advanced AI, it seems quite plausible, and I would even guess more likely than not, that advanced AI will constitute or enable an offense bias.

This problem is compounded when we consider the many powerful advanced technologies AI may accelerate in the near future, such as biotechnology, 3D printing, nanotechnology, advanced robotics, brain-machine interfaces, advanced internet of things applications, advanced wearable/cyborg technologies, advanced computer viruses, black swan (unknown unknown) technologies, etc.

Due to advanced AI processes like PASTA (Process for Automating Scientific and Technological Advancement), powerful advanced technologies could arrive and have transformative effects quite soon. Any one of them could have an offense bias, as could any combination of them, including combinations with already existing technologies such as nuclear weapons and drones. This may result in a combinatorial explosion of possible offensive synergies as the number of technologies increases.
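
As a rough illustration of how fast the number of combinations grows, the sketch below counts every set of two or more technologies that could be drawn from n of them, which is 2^n - n - 1. It only counts possible combinations; it says nothing about how many would actually be dangerous. The function names are hypothetical.

```python
from itertools import combinations


def count_combinations(n_technologies: int) -> int:
    """Count the sets of two or more technologies among n: 2^n - n - 1."""
    return 2 ** n_technologies - n_technologies - 1


def enumerate_combinations(technologies):
    """List every combination of two or more of the given technologies."""
    combos = []
    for size in range(2, len(technologies) + 1):
        combos.extend(combinations(technologies, size))
    return combos


# The closed-form count agrees with explicit enumeration for a small example.
examples = ["nuclear weapons", "drones", "biotechnology"]
assert len(enumerate_combinations(examples)) == count_combinations(len(examples))

for n in (3, 10, 20):
    print(n, count_combinations(n))
```

Three technologies give 4 combinations, ten give 1,013, and twenty give 1,048,555, so each added technology roughly doubles the space of possible synergies.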

Perhaps something similar could be said of defensive technologies, though I am uncertain how the balance would play out. It seems probable to me that the more advanced technologies we expect there to be, and the more powerful we expect them to be, the more concerned we should be about this possibility.

It seems quite possible that many of the protective factors humans have historically possessed (social interdependence, fragility/mortality, the lack of overwhelming power, etc.) will break down, and so it should not be too surprising if one or more unprecedented offense biases occur.

I will next address a concept I will call “Human Alignment” which may be a way of framing solutions to a vulnerable world scenario.

Human Alignment

By “human alignment,” I mean a state of humanity in which most or all of humanity systematically cooperates to achieve positive-sum outcomes for everyone (or at a minimum is prevented from pursuing negative-sum outcomes), in a way that is perpetually sustainable into the future. While exceedingly difficult, saving a vulnerable world from existential catastrophe may necessitate this.

Bostrom points out that if humanity retains a “wide and recognizably human distribution of motives” resulting in a multipolar world order and an “apocalyptic residual,” then even a single apocalyptic actor with access to certain advanced technology may spell the end of civilization. As mentioned, however, actors need not be apocalyptic; it may be enough that they are willing to risk destroying each other to defend themselves, or in pursuit of their own interests.

In “The Vulnerable World Hypothesis” (VWH), one possible solution Bostrom proposes is universal surveillance of everyone at all times to prevent apocalyptic behavior. Many find this solution unpalatable, though perhaps better than extinction. It would result in humanity being (at least minimally) aligned by force.

Another possible solution is to sustainably eliminate all malicious and apocalyptic intentions, or in other words to universally create enough moral progress that no one desires to kill anyone else or is willing to risk destroying humanity. Bostrom seems to dismiss this solution as intractable. I think, however, that by using systemic interventions which incorporate mildly to moderately advanced AI to re-shape the moral fitness landscape toward desirable traits, among other interventions, this may be more tractable than it seems at first glance. I wrote the rough draft of a book on such solutions (for x-risk / vulnerable world in general, not AI x-risk specifically) before formally discovering EA, longtermism, and the VWH. I am now trying to understand the AI x-risk landscape better to see if a vulnerable world scenario is likely given the development of advanced AI.

Conclusion

My main question is whether an AI x-risk scenario induced by a vulnerable world seems plausible or likely.

I think my main crux is whether advanced AI is likely to be multipolar, meaning that multiple agents have access to it.

Another factor is whether advanced AI is likely to have uneven abilities such that the ability to commit genocide or to create new dangerous technologies is developed before the ability to defend humans, predict what technologies will be dangerous, or align humanity.

I am also very curious if this is something others have talked about, and if so, I would appreciate references to these discussions.

Finally, I would greatly appreciate any thoughts on my reasoning in general, what I may be missing, and what would be promising directions for further research for me.

Thank you in advance for your feedback!

  1. ^

    By which I mean it is easier to break something than to create or fix it. This is not exactly the same as offense bias, but it is closely related.