Resources I send to AI researchers about AI safety

This is my masterlist of resources I send AI researchers who are mildly interested in learning more about AI safety. I pick and choose which resources to send based on the researcher’s interests. The resources at the top of the email draft are the ones I usually send, and I add in later sections as seems useful. I’ll also sometimes send The Alignment Problem, Human-Compatible, or The Precipice.

I’ve also included a list of resources that I had students read through for the Stanford first-year course “Preventing Human Extinction”, though I’d most recommend sufficiently motivated students read the AGISF Technical Agenda.

These reading choices are drawn from various other reading lists; this is not original in any way, just something to draw from if you’re trying to send someone some of the more accessible resources.

There’s a decent chance that I’ll continue updating this post as time goes on, since my current use case is copy-pasting sections of this email to interested parties. Note that “I” and “Vael” are mentioned a few times, so you’ll need to edit a bit if you’re copy-pasting. Happy to make any edits and take suggestions.

[Crossposted to LessWrong]

List for AI researchers

Hello X,

Very nice to speak to you! As promised, some resources on AI alignment. I tried to include a bunch of stuff so you could look at whatever you found interesting. Happy to chat more about anything, and thanks again.

Introduction to the ideas

Technical work on AI alignment

Introduction to large-scale risks from humanity, including “existential risks” that could lead to the extinction of humanity

Chapter 3 is on natural risks, including risks of asteroid and comet impacts, supervolcanic eruptions, and stellar explosions. Ord argues that we can appeal to the fact that we have already survived for 2,000 centuries as evidence that the total existential risk posed by these threats from nature is relatively low (less than one in 2,000 per century).

Chapter 4 is on anthropogenic risks, including risks from nuclear war, climate change, and environmental damage. Ord estimates these risks as significantly higher, each posing about a one in 1,000 chance of existential catastrophe within the next 100 years. However, the odds are much higher that climate change will result in non-existential catastrophes, which could in turn make us more vulnerable to other existential risks.

Chapter 5 is on future risks, including engineered pandemics and artificial intelligence. Worryingly, Ord puts the risk of engineered pandemics causing an existential catastrophe within the next 100 years at roughly one in thirty. With any luck the COVID-19 pandemic will serve as a “warning shot,” making us better able to deal with future pandemics, whether engineered or not. Ord’s discussion of artificial intelligence is more worrying still. The risk here stems from the possibility of developing an AI system that both exceeds every aspect of human intelligence and has goals that do not coincide with our flourishing. Drawing upon views held by many AI researchers, Ord estimates that the existential risk posed by AI over the next 100 years is an alarming one in ten.

Chapter 6 turns to questions of quantifying particular existential risks (some of the probabilities cited above do not appear until this chapter) and of combining these into a single estimate of the total existential risk we face over the next 100 years. Ord’s estimate of the latter is one in six.

How AI could be an existential risk

  • AI alignment researchers disagree a surprisingly high amount about how AI could constitute an existential risk, so I hardly think the question is settled. Some plausible ones people are considering (copied from the paper):

  • “Superintelligence”

    • A single AI system with goals that are hostile to humanity quickly becomes sufficiently capable for complete world domination, and causes the future to contain very little of what we value, as described in “Superintelligence”. (Note from Vael: Where the AI has an instrumental incentive to destroy humans and uses its planning capabilities to do so, for example via synthetic biology or nanotechnology.)

  • Part 2 of “What failure looks like”

    • This involves multiple AIs accidentally being trained to seek influence, and then failing catastrophically once they are sufficiently capable, causing humans to become extinct or otherwise permanently lose all influence over the future. (Note from Vael: I think we might have to pair this with something like “and in loss of control, the environment then becomes uninhabitable to humans through pollution or consumption of important resources for humans to survive”)

  • Part 1 of “What failure looks like”

    • This involves AIs pursuing easy-to-measure goals, rather than the goals humans actually care about, causing us to permanently lose some influence over the future. (Note from Vael: I think we might have to pair this with something like “and in loss of control, the environment then becomes uninhabitable to humans through pollution or consumption of important resources for humans to survive”)

  • War

    • Some kind of war between humans, exacerbated by developments in AI, causes an existential catastrophe. AI is a significant risk factor in the catastrophe, such that no catastrophe would have occurred without the developments in AI. The proximate cause of the catastrophe is the deliberate actions of humans, such as the use of AI-enabled, nuclear, or other weapons. See Dafoe (2018) for more detail. (Note from Vael: Though there’s a recent argument that it may be unlikely for nuclear weapons to cause an extinction event, and instead it would just be catastrophically bad. One could still do it with synthetic biology though, probably, to get all of the remote people.)

  • Misuse

    • Intentional misuse of AI by one or more actors causes an existential catastrophe (excluding cases where the catastrophe was caused by misuse in a war that would not have occurred without developments in AI). See Karnofsky (2016) for more detail.

  • Other

There’s also a growing community working on AI alignment

Off-switch game and corrigibility

  • Off-switch game and corrigibility paper, about incentives for AI to be shut down. This article from DeepMind about “specification gaming” isn’t about off-switches, but it also makes me feel like there’s currently a tradeoff in task specification, where building more generalizability into a system will result in novel solutions but less control. Their follow-up paper, where they outline a possible research direction for this problem, makes me feel like encoding human preferences is going to be quite hard, as does much of the other discussion in AI alignment, though we don’t know how hard the alignment problem will be.


There are also two related communities who care about these issues, which you might find interesting

Governance, aimed at highly capable systems in addition to today’s systems

It seemed like a lot of your thoughts about AI risk went through governance, so I wanted to mention what the space looks like (spoiler: it’s preparadigmatic) if you haven’t seen that yet!

AI Safety in China

AI Safety community building, student-focused (see academic efforts above)

If they’re curious about other existential / global catastrophic risks:

Large-scale risks from synthetic biology:

Large-scale risks from nuclear

Why I don’t think we’re on the right timescale to worry most about climate change:

Happy to chat more about anything, and good to speak to you!



List for “Preventing Human Extinction” class

When might advanced AI be developed?

Why might advanced AI be a risk?

Thinking about making advanced AI go well (technical)

Thinking about making advanced AI go well (governance)

Optional (large-scale risks from AI)

Natural science sources

  1. ^

    Hi X,

    [warm introduction]

    In the interest of increasing options, I wanted to reach out and say that I’d be particularly happy to help you explore synthetic biology pathways more, if you were so inclined. I think it’s pretty plausible we’ll get another, worse pandemic in our lifetimes, and it’d be worth investing a career, or part of a career, in working on it. Especially since so few people will make that choice, a single person probably matters a lot compared to entering more popular careers.

    No worries if you’re not interested though—this is just one option out of many. I’m emailing you in a batch instead of individually so that hopefully you feel empowered to ignore this email and be done with this class :P. Regardless, thanks for a great quarter and hope you have great summers!

    If you are interested: