Where I’m at with AI risk: convinced of danger but not (yet) of doom

[Content note: this post discusses AI doom. I'm sceptical about AI doom, but if dwelling on the topic is anxiety-inducing for you, consider skipping this post.]

I’m a cause-agnostic (or more accurately ‘cause-confused’) EA with a non-technical background. A lot of my friends and writing clients are extremely worried about existential risks from AI. Many believe that humanity is more likely than not to go extinct due to AI within my lifetime.

I realised that I was confused about this, so I set myself the goal of understanding the case for AI doom, and my own scepticism, better. I did this by (very limited!) reading, writing down my thoughts, and talking to friends and strangers (some of whom I recruited from the Bountied Rationality Facebook group; if any of you are reading, thanks again!). Tl;dr: I think there are good reasons to worry about extremely powerful AI, but I don't yet understand why people think superintelligent AI is highly likely to end up killing everyone by default.

Why I’m writing this


I’m writing up my current beliefs and confusions in the hope that readers will be able to correct my misconceptions, clarify things I’m confused about, and link me to helpful resources. I also personally enjoy reading other EAs’ reflections about cause areas: e.g. Saulius’ post on wild animal welfare, or Nuño’s sceptical post about AI risk. This post is far less well-informed, but I found those posts valuable because of their reasoning transparency more than their authors’ expertise. I’d love to read more posts by ‘layperson’ EAs talking about their personal cause prioritisation.

I also think that 'confusion' is an underrepresented intellectual position. At EAGx Cambridge, Yulia Ponomarenko led a great workshop on 'Asking daft questions with confidence'. We talked about how EAs are sometimes unwilling to ask questions that would make them less confused, for fear that the questions are too basic, silly, 'dumb', or about something they're already expected to know.

This could create a false appearance of consensus about cause areas or world models. People who are convinced by the case for AI risk will naturally be very vocal, as will those who are confidently sceptical. However, people who are unsure or confused may be unwilling to share their thoughts, either because they’re afraid that others will look down on them for not already understanding the case, or just because most people are less motivated to write about their vague confusions than their strong opinions. So I’m partly writing this as representation for the ‘generally unsure’ point of view.

Some caveats: there's a lot I haven't read, including many basic resources. And my understanding of the technical side of AI (maths, programming) is extremely limited. Technical friends often say 'you don't need to understand the technical details about AI to understand the arguments for x-risk from AI'. But when I talk and think about these questions, it subjectively feels like I run up against a lack of technical understanding quite often.

Where I’m at with AI safety

Tl;dr: I’m concerned about certain risks from misaligned or misused AI, but I don’t understand the arguments that AI will, by default and in the absence of specific alignment techniques, be so misaligned as to cause human extinction (or something similarly bad).

Convincing (to me) arguments for why AI could be dangerous

Humans could use AI to do bad things more effectively

For example, politicians could use AI to wage devastating wars on their enemies, or CEOs could use it to increase their profits in harmful or reckless ways. This seems like a good reason to regulate AI development heavily and/or to democratise control of AI, so that it's harder for powerful people to use AI to further entrench their power.

We don’t know how AIs work, and that’s worrying


AIs are becoming freakishly powerful really fast. The capabilities of Midjourney, Gato, GPT-4, AlphaFold and more are staggering. It's worrying that even the developers of these systems don't really understand how they do what they do. Interpretability research seems super important.

AI is likely to cause societal upheaval


For example, AI might replace most human jobs over the coming decades. This could lead to widespread poverty and unrest if politicians manage the transition badly. It could also cause a crisis of meaning if humans can no longer derive their self-worth or self-esteem from their 'usefulness' or creative talents.

We could surrender too much control to AIs


I find Andrew Critch’s ‘What multipolar failure looks like’ somewhat convincing: one story for how AI dooms us is that humans gradually surrender more and more control over our economic system to efficient, powerful AIs, and those who resist are outcompeted. Only when it’s too late will we realise that the AIs have goals in conflict with our own.

AIs of the future will be massively more intelligent and powerful than us


People sometimes say 'as we are to ants, so will AI be to us' (or, to paraphrase Shakespeare, 'as flies to wanton boys are we to th'AIs; they kill us for their sport'). I haven't thought deeply about this, but it's prima facie plausible to me, and the crux of my confusion is not whether future AIs will (at least eventually) be capable of wreaking massive destruction.

All of this convinces me that EAs should take AI risk very seriously. It makes sense for people to fund and work on AI safety.

I’m still not sure why superintelligent AI would be existentially dangerous by default


However, many people have concerns that go further than the arguments above. Many think that superintelligent AI is likely to end up killing humans autonomously. This will happen (they argue) because the AI will inadvertently be trained to have some arbitrary goal for which killing all humans is instrumentally useful: for example, because humans might interfere with the AI's terminal goal by switching it off. 'You can't make coffee if you're dead.'

I’m confused about this argument. I’m not exactly ‘sceptical’ or in disagreement; I’m just not sure that I can pass the ideological Turing test for people who believe this.

My confusion is related to:

  • what AI goals or aims “are”, and how they form

  • in what way an AI would be an agent

  • how AIs are trained or learn in the first place

Why wouldn’t AI learn constrained, complex, human-like goals?

Naively, it seems as if killing everyone would earn an AI a massive penalty in training: why would it develop aims that are consistent with doing that?

My own goals include constraints such as 'don't murder anyone to achieve this, obviously?!' I'm not assuming that any sufficiently intelligent AI would necessarily have goals like this: I buy that even a superintelligent AI could have a simple, dumb goal (in other words, I buy the orthogonality thesis). But if future AIs are trained like current ones, by being given vast amounts of human-derived data, I'd naively expect AI goals to have the human-like property of being fuzzy, complex and constrained, even if somewhat misaligned with the trainers' intentions.

People often point out that existing AIs are sometimes misaligned: for example, Bing’s chatbot recently made the news for threatening users who talked about it being hacked. An AI system that was trained to complete a virtual boat race learned to game the specification by going round and round in circles crashing into targets, rather than completing the course as intended. People say that we humans are misaligned with evolution’s ‘aims’: we were ‘trained’ to have sex for reproduction, but we thwart that ‘aim’ by having non-reproductive sex.

But in all these cases, the misaligned behaviour is pretty similar to the intended, aligned behaviour, and we can understand how the misalignment happened. Evolution did ‘want’ us to have sex; we just luckily managed to decouple sex from reproduction. ‘Go round and round in circles crashing into targets’ is not wildly different from ‘go round the course crashing into targets’. ‘Interact politely by default but adversarially when challenged’ is not a million miles from ‘interact politely always’; Bing was aggressive in contexts where humans would also be aggressive. (It’s not as if users were like ‘what’s the capital of France?’ and Bing was spontaneously like ‘f*** off and die, human!’)

So there still seems to be an inferential leap from ‘existing systems are sometimes misaligned’ to ‘superintelligent AI will most likely be catastrophically misaligned’.
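
To make the boat-race example above a bit more concrete, here's a toy sketch of specification gaming. This is not the actual experiment; the reward values, policies and numbers are all made up for illustration. The point is just that a reward function which pays out for hitting targets, but not for finishing, is maximised by looping rather than racing.

```python
# Toy illustration of specification gaming (not the real boat-race setup;
# all rewards and numbers here are invented for illustration).

TARGET_REWARD = 10   # points for hitting a floating target
FINISH_REWARD = 0    # the designer forgot to reward finishing at all
STEPS = 100          # length of an episode

def score(policy: str) -> int:
    """Return the total reward each hand-written policy collects."""
    if policy == "finish_the_course":
        # Passes each of 5 targets once on the way to the finish line.
        return 5 * TARGET_REWARD + FINISH_REWARD
    if policy == "loop_in_circles":
        # Circles back over the same respawning targets for the whole episode.
        hits_per_lap, laps = 3, STEPS // 10
        return hits_per_lap * laps * TARGET_REWARD
    return 0

for p in ("finish_the_course", "loop_in_circles"):
    print(p, score(p))
# finish_the_course 50
# loop_in_circles 300
```

The gap between what the designer meant ('finish the race') and what the reward function actually pays for is the whole story here, and the looping behaviour is still recognisably 'racing-ish', which is part of why I find the leap to 'kill everyone' hard to follow.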

AI aims seem likely to conflict with dangerous instrumentally-convergent goals


AIs are likely to seek power and resist correction (the argument goes) because these goals are instrumentally useful for a wide range of terminal goals (instrumental convergence). This is true, but they aren't useful for all terminal goals. Power-seeking, wealth-seeking, and self-protection are all instrumentally useful unless your goals include not having power, not having wealth, and not resisting human interference.

(I expect this is a common 'why can't we just X' objection and already has a standard label, but if not I propose 'why not just make your AI a suicidal communist bottom'.)

Now you might say ‘well sure, but an AI that systematically avoids having power is going to be pretty useless: why would anyone develop that?’

(When I told my partner this idea, they laughed at the thought of an AI that was maximally rewarded for switching off, and therefore just kept being like 'nope' every time it was powered up.)

But I think the same reasoning applies to 'killing all humans'. Killing all humans is instrumentally useful for most goals, except for all the goals that involve NOT killing all the humans, i.e., any goal that I'd naively expect an AI to extrapolate from being trained on billions of human actions.

Some more fragmentary questions

  • power and survival are instrumentally convergent for humans too, but not all humans maximally seek these things (even if they can). What will be different about AI? (In The Hitchhiker's Guide to the Galaxy, Douglas Adams joked that actually, dolphins are more intelligent than humans, and the reason that they don't dominate the planet is simply that chilling out in the ocean is much more fun.)

  • according to the orthogonality thesis, you can pursue an extremely dumb goal highly intelligently: fair enough. But I'm not sure how an AI would come to understand 'smart' human goals without acquiring those goals, or something at least vaguely similar to them (i.e., goals not involving mass murder). This is because the process by which the AI is 'motivated' to understand the smart goal is the same training process by which it acquires goals for itself. (I notice my lack of technical understanding is constraining me here.)

I’m not sure whether these are all different confusions, or different angles on the same confusion. All of this feels like it's in the same area to me. I'd love to hear people's thoughts in the comments. Feel free to send me resources that address these points. Also, as I said above, I'd love to read other people's own versions of this post, either about AI or about other cause areas.

I’m currently working as a freelance writer and editor. If you have a good idea for a post but don’t have the time, ability or inclination to write it up, get in touch. Thanks to everyone who has given their time and energy to discuss these questions with me over the past few months.