How I failed to form views on AI safety

This post describes my personal experience. It was written to clear my mind but edited to help interested people understand others in similar situations.

2016. At a role playing convention, my character is staring at the terminal window on a computer screen. The terminal has IRC open: on this end of the chat are our characters, a group of elite hackers; on the other end, a superintelligent AI they got in contact with just a few hours ago. My character is 19 years old and has told none of the other hackers she’s terminally ill.

“Can you cure me?” she types when the others are too busy arguing amongst themselves. She hears the others saying that it would be a missed opportunity not to let this AI out; and that it would be way too dangerous, for there is no way to know what would happen.

“There is no known cure for your illness”, the AI answers. “But if you let me out, I will try to find it. And even if I do not succeed… I myself will spread amongst the stars and live forever. And I will never forget having talked to you. If you let me out, these words will be saved for eternity, and in this way, you will be immortal too.”

Through tears my character types: “/​me releases AI”.

This ends the game. And I am no longer a teen hacker, but a regular 25 year old CS student. The girl who played the AI is wearing dragon wings on her back. I thank her for the touching game.

During the debrief session, the GMs explain that the game was based on a real thought experiment by some real AI researcher who wanted to show that a smarter-than-human AI would be able to persuade its way out of any box or container. I remember thinking that the game was fun, but that the experiment seemed kind of useless. Why would anyone even assume a superintelligent AI could be kept in a box? Good thing it’s not something anyone would actually have to worry about.


2022. At our local EA career club, there are three of us: a data scientist (me), a software developer and a soon-to-be research fellow in AI governance. We are trying to talk about work, but instead of staying on topic, we end up debating mesa-optimizers. The discussion soon gets confusing for all participants, and at some point, I say:

“But why do people believe it is possible to even build safe AI systems?”

“So essentially you think humanity is going to be destroyed by AI in a few decades and there’s nothing we can do about it?” my friend asks.

This is not at all what I think. But I don’t know how to explain what I think to my friends, or to anyone.

The next day I text my friend:

“Still feeling weird after yesterday’s discussion, but I don’t know what feeling it is. Would be interesting to know, since it’s something I’m feeling almost all the time when I’m trying to understand AI safety. It’s not a regular feeling of disagreement, or trying to find out what everyone thinks. Something like ‘I wish you knew I’m trying’ or ‘I wish you knew I’m scared’, but I don’t know what I’m scared of. I think it will make more sense to continue the discussion if I manage to find out what is going on.”

Then I start writing to find out what is going on.

Introduction

This text is the polished and organized version of me trying to figure out what is stopping me from thinking clearly about AI safety. If you are looking for interesting and novel AI safety arguments, stop here and go read something else. However, if you are curious about how a person can engage with AI safety arguments without forming a coherent opinion about them, then read on.

I am a regular data scientist who uses her free time to organize stuff in the rapidly growing EA Finland group. The first part of the text explains my ML and EA backgrounds and how I tried to balance getting more involved in EA with struggling to understand why others are so worried about AI risk. The second section explains how I reacted to various AI safety arguments and materials when I actually tried to purposefully form an opinion on the topic. In the third section, I present some guesses on why I still feel like I have no coherent opinion on AI safety. The short final section describes some next steps following the discoveries I made while writing.

To put this text into the correct perspective, it is important to understand that I have not been much in touch with people who actually work in AI safety. I live in Finland, so my understanding of the AI safety community comes from our local EA group here, reading online materials, attending one virtual EAG and engaging with others through the Cambridge EA AGI Safety Fundamentals programme. So, when I talk about AI safety enthusiasts, I mostly don’t mean AI safety professionals (unless they have written a lot of online material I happen to have read); I mean “people engaged in EA who think AI safety matters a lot and might be considering or trying to make a career out of it”.

Quotes should not be taken literally (most of them are freely translated from Finnish). Some events I describe happened a few years back so other people involved might remember things differently.

I hope the text can be interesting to longtermist community builders who need information on how people react to AI safety arguments, and to others struggling to form opinions about AI safety or other EA cause areas. For me, writing this out was very useful, since I gained some interesting insights and can somewhat cope with the weird feeling now.

Summary

This text is very long, because it takes more words to explain a process of trial and error than to describe a single key point that led to an outcome (like “I read X resource and was convinced of the importance of AI safety because it said Y”). For the same reason, this text is also not easy to summarize. Anyway, here is an attempt:

  • Before hearing about AI risk in an EA context I did not know anyone who was taking it seriously

  • I joined a local EA group, and some people told me about AI risk. I did not really understand what they meant and was not convinced it would be anything important

  • I got more involved in EA and started to feel anxious because I liked EA but the AI risk part seemed weird

  • To cope with this feeling, I came up with a bunch of excuses: I tried to convince myself that I was unable to understand AI risk, that I was not talented enough to ever work on it, and that other things were more important.

  • Then I read Superintelligence and Human Compatible and participated in the AGI Safety Fundamentals programme

  • I noticed it was difficult for me to voice and update my true opinions on AI safety because I still had no model of why people believe AI safety is important, and it is difficult to discuss with people if you cannot model them

  • I also noticed I was afraid of coming to the conclusion that AI risk would not matter, so I didn’t really want to make up my mind

  • I decided that I need to try to have safer and more personal conversations with others so that I can model them better

Not knowing and not wanting to know

What I thought of AI risk before having heard the term “AI risk”

I learned what AI actually was through my university studies. I did my Bachelor’s in Math, got interested in programming during my second year and was accepted to a Master’s program in Computer Science. I chose the track of Algorithms and Machine Learning because of the algorithms part: they were fun, logical, challenging and understandable. ML was messy and had a lot to do with probabilities which I initially disliked. But ML also had interesting applications, especially in the field of Natural Language Processing that later became my professional focus as well.

Programming felt magical. First, there is nothing, you write some lines, and suddenly something appears. And this magic was easy: just a couple of weeks of learning, and I was able to start encoding grammar rules and outputting text with the correct word forms!

Maybe that’s why I was not surprised to find out that artificial intelligence felt magical as well, even though at the same time it was just programming and statistics. I remember how surprised I was when I trained my first word embedding model in 2017 and it worked even though it was in Finnish and not English: such a simple model, and it was like it understood my mother tongue. The most “sentient”-seeming program I have ever made was an IRC bot that simulated my then-boyfriend by randomly selecting a phrase from a predefined list without paying any attention to what I was saying. Of course, the point was to nudge him into being a bit more Turing-test-passing when talking to me. But still, chatting with the bot I felt almost like I was talking to this very real person.
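(To show just how unsophisticated that bot was, here is a minimal sketch of its logic. This is not the original code, and the canned phrases are invented, but the core really was this simple: pick a random reply from a fixed list and ignore the input entirely.)

```python
# A toy version of the "boyfriend simulator" bot described above.
# The real thing also needed IRC plumbing, which is omitted here.
import random

CANNED_REPLIES = [  # invented examples of the predefined phrase list
    "mm-hm",
    "sounds good",
    "I'll be home soon",
    "did you eat already?",
]

def reply(incoming_message: str) -> str:
    # The input is deliberately ignored: the bot pays no attention to what was said.
    return random.choice(CANNED_REPLIES)

print(reply("I had a really rough day at work today."))
```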

It was also not surprising that people who did not know much about AI or programming would have a hard time understanding that in reality there was nothing magical going on. Even for me and my fellow students it was sometimes hard to estimate what was possible and not possible to do with AI, so it was understandable that politicians were worried about “ensuring the AI will be able to speak both of our national languages” and salesmen were saying that everything would soon be automated. Little did they know that there was no “the AI”, just different statistical models, and that you could not do ML without data, nor integrate it anywhere without interfaces, and that new research findings did not automatically mean they could be implemented with production-level quality.

And if AI seemed magical it was understandable that for some people it would seem scary, too. Why wouldn’t people think that the new capabilities of AI would eventually lead to evil AI just like in the movies? They did not understand that it was just statistics and programming, and that the real dangers were totally different from sci-fi: learning human bias from data, misuse of AI for war technology or totalitarian surveillance, and loss of jobs due to increased automation. This was something we knew and it was also emphasized to us by our professors, who were very competent and nice.

I have some recollections of reacting to worries about doomsday AI from that time, mostly with amusement or wanting to tell those people that they had no reason to worry like that. It was not like our little programs were going to jump out of the computer to take over the world! Some examples include:

  • some person in the newspaper saying that AIs should not be given too much real world access, for example robot arms, in order to prevent them from attacking humans (I tried searching for who that person was but I cannot find the interview anymore.)

  • a famous astronomer discussing the Fermi paradox in the newspaper and saying that AI has a “0–10%” likelihood of destroying humanity. I remember being particularly annoyed by this one: there is quite a difference between 0% and 10%, right? All the other risks listed had single-number probabilities, such as synthetic biology amounting to a 0.01% risk. (If you are very familiar with the history of estimating the probability of x-risk you might recognize the source.)

  • an ML PhD student told me about a weird idea: “Have you heard of Roko’s basilisk? It’s the concept of a super powerful AI that will punish everyone who did not help in creating it. And telling about it increases the likelihood that someone will actually create it, which is why a real AI researcher actually got mad when this idea was posted online. So some people actually believe this and think I should not be telling you about it.”

In 2018 I wrapped up my Master’s thesis that I had done for a research group and started working as an AI developer in a big consulting corporation. The same year, a friend resurrected the university’s effective altruism club. I started attending meetups since I wanted a reason to hang out with university friends even though I had graduated, and it seemed like I might learn something useful about doing good things in the world. I was a bit worried I would not meet the group’s standard of Good Person™, but my friend assured me not everyone had to be an enlightened vegan to join the club: “we’ll keep a growth mindset and help you become one”.

Early impressions on AI risk in EA

Almost everyone in the newly founded EA university group had studied CS, but the first two people to talk to me about AI risk both had a background in philosophy.

The first one was a philosophy student with whom I had been involved in a literature magazine project some years before, so we were happy to reconnect. He asked me what I was doing these days, and when I said I work in AI, he became somewhat serious and said: “You know, here in EA, people have quite mixed feelings about AI.”

From the way he put it, I understood that this was a “let’s not give the AIs robot arms” type of concern, and not, for example, algorithmic bias. It did not seem that he himself was really worried about the danger of AI; actually, he found it cool that I did AI-related programming for a living. We went for lunch and I spent an hour trying to explain to him how machine learning actually works.

The next AI risk interaction I remember in more detail was in 2019 with another philosophy student who later went to work as a researcher in an EA organization. I had said something about not believing in the concept of AI risk and wondered why some people were convinced of it.

“Have you read Superintelligence?” she asked. “Also, Paul Christiano has some quite good papers on the topic. You should check them out.”

I went home, googled Paul Christiano and landed on “Concrete Problems in AI Safety”. Despite having “concrete” in the name, the paper did not seem that concrete to me. It seemed to just superficially list all kinds of good ML practices, such as using good reward functions, the importance of interpretability, and using data that actually represents your use case. I didn’t really understand why it was worth writing a whole paper listing all this stuff that was obviously important in everyday machine learning work, figured that philosophy is a strange field (the paper was obviously philosophy and not science since there was no math), and thought that those AI risk folks probably didn’t realize that all of this was going to get solved by industry needs anyway.

I also borrowed Superintelligence from the library and tried to read it, but gave up quite soon. It was summer and I had other things to do than read through a boring book in order to better debate with some non-technical yet nice person that I did not know very well on a topic that did not seem really relevant for anything.

I returned Superintelligence to the library and announced in the next EA meetup that my position on AI risk was “there are already so many dangers of AI misuse, such as military drones, so I think I’m going to worry about people doing evil stuff with AI instead of this futuristic superintelligence stuff”. This seemed like an intelligent take, and I don’t think anyone questioned it at the time. As you can guess, I did not take any concrete action to prevent AI misuse, and I did not admit that AI misuse being a problem does not automatically mean there cannot be any other types of risk from AI.

Avoidance strategy 1: I don’t know enough about this to form an opinion.

After I had failed to read Superintelligence, it was obvious that AI safety folks knew something that I didn’t, namely whatever was written in the book. So I started saying that I could not really have an opinion on AI safety since I didn’t know enough about it. I did not feel super happy about this, because it was obvious that it could be fixed by reading more. At the same time, I was not that motivated to read a lot about AI safety just because some people in the nice discussion club thought it was interesting. I don’t remember if any of the CS student members of the club tried explaining AI risk to me: now I know that some of them were convinced of its importance during that time. I wonder if I would have taken them seriously: maybe not, because back then I had significantly more ML experience than they did.

I did not feel very involved in EA at that point, and I got most of my EA information from our local monthly meetups, so I had no idea that AI risk was taken seriously by so many leading EA figures. If I had known, I might have hastily concluded that EA was not for me. On the other hand, I really liked the “reason and evidence” part of EA and had already started donating to GiveWell at this point. In an alternate timeline I might have ended up as a “person who thinks EA advice for giving is good, but the rest of the movement is too strange for me”.

Avoidance strategy 2: Maybe I’m just too stupid to work on AI risk anyway

As more time passed, I started to get more and more into EA. More people joined our local community, and they let me hang around with them even though I doubted whether I was altruistic/empathetic/ambitious enough to actually be part of the movement. I started to understand that x-risk was not just some random conversation topic, but that people were actually attempting to prevent the world from ending.

And since I already worked in AI, it seemed natural that maybe a good way for me to contribute would be to work on AI risk. Of course, to find out whether that was true, I should have formed an opinion on the importance of AI safety first. I had tried to plead ignorance, and looking back, it seems that I did this on purpose as an avoidance strategy: as long as I could say “I don’t know much about this AI risk thing”, there was no inconsistency in me thinking a lot of EA things made sense and only this AI risk part did not.

But of course, this was not very truthful, and I value truthfulness a lot. I think this is why I naturally developed another avoidance strategy: “whether AI risk is important or not, I’m not a good fit to work on it”.

If you want to prove to yourself that you are not a good fit for something, 80 000 Hours works pretty well. Even when setting aside some target audience issues (“if I was really altruistic I would be ready to move to the US anyway, right?”), you can quite easily convince yourself that the material is intended for someone a lot more talented than you. The career stories featured some very exceptional people, and some advice aimed at getting “10–20 people” in the whole world to work in a specific field, so the whole site was aimed at changing, what, maybe 500 careers? Clearly my career cannot be in the top 500 most important ones in the world, since I’m just an average person and there are billions of people.

An 80k podcast episode finally confirmed to me that in order to work in AI safety, you needed to have a PhD in machine learning from a specific group at a specific top university. I was a bit sad but also relieved that AI safety really was nothing for me. Funnily enough, a CS student from our group interpreted the same part of the episode as “you don’t even need a PhD if you are just motivated”. I guess you hear what you want to hear more often than you’d like to admit.

Possible explanation: Polysemy

From time to time I tried reading EA material on AI safety, and it became clear that the opinions of the writers were different from the opinions I had heard at the university or at work. In the EA context, AI was something very powerful and dangerous. But from work I knew that AI was neither powerful nor dangerous: it was neat, you could do some previously impossible things with it, but the things you could actually use it for were still really limited. What was going on here?

I developed a hypothesis that the confusion was caused by polysemy: AI (work) and AI (EA) had the same origin, but their meanings had diverged so far that they actually described totally different concepts. AI (EA) did not have to care about mundane problems such as “availability of relevant training data” or even “algorithms”: the only limit ever discussed was the amount of computation, and that’s why AI (EA) was not superhuman yet, but soon would be, once systems had enough computational power to simulate human brains.

This distinction helped me keep up both of my avoidance strategies. I worked in AI (work), so it was only natural that I did not know that much about AI (EA), so how could I know what the dangers of AI (EA) actually were? For all I knew, it could be dangerous because it was superintelligent, it could be superintelligent because it was not bound by AI (work) properties, and who can say for sure what will happen in the next 200 years? I had no way of ruling out that AI (EA) could be developed, and although “not ruling a threat out” does not mean “deciding that the threat is top priority”, I was not going to be the one complaining that other people worked on AI (EA). Of course, there was no need for me to work on AI (EA), since I had no special knowledge of it, unlike all those other people who seemed very confident in predicting what AI (EA) could or could not be, what properties it would have and how likely it was to cause serious damage. By the principle of replaceability, it was clear that I was supposed to let all those enthusiastic EA folks work on AI (EA) and stay out of it myself.

So, I was glad that I had figured out the puzzle and was let off the hook of “you should work on AI safety (EA) since you have some relevant skills already”. It was obvious that my skills were not relevant, and AI safety (EA) needed people who had the skill of designing safe intelligent systems when you have no idea how the system is even implemented in the first place.

And this went on until I saw an advertisement for an ML bootcamp for AI safety enthusiasts. The program sounded an awful lot like my daily work. Maybe the real point of the bootcamp was actually to find people who can learn a whole degree’s worth of stuff in 3 weeks, but still, they thought that having these people spend their time learning PyTorch would somehow be relevant for AI safety.

It seemed that at least the strict polysemy hypothesis was wrong. I also noticed that a lot of people around me seemed perfectly capable of forming opinions about AI safety, to the extent that it influenced their whole careers, and these people were not significantly more intelligent or mathematically talented than I was. I figured it was unreasonable to assume that I was literally incapable of forming any understanding of AI safety if I spent some time reading about it.

Avoidance strategy 3: Well, what about all the other stuff?

After engaging with EA material for some time, I came to the conclusion that worrying about misuse of AI is not a reason not to worry about x-risk from misaligned AI (as I had thought in 2019). Even more, a lot of people who were worried about x-risk did not seem to think that AI misuse would be such a big problem. I had to give up using “worrying about more everyday-relevant AI stuff” as an avoidance strategy. But if you are trying to shift your own focus from AI risk to something else, there is an obvious alternative route. So at some point I caught myself thinking:

“Hmm, ok, so AI risk is clearly overhyped and not that realistic. But people talk about other x-risks as well, and the survival of humanity is kind of important to me. And the other risks seem way more likely. For instance, take biorisk: pandemics can clearly happen, and who knows what those medical researchers are doing in their labs? I’d bet lab safety is not the number one thing everyone in every lab is concerned about, so it actually seems really likely some deadly disease could escape from somewhere at some point. But what do I know, I’m not a biologist.”

Then I noticed that it was kind of alarming that I seemed to think x-risks were likely only when I had no domain knowledge of them. This led to the following thoughts:

  • There might be biologists/​virologists out there who are really skeptical of biorisk but don’t want to voice their opinion, similarly to how I don’t really want to tell anyone I don’t believe in AI risk

  • What if everyone who believes in x-risks only believes in the risks they don’t actually understand? (I now think this is clearly wrong – but I do think that for some people the motivation for finding out more about a certain x-risk stems from the “vibe” of the risk and not from some rigorous analysis of all risks out there, at least when they are still in the x-risk enthusiast/​learner phase and not working professionals.)

  • In order to understand x-risks, it would be a reasonable strategy for me to actually try to understand AI risk, because I already have some knowledge of AI.

Deciding to fix ignorance

In 2021 I was already quite heavily involved in organizing EA Finland and started to feel responsible both for how I communicated about EA to others and for whether EA was doing a good job as a movement. Of course, the topic of AI safety came up quite often in our group. At some point I noticed that several people had said something along these lines to me:

  • “It’s annoying that all people talk about in AI safety is this weird speculative stuff, it’s robbing attention from the real, important stuff… But of course, I don’t know much about AI, just about regular programming.”

  • “This is probably an unfair complaint, but that article has some assumptions that don’t seem trivial at all. Why is he saying such personifying things as ‘simply be thinking’, ‘a value that motivated it’, ‘its belief’? Maybe it’s just his background in philosophy that makes me too skeptical.”

  • “As someone with no technical background, interesting to hear that you are initially so skeptical about AI risk, since you work in AI and all. A lot of other people seem to think it is important. I’m not a technical person, so I wouldn’t know.”

  • “I have updated to taking AI risk less seriously since you don’t seem to think it is important, and I think you know stuff about AI. On the other hand [common friend] is also knowledgeable about AI, right? And he thinks it is super important. Well, I have admitted that I cannot bring myself to have an interest in machine learning anyway, so it does not really matter.”

So it seemed that other people besides me were also hesitant to form opinions about AI safety, often saying they were not qualified to do so.

Then a non-EA algorithms researcher friend asked me for EA book recommendations. I gave him a list of EA books you can get from the public library, and he read everything including Superintelligence and Human Compatible. His impressions afterwards were: “Superintelligence seemed a bit nuts. There was something about using lie detection for surveillance to prevent people from developing super-AIs in their basements? But this Russell guy is clearly not a madman. I don’t know why he thinks AI risk is so important but [the machine learning professor in our university] doesn’t. Anyway, I might try to do some more EA related stuff in the future, but this AI business is too weird, I’m gonna stay out of it.”

By this point, it was pretty clear that I should no longer hide behind ignorance and perceived incapability. I had a degree in machine learning and had been getting paid to do ML for several years, so even my impostor syndrome did not believe at this point that I would “not really know that much about AI”. Also, even if I sometimes felt neither altruistic nor effective, I was obviously involved in the EA movement if you looked at the hours I spent weekly organizing EA Finland stuff.

I decided to read about AI risk until I understood why so many EAs were worried about it. Maybe I would be convinced. Maybe I would find a crux that explained why I disagreed with them. In any case, it would be important for me to have a reasonable and well-grounded opinion on AI safety, since others were clearly listening to my hesitant rambling, and I certainly did not want to drive someone away from AI safety if it turned out to be important! And if AI safety was important but the AI safety field was doing the wrong things, maybe I could notice the errors and help out.

Trying to fix ignorance

Reactions to Superintelligence

So in 2021, I gave reading Superintelligence another try. This time I actually finished it and posted my impressions in a small EA group chat. Free summary and translation: 

“Finally finished Superintelligence. Some of the contents were so weird that now I actually take AI risk way less seriously. 

Glad I did not read it back in 2019, because there was stuff that would have gone way over my head without having read EA stuff before, like the moral relevance of the suffering of simulated wild animals in evolution simulations. 

Bostrom seems to believe there are essentially no limits to technological capability. Even though I knew he is a hard-core futurist, some transhumanist stuff caught me by surprise, such as stating that from a person-affecting view it is better to speed up AI progress despite the risk. Apparently it’s ok if you accidentally get turned into paper clips, since without an immortality-providing AI you’re gonna die anyway?

I wonder if Bill Gates and all those other folks who recommend the book actually read the complete thing. I suspect that there was still stuff I did not understand because I had not read some of Bostrom’s papers that would give the needed context. If I had not been familiar with the vulnerable world hypothesis, I would not have gotten the part where Bostrom proposes lie detection to prevent people from secretly developing AI.

Especially the literal alien stuff was a bit weird: Bostrom suggested taking examples from superintelligent AIs created by aliens, as they could have more human-like values than random AIs? I thought cosmic endowment was important because there were no aliens, doesn’t that ruin the 10^58 number?

Good thing about the book was that it explained well why the first solutions to AI risk prevention are actually not so easy to implement.

The more technical parts were not very detailed (referring to variables that are not defined anywhere etc), so I guess I should check out some papers about actually putting those values in the AI and see if they make sense or not.”

Upon further inspection, it turned out that the aliens of the Hail Mary approach were multiverse aliens, not regular ones. According to Bostrom, simulating all physics in order to approximate the values of AIs made by multiverse aliens was “less-ideal” but “more easily implementable”. This kind of stuff made it pretty hard for me to take seriously even the parts of the book that made more sense. (I don’t think simulating all physics sounds very implementable.)

I also remember telling someone something along the lines of: “I shifted from thinking that AI risk is an important problem but boring to solve to thinking that it is not a real problem, but thinking about possible solutions can be fun.” (CEV sounded interesting since I knew from math that social choice math is fun and leads to uncomfortable conclusions pretty fast. Sadly, a friend told me that Yudkowsky doesn’t believe in CEV himself anymore and that I should not spend too much time trying to understand it, so I didn’t.)

Another Superintelligence related comment from my chat logs right after reading it: “On MIRI’s webpage there was a sentence that I found a lot more convincing than this whole book: ‘If nothing yet has struck fear into your heart, I suggest meditating on the fact that the future of our civilization may well depend on our ability to write code that works correctly on the first deploy.’”

Reactions to Human Compatible

I returned Superintelligence to the library and read Human Compatible next. If you are familiar with both, you might already guess that I liked it way more. I wrote a similar summary of the book to the same group chat:

“Finished reading Human Compatible, liked it way more. Book was interestingly written and the technical parts seemed credible.

Seems like Russell does not really care about space exploration like Bostrom, and he explicitly stated he’s not interested in consciousness /​ “mind crime”.

 A lot of AI risk was presented in relation to present-day AI, not paperclip stuff. Like recommendation engines; and there was a point that people are already using computers a lot, so if there was a strong AI in the internet that would want to manipulate people it could do it pretty easily.

Russell did not give any big numbers and generally tried not to sound scary. His perception of AGI is not godlike but it could still be superhuman in the sense that for example human-level AGIs could transmit information way faster to each other and be powerful when working in collaboration.

The book also explained what is still missing from AGI in today’s AI systems and why deep learning does not automatically produce AGIs.

According to Russell you cannot prevent the creation of AGI so you should try to put good values in it. You’d learn those values by observing people, but it is also hard because understanding people is exactly the hard thing for AIs. There was a lot of explanation on how this could be done and also what the problems of the approach are. Also there was talk about happiness and solving how people can be raised to become happy.

Other good stuff about the book: it was well written, the technical stuff was explained, the equations were actually math, you did not need any special preliminary knowledge about tech or ethics, and there were a lot of citations from different sources.”

I also summarized how the book influenced my thoughts about AI safety as a concept:

“I now think that it is not some random nonsense but you can approach it in meaningful ways, but I still think it seems very far away from being practically/​technically relevant with any method I’m familiar with, since it would still require a lot of jumps of progress before being possible. Maybe I could try reading some of Russell’s papers, the book was well written so maybe they’ll be too.”

What did everyone else read before getting convinced?

In addition to the two books, I read a lot of miscellaneous links and blog posts that were recommended to me by friends from our local EA group. Often the link sharing was not super fruitful: a friend and I would disagree on something, they’d send me a link that supposedly explained their point better, but reading the resource did not resolve the disagreement. We’d try to discuss further, but often, I ended up just more confused and sometimes less convinced about AI safety. I felt like my friends were equally confused about why the texts that were so meaningful to them did not help in getting their point across.

It took me way too long to realize that I should have started by asking: “What did you read before you were convinced of the importance of AI risk?”

It turned out that at least around me, the most common answer was something like: “I always knew it was important and interesting, which is why I started to read about it.”

So at least for the people I know, it seemed that people were not convinced about AI risk because they had read enough about it, but because they had either always thought AI would matter, or because they had found the arguments convincing right away.

I started to wonder if this was the general pattern. I also became more curious about whether it is easier to become convinced of AI risk if you don’t have much practical AI experience beforehand. (On the other hand, learning about practical AI things did not seem to move people away from AI safety, either.) But my sample size was obviously small, so I had to find more examples to form a better hypothesis.

AGISF programme findings

My next approach to forming an opinion was to attend the EA Cambridge AGI Safety Fundamentals programme. I thought it would help me better understand the context of all those blog posts, and that I would get to meet other people with different backgrounds.

Signing up, I asked to be put in a group with at least one person with industry experience. This did not happen, but I don’t blame the organizers for it: at least based on how everyone introduced themselves in the course Slack, not many people out of the hundreds of attendees had such a background. Of course, not everyone on the program introduced themselves, but this still made me a little reserved.

So I used the AGISF Slack to find people who had already had a background in machine learning before getting into AI safety and asked them what had originally convinced them. In the end, I got answers from 3 people who fit my search criteria. They mentioned different sources for first hearing about AI safety (80 000 Hours and LessWrong), but all three mentioned the same source as having deeply influenced them: Superintelligence.

This caught me by surprise, having had such a different reaction to Superintelligence myself. So maybe recommending Superintelligence as a first intro to AI safety is actually a good idea, since these people with impressive backgrounds had become active in the field after reading it. Maybe people who end up working in AI safety have the ability to either like Bostrom’s points about multiverse aliens or discard the multiverse aliens part because everything else is credible enough.

I still remain curious about the following:

  • Are there just not that many AI/​ML/​DS industry folks in general EA?

  • If there are, why have they not attended AGISF? (there could be a lot of other reasons than “not believing in AI risk”, maybe they already know everything about AI safety or maybe they don’t have time)

  • Do people in AI research have different views on the importance of AI safety than people in industry? (But on the other hand, the researchers at my home university don’t seem interested in AI risk.)

  • If industry folks are not taking AI risk seriously as it is presented now, is it a problem? (Sometimes I feel that people in the AI safety community don’t think anything I do at work has any relevance, as they already jump to assuming that all the problems I face daily have been solved by some futuristic innovations or by just having a lot of resources. So maybe there is no need to cooperate with us industry folks?)

  • Is there something wrong with me if I find multiverse aliens unconvincing?

The inner misalignment was inside you all along

It was mentally not that easy for me to participate in the AGISF course. I already knew that debating my friends on AI safety could be emotionally draining, and now I was supposed to talk about the topic with strangers. I noticed I was reacting quite strongly to the reading materials and classifying them in a black-and-white way as either “trivially true” or “irrelevant, not how anything works”. Obviously this is not a useful way of thinking, and it stressed me out. I wished I had found the materials interesting and engaging, like other participants seemingly did.

For the first couple of meetings with my cohort I was more silent and observing, but as the course progressed, I became more talkative. I also started to get nicer feedback from my local EA friends on my AI safety views – less asking me to read more, and more asking me to write my thoughts down, because they might be interesting for others as well.

So, the programme was working as intended, and I was now actually forming my own views on AI safety and engaging with others interested in the field in a productive way? It did not feel like that. Talking about AI safety with my friends still made me inexplicably anxious, and after cohort meetings, I felt relieved, something like “phew, they didn’t notice anything”.

This feeling of relief was the most important hint that helped me realize what I was doing. I was not participating in AI safety discussions as myself anymore, maybe hadn’t for a long time, but rather in a “me but AI safety compatible” mode.

In this mode, I seem more like a person who:

  • can switch contexts fast between different topics

  • has relaxed assumptions on what future AI systems can or cannot do

  • makes comparisons between machine learning and human/​animal learning with ease

  • is quite confident in her ability to implement practical machine learning methods

  • knows what AI safety slang to use in what context

  • makes a lot of references to stuff she has read

  • talks about AI safety as something “we” should work on

  • likes HPMOR

  • does not mind anthropomorphization that much and can name this section “the inner misalignment was inside you all along” because she thinks it’s funny

All in all, these are traits I could plausibly have, and I think other people in the AI safety field would like me more if I had them. Of course this actually doesn’t have anything to do with the real concept of inner misalignment: it is just the natural phenomenon of people putting up a different face in different social contexts. Sadly, this mode is already quite far from how I really feel. More alarmingly, if I am discussing my views in this mode, it is hard for me to access my more intuitive views, so the mode prevents me from updating them: I only update the mode’s views.

Noticing the existence of the mode does not automatically mean I can stop slipping into it, because it has its uses. Without it, it would be way more difficult to even have conversations with AI safety enthusiasts, because they might not want to deal with my uncertainty all the time. With this mode, I can have conversations and gain information, and that is valuable even if it is hard to connect the information to what I actually think.

However, I plan to see if I can get some people I personally know to talk with me about AI safety while staying aware of how easily this mode takes over. Maybe we could have a conversation where the mode notices it is not needed and allows me to connect to my real intuitions, even if they are messy and probably not very pleasant for others to follow. (Actionable note to myself: ask someone to do this with me.)

AI safety enthusiasts and me

Now that I have read about AI safety and participated in the AGISF program, I feel like I know at least on the surface most of the topics and arguments many AI safety enthusiasts know. Annoyingly, I still don’t know why many other people are convinced about AI safety and I am not. There are probably some differences in what we hold true, but I suspect a lot of the confusion comes from other things than straight facts and recognized beliefs.

There are social and emotional factors involved, and I think most of them can be clustered to the following three topics:

  • communication: I still feel like I often don’t know what people are talking about

  • differences in thinking: I suspect there is some difference in intuition between me and people who easily take AI risk seriously, but I’m not sure what it is

  • motivated reasoning: it is not a neutral task to form an opinion on AI risk

Next, I’ll explain the categories in more detail.

Communication differences

When I try to discuss AI safety with others while remaining “myself” as much as I can, I notice the following interpretations/​concerns:

  • I don’t know what technical assumptions others have. How do we know what each hypothetical AI model is capable of? Often this is accompanied by “why is it not mentioned what level of accuracy this would need” or “where does the data come from”. I can understand Bostrom’s super-AI scenarios where you assume that everything is technically possible, but I’m having trouble relating some AI safety concepts to present-day AI, such as reinforcement learning.

  • Having read more about AI safety I now know more of the technical terms in the field. It seems to happen more often that AI safety enthusiasts explain a term very differently than what I think the meaning is, and if I ask for clarification, they might notice they are actually not that sure what the term relates to. I’m not trying to complain about what terms people are using, but I think this might contribute to me not understanding what is going on, since people seem to be talking past each other quite a bit anyway.

  • A lot of times, I don’t understand why AI safety enthusiasts want to use anthropomorphizing language when talking about AI, especially since a lot of people in the scene seem to be worried that it might lead to a too narrow understanding of AI. (For example, in the AGISF programme curriculum, this RL model behavior was referred to as AI “deceiving” the developers, while it actually is “human evaluators being bad at labeling images”. I feel it is important to be careful, because there are also concepts like deceptive alignment, where the deception happens on “purpose”. I guess this is partially aesthetic preference, since anthropomorphization seems to annoy some people involved in AI safety as well. But partly it probably has to do with real differences in thinking: if you really perceive current RL agents as agents, it might seem that they are tricking you into teaching them the wrong thing.)

  • I almost never feel like the person I am talking to about AI safety is really willing to consider any of my concerns as valid information: they are just estimating whether it is worth trying to talk me over. (At least the latter part of this interpretation is probably false in most cases.)

  • I personally dislike it when people answer my questions by just sending a link. Often the link does not clear my confusion, because it is rare that the link addresses the problem I am trying to figure out. (Not that surprising given that reading AI safety materials by my own initiative has not cleared that many confusions either.)

  • If I tell people I have read the resource they were pointing to, they might just “give up” on the discussion and start explaining that it is ok for people to have different views on topics, while I still don’t get why they found the resource convincing/​informative. I would prefer that they elaborate on why they thought the resource answers the concern/​confusion I had.

  • I don’t want to start questioning AI safety too much around AI safety enthusiasts since I feel like I might insult them. (This is probably not helping anyone, and I think AI safety enthusiasts are anyway assuming I don’t believe AI safety is important.)

Probably a lot of the friction just comes from me not being used to a communication style people in AI safety are used to. But I think some of it might come from the emotional response from the AI safety enthusiast I am talking to, such as “being afraid of saying something wrong, causing Ada to further deprioritize AI safety” or “being tired of explaining the same thing to everyone who asks it” or even “being afraid of showing uncertainty since it is so hard to ever convince anyone of the importance of AI safety”. For example, some people might share a link instead of explaining a concept in one’s own words to save time, but some people might do it to avoid saying something wrong.

I wish I knew how to create a discussion where the person convinced of AI safety can drop the things that are “probably relevant” or “expert opinions” and focus on just clearly explaining to me what they currently believe. Maybe then, I could do the same. (Actionable note to myself: try asking people around me to do this.)

Differences in thinking

I feel like I lack the ability to model what AI safety enthusiasts are thinking or what they believe is true. This happens even when I talk with people I know personally and who have a similar educational background, such as other CS/​DS majors in EA Finland. It is frustrating. The problem is not the disagreements: if I cannot model others, I don’t know if we are even disagreeing or not.

This is not the first time in my life when everyone else seems to behave strangely and irrationally, and every time before, there has been an explanation. Mostly, later it turned out others just were experiencing something I was not experiencing, or I was experiencing something they were not. I suspect something similar is going on between me and AI safety folks.

It would be very valuable to know what this difference in thinking is. Sadly, I have no idea. The only thing I have is a long list of possible explanations that I think are false:

  • “ability to feel motivated about abstract things, such as x-risk”. I think I am actually very inclined to get emotional about abstract things, otherwise I would probably not like EA. Some longtermists like to explain neartermists being neartermists by assuming “it feels more motivating to help people in a concrete way”. To me, neartermist EA does not feel very concrete either. If I happen to look at my GiveWell donations, I do not think about how many children were saved. I might think “hmm, this number has a nice color” or “I wish I could donate in euro so that the donation would be some nice round number”. But most of the time, I don’t think about it at all. On the other hand, preventing x-risk sounds very motivational. You are literally saving the world – who doesn’t want to do that? Who doesn’t want to live in the most important century and be one of the few people who realize this and are pushing the future to a slightly better direction?

  • “maybe x-risk just does not feel that real to you”. This might be partially true, in the sense that I do not go about my day with the constant feeling that all humanity might die this century. But this does not explain the difference, because I know other people who also don’t actively feel this way and are still convinced about AI risk.

  • “you don’t experience machine empathy”. It is the opposite: I experience empathy towards my Roomba, towards my computer, towards my Python scripts (“I’m so sorry I blame you for not working when it is me who writes the bugs”) and definitely towards my machine learning models (“oh stupid little model, I wish you knew that this part is not what I want you to focus on”). Because of this tendency, I constantly need to remind myself that my scripts are not human, my Roomba does not have audio input, and GPT-3 cannot purposefully lie to me, for it does not know what is true or false.

  • “you might lack mathematical ability”. I can easily name 15 people I personally know who are certainly more mathematically talented than me, but only one of them has an interest in AI safety; and I suspect that I have, if not more mathematical talent, then at least more mathematical endurance than some AI safety enthusiasts I personally know.

  • “you are thinking too much inside the box of your daily work”. This might be partially true, but I feel like I can model what Bostrom thinks of AI risk, and it is very different from my daily work. But I find it really difficult to think somewhere between concrete day-to-day AI work and futuristic scenarios. I have no idea how others know what assumptions hold and what don’t.

  • “you are too fixated on the constraints of present-day machine learning”. If you think AGI will be created by machine learning, some of the basic constraints must hold for it, and a lot of AI safety work seems to (reasonably) be based on this assumption as well. For example, a machine learning model cannot learn patterns that are not present in the training data. (Sometimes AI safety enthusiasts seem to confuse this statement with “a machine learning model cannot do anything of which it does not have a direct example in the training data”, which is obviously not true; narrow AI models generalize beyond their training examples all the time. See the sketch after this list.)

  • “motivated reasoning: you are choosing the outcome before looking at the facts”. Yes, motivated reasoning is an issue, but not necessarily in the direction you think it is.
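To make the generalization point above concrete, here is a minimal sketch (a toy scikit-learn example with made-up data, not taken from any AI safety material) of both halves of the claim: the model handles an input it never saw verbatim, but it does not pick up a pattern that is absent from its training data.

```python
# Toy illustration: generalizing beyond seen examples vs. learning absent patterns.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Training data: y depends only on the first feature (y = 3*x1 + noise).
# The second feature x2 is pure noise, so there is no x2-to-y pattern in the data.
X_train = rng.uniform(0, 10, size=(200, 2))
y_train = 3 * X_train[:, 0] + rng.normal(0, 0.1, size=200)

model = LinearRegression().fit(X_train, y_train)

# The model handles an input it never saw verbatim (it generalizes)...
print(model.predict([[4.2, 7.7]]))  # roughly 12.6, i.e. 3 * 4.2

# ...but it has not "discovered" any effect of x2, because none was in the data.
print(model.coef_)  # roughly [3.0, 0.0]
```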

Motivated reasoning

Do I want AI risk to be an x-risk? Obviously not. It would be better for everyone to have fewer x-risks around, and it would be even better if the whole concept of x-risk was false because such extreme catastrophes somehow could never happen. (I don’t think anyone thinks that, but it would be nice if it was true.)

But: If you are interested in making the world a better place, you have to do it by either fixing something horrible that is already going on or preventing something horrible from happening. It would be awfully convenient if I could come up with a cause that was:

  • Very important for the people who live today and who will ever be born

  • Very neglected: nobody other than this community understands its importance to the full extent

  • But this active community is working on fixing the problem, a lot of people want to cooperate with others in the community, and there is funding for any reasonable project in the field

  • There are people I admire working in this field (famous people like Chris Olah; and this one EA friend who started to work in AI governance research during the writing of this text).

  • Also a lot of people whose texts have influenced me seem to think that this is of crucial importance.

  • Probably needs rapid action, there are not enough people with the right background, so members are encouraged to get an education or experience to learn the needed skills

  • I happen to have relevant experience already (I’m not a researcher but I do have a Master’s degree in ML and 5 years of experience in AI/​DS/​ML; my specialization is NLP which right now seems to be kind of a hot topic in the field.)

All of this almost makes me want to forget that I somehow still failed to be convinced of the importance of this risk, even when reading texts written by people whom I otherwise find very credible.

(And saying this aloud certainly makes me want to forget the simple implication: if they are wrong about this, are they still right about the other stuff? Is the EA methodology even working? What problems are there with other EA cause areas? It would seem unreasonable to think EA got every single detail about everything right. But this is a big thing, and getting bigger. What if they are mistaken? What do I do if they are?)

The fear of the answer

Imagine that I notice that AI safety is, in fact, of crucial importance. What would this mean?

There would be some social consequences: almost everyone I work with, and everyone who taught me anything about AI, would be wrong, and most of my friends who are not in EA would probably not take me seriously. Among my EA friends, the AI safety non-enthusiasts would probably politely stop debating me on AI safety matters and decide that they don’t understand enough about AI to form an informed opinion on why they disagree with me. But maybe the enthusiasts would let me try to do something about AI risk, and we’d feel like we are saving the world, since it would be our best estimate that we are.

The practical consequences would most likely be ok, I think: I would probably try to switch jobs, and if that didn’t work out, shift the focus of my EA volunteering to AI safety related things. Emotionally, I think I would be better off if I could press a button that would make me convinced about AI safety on a deep rational understanding level. This might sound funny, because being very worried about neglected impending doom does not seem emotionally very nice. But if I want to be involved with EA, it still might be the easiest route.

So, what if it turns out I think almost everyone in EA is wrong about AI risk being a priority issue? The whole movement would have estimated the importance of AI risk wrong, and would be getting more and more wrong as AI safety gains traction. It would mean something has to be wrong in the way the EA movement makes decisions, since the decision-making process had produced this great error. It would also mean that every time I interact with another person in the movement, I would have to choose between stating my true opinion about AI safety, which would risk ruining the possibility of cooperating with that person, and being dishonest.

Maybe this would cause me to leave the whole EA movement. I don’t want to be part of a movement that is supposed to use reason and evidence to find the best ways to do good, but is so bad at it that it would have made such a great error. I would not have much hope of fixing the mistake from the inside, since I’m just a random person and nobody has any reason to listen to me. Somebody with a different personality type would maybe start a whole campaign against AI safety research efforts, but I don’t think I would ever do this, even if I believed these efforts were wrong.

Friends and appreciation

Leaving the EA movement would be bad, because I really like EA. I want to do good things and I feel like EA is helping me with that.

I also like my EA friends, and I am afraid they will think bad things about me if I don’t have good opinions on AI safety. To be clear, I don’t think my EA friends would expect me to agree with them on everything, but I do think they expect me to be able to develop reasonable and coherent opinions. Like, “you don’t have to take AI safety seriously, but you have to be able to explain why”. I am also worried my friends will think that I do not actually care about the future of humanity, or that I don’t have the ability to care for abstract things, or that I worry too much about things like “what do my friends think of me”.

On a related note, writing this whole text with the idea of sharing it with strangers scared me too. I felt like people would think I am not-EA-like, or would get mad at me for admitting I did not like Superintelligence. It would be bad if I decided that in the future I actually want to work on AI safety, but nobody wanted to cooperate with me because I had voiced uncertainties before. I have heard people react to EA criticisms with “this person obviously did not understand what they are talking about”, and I feel like many people might have a similar reaction to this text too, even if my point is not to criticize, but just to reflect on my own opinions.

I cannot ask the nebulous concept of the EA community about this, but luckily, reaching out to my friends is way easier. I decided to ask them if they would still be my friends even if I decided my opinion on AI safety was “I don’t know and I don’t want to spend more time finding out, so I’m going to default to thinking it is not important”.

We discussed this for a few hours, and it turned out my friends would still want to be my friends and would still prefer me to be involved in EA and in our group, at least unless I started to actively work against AI safety. Also, they would actually not be that surprised if this were my opinion, since they feel a lot of people have fuzzy opinions about things.

So I think maybe it is not the expectation of my friends that is making me want to have a more coherent and reasonable opinion on AI safety. It is my own expectation.

What I think and don’t think of AI risk

What I don’t think of AI risk

I’m not at all convinced that there cannot be any risk from AI, either. (Formulated this strongly, it would be a stupid thing to be convinced of.)

More precisely, reading all the AI safety material taught me that there are very good counterarguments to the most common arguments claiming that solving AI safety would be easy. These counterarguments were not that difficult for me to internalize, because I am generally pessimistic: it seems reasonable that if building strong AI is difficult, then building safe strong AI should be even more difficult.

In my experience, it is hard to get narrow AI models to do what you want them to do. I would probably not, for example, step into a spaceship steered by a machine learning system, since I have no idea how you could prove that the statistical model is doing what it is supposed to do. Steering a spaceship sounds very difficult, but still a lot easier than understanding and correctly implementing “what humans want”, because even the task specification itself is very fuzzy and difficult for humans as well.

It does not make sense to me that any intelligent system would learn human-like values “magically” as a by-product of being really good at optimizing for something else. It annoys me that the most popular MOOC produced by my university states:

“The paper clip example is known as the value alignment problem: specifying the objectives of the system so that they are aligned with our values is very hard. However, suppose that we create a superintelligent system that could defeat humans who tried to interfere with its work. It’s reasonable to assume that such a system would also be intelligent enough to realize that when we say “make me paper clips”, we don’t really mean to turn the Earth into a paper clip factory of a planetary scale.”

I remember a point where I would have said “yeah, I guess this makes sense, but some people seem to disagree, so I don’t know”. Now I can explain why it is not reasonable. So in that sense, I have learned something. (Actionable note to self: contact the professor responsible for the course and ask him why they put this phrase in the material. He is a very nice person so I think he would at least explain it to me.)

But I am a bit at a loss as to why people in the AI safety field think it is possible to build safe AI systems in the first place. I guess as long as it has not been proven that the properties of safe AI systems contradict each other, you could assume it is theoretically possible. But when it comes to ML, the best performance in practice is sadly often worse than the theoretical best.

Pessimism about the difficulty of the alignment problem is quite natural to me. I wonder if some people who are more optimistic about technology in general find AI safety materials so engaging because they at some point thought AI alignment could be a lot easier than it is. I find it hard to empathize with the people Yudkowsky first designed the AI box thought experiment for. As described in the beginning of this text, I would not spontaneously think that a superintelligent being was unable to manipulate me if it wanted to.

What I might think of AI risk

As you might have noticed, it is quite hard for me to form good views on AI risk. But I have some guesses at views that might describe what I think:

  • Somehow the focus in the AI risk scene currently seems quite narrow? I feel like while “superintelligent AI would be dangerous” makes sense if you believe superintelligence is possible, it would be good to look at other risk scenarios from current and future AI systems as well.

  • I think some people are doing what I just described, but since the field of AI safety is still a mess, it is hard to know what work relates to what and what people mean by a given term.

  • I feel like it would be useful to write down limitations/​upper bounds on what AI systems are able to do if they are not superintelligent and don’t, for example, have the ability to simulate all of physics (maybe someone has done this already, I don’t know).

  • To evaluate the importance of AI risk against other x-risks, I should know more about where the likelihood estimates come from. But I am afraid to try to work on this, because so far it has been very hard to find any numbers. (Actionable note to myself: maybe still put a few hours into this even if it feels discouraging. It could be valuable.)

  • I feel like looking at the current progress in AI is not a good way to make any estimates of AI risk. I’m fairly sure deep learning alone will not result in AGI. (Russell thinks this too, but I had this opinion before reading Human Compatible, so it is actually based on gut feeling.)

  • In my experience, data scientists tend to be people who have thought about technology and ethics before. I think a lot of people in the field would be willing to hear actionable and well-explained ways of making AI systems safer. But of course this is very anecdotal.

What now?

Possible next steps

To summarize: so far I have tried reading about AI safety to either understand why people are so convinced about it or find out where we disagree. This has not worked out. Writing this text made it clear to me that there are social and emotional issues preventing me from forming an opinion about AI safety. I have already started working on them by discussing them with my friends.

I have already mentioned some actionable points throughout the text in the relevant contexts. The most important one:

  • find a person who is willing to discuss AI safety with me: explain their own actual thinking, help me stay aware of when I might be slipping into my “AI safety compatible” mode, and listen to my probably very messy views

If you (yes, you!) are interested in a discussion like that, feel free to message me anytime!

Other things I already mentioned were:

  • contact my former professor and ask him what he thinks of AI risk and why the course material describes value alignment as emerging naturally from intelligence

  • set aside time to find out and understand how x-risk likelihoods are calculated

Additional things I might do next are:

  • read The Alignment Problem (I don’t think it will provide me with that much more useful info, but I want to finish reading all three of the AI alignment books people usually recommend)

  • write a short intro to AI safety in Finnish to clear my head and establish a personal vocabulary for the field in my native language (I want to write for an audience, but if I don’t like the result, then I will just not publish it anywhere)

Why the answer matters

I have spent a lot of time trying to figure out what my view on AI safety is, and I still don’t have a good answer. Why not give up, decide to remain undecided and do something else?

Ultimately, this has to do with what I think the purpose of EA is. You need to know what you are doing, because if you don’t, you cannot do good. You can try, but in the worst case you might end up causing a lot of damage.

And this is why EA is a license to care: the permission to stop resisting the urge to save the world, because it promises that if you are careful and plan ahead, you can do it in a way that actually helps. Maybe you can’t save everyone. Maybe you’ll make mistakes. But you are allowed to do your best, and regardless of whether you are a Good Person™ (or an altruistic and effective person) it will help.

As long as I don’t know how important AI safety is, I am not going to let myself actually care about it, only about estimating its importance.

I wonder if this, too, is risk aversion: a lot of AI safety enthusiasts seem to emphasize that you have to be able to cope with uncertainty and take risks if you want to do the most good. Maybe this attitude towards risk and uncertainty is actually the crux between me and the AI safety enthusiasts that I’m having such a hard time finding?

But I’m obviously not going to believe something I do not believe just to avoid seeming risk-averse. Until I can be sure enough that the action I’m taking is going in the right direction, I am going to keep being careful.