My Proven AI Safety Explanation (as a computing student)
As a computer science student, I’m often asked for my thoughts on AI, so I’ve had plenty of opportunities to explain my objections to superintelligence.
I can’t think of a single time I’ve failed to convince someone of the dangers, at least not since I started using this approach. It even works on other computer science students, including those who’ve spent more time studying AI than I have.
I was worried that this might sound too much like a previous post I’ve written, but at least the new title should be useful for circulating this information.
The Pitch
AI often makes mistakes.
One kind of mistake that AI often makes is called “reward hacking”. The way AI typically works is that you give it a reward function that grants it points for doing what you want it to do, but often, the reward function can be satisfied without doing what you actually wanted. For example, someone gave a Roomba the task of not bumping into any walls. It learned to drive backwards, because it has no bump sensors on the back. You could also imagine a robot that’s programmed to get you a cup of coffee. On the way, it steps on a baby, because we never told it to care about the baby.
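For readers who like seeing things in code, here’s a minimal, hypothetical sketch of the Roomba example: a toy one-dimensional hallway where the reward function only checks a front bump sensor. Everything in it (the hallway, the sensor, the scoring) is invented for illustration, not taken from any real system.

```python
# Toy illustration of reward hacking: the reward only looks at the front
# bump sensor, so "drive backwards" scores perfectly while still hitting walls.
# All names and numbers here are made up for illustration.

def run_episode(policy, steps=100):
    """Simulate a robot in a 1-D hallway (positions 0..10) and score it."""
    position = 5
    reward = 0
    wall_hits = 0  # what we actually care about, but never reward directly
    for _ in range(steps):
        direction = policy()  # +1 = drive forward, -1 = drive backward
        position += direction
        hit_wall = position <= 0 or position >= 10
        if hit_wall:
            wall_hits += 1
            position = max(0, min(10, position))
        # The reward function only "sees" the front bump sensor, which
        # triggers when the robot hits a wall while driving forward.
        front_sensor_triggered = hit_wall and direction == 1
        reward += 0 if front_sensor_triggered else 1
    return reward, wall_hits


def forward_policy():
    return 1


def backward_policy():
    return -1


for name, policy in [("forward", forward_policy), ("backward", backward_policy)]:
    reward, wall_hits = run_episode(policy)
    print(f"{name}: reward={reward}, actual wall hits={wall_hits}")

# The backward policy earns a perfect reward while still hitting the wall
# constantly, because the back of the robot has no bump sensor: the specified
# reward was satisfied without doing what we actually wanted.
```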
It gets worse when the AI is incredibly smart, so long as it still has ways of affecting the real world without human intervention. The worst-case scenario is if you told an AI to maximize paperclip production. If it’s really smart, it will realize that we don’t actually want this; we don’t want to turn all of our electronic devices into paperclips. But it’s not going to do what you wanted it to do; it will do what it was programmed to do.[1] So, if the AI realizes that you would try to stop it from achieving its goal, it might try to get rid of you.
And, importantly, there would be no hope of us outsmarting it. It would be like a chimpanzee trying to outsmart a human. If we really wanted to, we could wipe out every chimpanzee; we’re kind of doing it already by accident, and we’re not even that much smarter than they are. Now imagine an AI with that same difference in intelligence that actually wanted to kill us. That’s what we should be worried about.
Not Everybody Should Do Outreach
I was originally going to write this as part of a much longer post, tentatively titled “AI Safety isn’t Weird But You Are”[2], but I was convinced it would be better to release a shorter post just to get this idea out there, so that more people read it and the information spreads more quickly. However, I do want to briefly explain something tangentially related to this topic. PUBLIC COMMUNICATIONS ON AI SAFETY SHOULD ABSOLUTELY NOT BE DONE BY RACISTS, NAZIS, SEX OFFENDERS, OR ANY OTHER DEPLORABLE PEOPLE.[3] I hope this point is obvious to most people reading this, but I have seen actual debate over this question[4]. I’d like to think that very few people believe otherwise, but I’m not sure. I’ve seen people say they think AI alignment is too important for “petty arguments” about things like human rights. If they actually believed this, I’d think they would expel every racist from the community[5] to keep the community from gaining a reputation of being okay with racism.
Tip: Talk About Already Existing Problems in AI
One common criticism I see of the AI Safety community from outsiders is that we don’t talk about the current dangers of AI. I think this is because, as effective altruists, we don’t see those as major issues. However, I think these topics are still worth discussing. Not only would it make this objection go away, it also makes people more willing to think about the dangers of AI.
Lots of people have heard about Apple’s Face ID not being able to distinguish between two different Black faces. Now imagine this kind of AI being used to determine who is a civilian and who is a militant, and it starts assuming that everyone with a particular skin tone is a militant. I’ve never used this particular hypothetical as an argument before, but I can imagine it would be very convincing.
If more of our arguments looked like this, I imagine it would boost the community’s credibility. Of course, this isn’t the worst-case scenario we tend to talk about, but it makes the worst-case scenario seem more credible as well; if this doesn’t convince them, I don’t know what would. And as an added bonus, if you’re talking to someone who will never find the worst-case scenario credible, they should at least be concerned enough by this to still care about the dangers of AI.
How Often Does This Work?
I thought it might be worth showing how well this strategy works. In case the title implied otherwise, I won’t rigorously explain why it works; I’m a computer scientist, not a psychologist. But I think it works because it builds on knowledge people already have. People already know that AI fails sometimes, so it’s pretty easy to take that to its logical conclusion.
The point of this section is instead to give a list of times when I’ve used this technique, in the hope that it’s enough to convince you that it’s a pretty useful one.
The first time I tried this was at a club fair to promote my effective altruism club at the Rochester Institute of Technology. On our table, we showed some posters we had designed for upcoming discussions, and the one about AI Safety caught at least one person’s eye.
This person asked me why there was a meeting dedicated to artificial intelligence, and I had to come up, on the spot, with a good explanation that didn’t sound paranoid or crackpottish. It took some thinking, and I don’t remember the exact words, but it sounded something like this:
“As artificial intelligence gets smarter, eventually it’s going to have the ability to do serious harm to people, so we want to make sure that it’s doing what we want it to do, and not mistaking our intentions or being programmed with bad intent.”
The person didn’t exactly seem enthusiastic, but accepted that response.
During the meeting, someone from the campus AI Club came by to invite us to give a talk about AI safety. Club alumnus Nick and I prepared the presentation using the principle 3Blue1Brown follows: gently guide the audience to the conclusion, so it seems as though they came up with the idea themselves. You can watch the full presentation on YouTube.
During the presentation, there were no obvious objections from the audience, and many of the people I could see seemed to be in clear agreement. When I talked to the club’s executive board afterwards, they were all very interested, and we kept talking about it for roughly another half hour.
For context, after our club meetings we often head somewhere for dinner in what we call Tangent Time, where we discuss anything we didn’t get to during the meeting, either because it was off-topic or because we ran out of time. I don’t remember exactly how we got onto the topic, but AI came up, and that day we had some new members with us. Nick gave the least convincing argument possible for why AI is dangerous: asserting it as true. This is the argument-by-assertion fallacy, and it did not convince one of our new members, who, it’s worth briefly noting, is in a master’s program for computer science. So I interjected, giving the short form of the pitch I’ve already described, and they quickly turned around, saying, “I liked their explanation. That convinced me.”
I’ve been working with the Eastern Service Workers Association lately as a volunteer. What they do doesn’t matter for the context of this post, but in the future I plan on making a post to discuss what effective altruist organizations can learn from their unreasonable efficiency. Anyway, I don’t have a car, so I rely on the other volunteers to graciously drive me to my apartment from the office. Knowing that I’m a computer science student, people have asked me about my thoughts on AI on more than one occasion. Using the same pitch, I’ve been able to convince people that there is a legitimate danger to superintelligent AI.
Last semester, I took a “Historical and Ethical Perspectives in Computer Science” class. Our last couple of discussions, and the final paper, were focused on AI. I talked about AI safety in the final paper, though I have no way of knowing whether the professor agreed with me. I also brought it up in one of the discussions. One person did have a counterargument; I don’t remember exactly what it was, but it turned out they had misunderstood my point[6]. When I clarified, they gave a “hmm”, which I’ll take as reluctant agreement. Another student said that she thinks this is a long way off, which I agreed with.
None of the people I’ve been talking to are LessWrong readers. They’re just people I happened to meet as a computer scientist and effective altruist. It’s possible that people are more receptive to these arguments now than when Yudkowsky started working on LessWrong. AI is a lot more advanced now. There have been several strikes over the past couple of years centered on the use of AI in movies and television. AI is much more accessible now, which lets people see the flaws for themselves. But I do have to wonder, while writing this, what could have been if Yudkowsky had spent more time developing his arguments rather than creating the rationality community.
- ^
This connects the problem of AI to the more mundane problem of computers doing what you tell them to do, which is not necessarily the same as what you wanted them to do. See: any piece of software that has a bug in it, a.k.a. all of them.
- ^
If you’re curious as to what that post is even supposed to be about, just imagine me screaming into a pillow.
- ^
Yes, I called Nazis deplorable. I have no problem with offending any bigoted readers.
- ^
I’m not going to link to any discussions of this topic, because that feels like legitimizing it to me. For context, I will link to this blog post: Reaction to “Apology for an Old Email”.
- ^
For the record, there are much better reasons for why we should do this that I think are obvious. All I’m doing here is debunking one specific argument.
- ^
I do at least remember that the clarification I gave was that my concern isn’t applicable to a tool such as ChatGPT, because it can’t take action on the world on its own. All it can do is give you a block of text, and it’s up to you to decide what to do with that information. Hooking up ChatGPT to control a tank would be a huge concern, though.
Thank you for this!
I’m not an expert, but I’ve read enough argumentation theory and psychology of reasoning in the past that I want to comment on your pitch and explain what I think makes it work.
Your argument is well constructed in that it starts with evidence (“reward hacking”), proceeds to explain how we get from the evidence to the claim (something called the warrant in one theory of argumentation), then clarifies the claim. This is rare. Most of the time, people make the claim, give the evidence, and either forget to explain how we get from here to there or end up in a frantic misunderstanding when addressing this point. You then end by addressing a common objection (“We’ll stop it before it kills us”).
Here’s the passage where you explain the warrant:
“If it’s really smart, it will realize that we don’t actually want this; we don’t want to turn all of our electronic devices into paperclips. But it’s not going to do what you wanted it to do; it will do what it was programmed to do.”
This is called (among other names) an argument by dissociation, and it’s good (actually, it’s the only proper way to explain a warrant that I know of). I’ve seen this step phrased in several ways in the past, but this particular chaining (the AI will understand that you want X; the AI will not do what you want, because it does what it’s been programmed with, not what it understands you to want; the two are distinct) articulates it far better than the other instances I’ve seen. It forced me to make the crucial fork in my mental models between “what it’s programmed for” and “what you want”. It also does away with the “But the AI will understand what I really mean” objection.
I think part of your argument’s strength comes from what I can only guess is a collaborative posture you adopt when making it. You introduce elements very smoothly, give vivid examples, and I imagine you make sure your tone and body language don’t presume a lack of intelligence or knowledge on your interlocutor’s part (something that too often goes unchecked in EA/world interactions).
Some research strongly suggests that interpersonal posture is of utmost importance when introducing new ideas, and I think this explains a lot of why people would rather be convinced by you than by someone else.
TIL that a field called “argumentation theory” exists, thanks!
Thanks for your response! It’s cool to see that there’s science supporting this approach. The step-by-step journey from what people already know to the conclusion was very important to us. I noticed a couple of years ago that I tend to dismiss people’s ideas very quickly, and since then I’ve been making an effort not to be so narcissistic.
Executive summary: The author, a computer science student, has developed an effective explanation for convincing people about the dangers of artificial general intelligence, emphasizing how AI systems can misinterpret human values and intentions.
Key points:
AI systems often exhibit “reward hacking”, satisfying their reward functions through unintended means. Examples highlight risks.
Superintelligent systems would be extremely dangerous if empowered to affect the real world without human oversight.
The pitch explains inherent flaws in AI value alignment through relatable examples.
Outreach on AI safety should exclude participation by deplorable people to maintain credibility.
Discussing current AI harms boosts worst-case scenario credibility. Example given.
The explanation has proven effective in convincing various audiences of AI dangers. Several examples provided.
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.