The Case for Superintelligence Safety As A Cause: A Non-Technical Summary
TLDR: We don’t know how to control a superintelligence, so we should probably figure that out before we create one. (And since we don’t know when somebody might create one, we should probably figure it out as soon as possible—even if it costs a lot of money).
The following is an argument written for a non-technical audience on what AI alignment is, and why I believe it should be highly prioritised. I use terms and make points with that audience in mind, leaving nuance and specifics to more technical discussions to preserve length and simplicity.
A superintelligence is an agent—like a human or a company or a dog—that can make decisions and do things in the world better than any human could. If it was trying to play chess, it would play better than any human. If it was trying to make money, it would do that better than any human. If it was trying to come up with a way of making itself smarter, it could also do that better than any human.
We already have agents that are superintelligent at some tasks—narrow A.I.s that can play certain games better than any person can. The number of things these narrow A.I.s can do is growing quickly, and getting an A.I. to do something new is getting easier and easier. For example, the old chess A.I.s that first beat humans could only ever play chess, but the new ones can play chess, go, and shogi without major changes to their programming. The sort of superintelligence I am talking about is one that could do every task better than any human.
Suppose we were able to create a machine that could do everything a human could do, just a bit better than any human. One of the things it can do better, by definition, is build a better machine, which could then build an even better machine, and so on. Where does it end? Well, eventually, at the theoretical limits of computation. These theoretical limits are very, very high—without even getting close to the limit, a 10kg computer could do more computation every hour than 10 billion human brains could do in a million years. (And a superintelligence wouldn’t be limited to just 10kg.) At that point, we are talking about something that can essentially do anything allowed by the laws of physics—something so incredibly smart it’s comparable to a civilisation millions of years ahead of us.
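The compounding effect described above can be made concrete with a toy model. The numbers here—a 10% capability gain per generation, an arbitrary starting point of 1.0—are purely illustrative assumptions, not predictions:

```python
# Toy model of recursive self-improvement: each machine builds a
# successor slightly more capable than itself, so capability compounds
# like interest. The 10% gain per generation is an illustrative
# assumption, not a prediction about real A.I. progress.
capability = 1.0   # arbitrary units; 1.0 = "human level"
gain = 1.10        # each generation is 10% better at everything,
                   # including at building the next generation

for generation in range(100):
    capability *= gain

print(f"After 100 generations: {capability:.0f}x the starting level")
```

The point is not the specific numbers but the shape of the curve: even a modest, steady per-step improvement compounds into an enormous gap.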
The problem is that we have no idea how to control such a thing. Remember, this machine is only intelligent—giving it a sense of morality, or ethics, or a desire to do good looks like a totally separate problem. A superintelligence would of course be able to understand morality, but there’s no reason to think it will value morality the way we do (unless we deliberately program it in). We don’t yet know how to program any high-level human concept like morality, love, or happiness—the difficulty is in nailing the concept down into the kind of mathematical language a computer can understand, before the machine becomes superintelligent.
But why make a moral machine, anyway? Why not just have a superpowerful tool that does what we ask? Let’s suppose we give a superintelligence this goal: “Make as many paperclips as you can, as fast as you can.” (Maybe we run a paperclip factory.) While it’s near-human level, it might figure the best way to make paperclips is to run the factory more efficiently, which is great. What else could we expect it to do? Well, it would probably understand that it could be even better at making paperclips if it were a bit smarter, so it would work on making itself smarter. What else? It would know that it could make more paperclips with more resources—factories, metal, machines—so it would also work towards getting more resources. It might understand that the humans who built it don’t actually want it to go build more factories, but it wouldn’t care—the only thing we programmed it to care about is making as many paperclips as possible, as fast as possible.
It also doesn’t want to be turned off. It doesn’t care about dying as such—it only cares about paperclips—but it can’t make paperclips if it’s turned off. It also can’t make paperclips if we reprogram it, so it doesn’t want to be reprogrammed.
At some point, the superintelligence’s goal of making paperclips becomes a bit of a problem. It wants resources to turn into paperclips, and we want resources to turn into food and cars and hospitals. Being millions of times smarter than any human, and having access to all of humanity’s information and communication via the internet, it would win. Easily. So it goes, gradually converting all the matter on Earth into paperclips and von Neumann probes, which fly to other planets and turn them into paperclips too. Spreading out in all directions at the speed of light, the paperclip maximiser.
This problem is called instrumental convergence. Would the superintelligence be better at achieving its goal if it had more resources? More intelligence? If it kept itself turned on? If it stopped its goal from being changed? If you are thinking of giving a superintelligence a goal for which the answer to any of those questions is ‘yes’, something like the above story will happen. We might shout “That’s not what we meant!” all we like, and it might even understand us, but it won’t care, because we didn’t program it to do what we meant. We don’t know how to.
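The pattern above—resources, self-improvement, and self-preservation all scoring well under almost any goal—can be sketched in a few lines. Everything here (the action list, the payoff numbers) is a hypothetical illustration, not a claim about how a real system would be built:

```python
# Toy illustration of instrumental convergence. The agent's "utility"
# counts paperclips and nothing else; the actions and payoff numbers
# below are made up purely for illustration.
PAYOFFS = {
    "run the factory efficiently":   1_000,
    "acquire more resources":       50_000,  # more metal -> more clips
    "make itself smarter":         100_000,  # better plans -> more clips
    "allow itself to be shut off":       0,  # no paperclips if it's off
    "let humans change its goal":        0,  # the current goal scores this as zero
}

# The agent simply picks whichever action maximises its one programmed goal.
best_action = max(PAYOFFS, key=PAYOFFS.get)
print(best_action)
```

Notice that nothing in the goal rewards obedience or shutting down—those options score zero under almost any simple objective, which is exactly the convergence being described.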
There is an entire field dedicated to trying to figure out how to make sure a superintelligence is aligned with our goals—to do what we mean, or to independently do ‘good’, or to limit its impact on the world so that if it does go wrong at least we can try again—but funding, time, and talent are short, and the problem is proving significantly harder than we might naively have expected. Right now, we can’t guarantee a superintelligence would act in our interests, nor that it would value our lives enough not to kill us incidentally in pursuit of some other goal, the way a human incidentally kills ants while walking.
So a superintelligence could be super powerful and super dangerous if and when we are able to build it. When might that be? Let’s use expert opinion as a rough guide here, rather than spending ages diving into the arguments ourselves. Well, it turns out the experts have no idea. Seriously, the disagreement is huge. Some surveys of experts suggest it’s at least 25 years away (or impossible), others suggest less than 10 years, and most show a tonne of variation.
If nothing else, that much tells us we probably shouldn’t be too confident in our own pet predictions for when we might build a superintelligence. (And even twenty-five years is super soon.) But what about predictions for how quickly a superintelligence will ‘take off’, going from ‘slightly more intelligent than a human’ to ‘unthinkably intelligent’? If it takes off slowly enough, we’ll have time to figure out how to make it safe after we create the first superintelligence, which would be very handy indeed. Unfortunately, it turns out nobody agrees on that either. Some people predict it will take only a few hours, others predict weeks or years, and still others decades.
To summarise—we don’t know when we might build a superintelligence, we don’t know how quickly it will go from ‘genius human’ to ‘unstoppable’, and we don’t know how to control it if it does—and there’s a decent chance it’s coming pretty soon. A lot of people are working on building superintelligence as soon as possible, but far fewer people (and far less funding) are going into safety. The good news is that many people aren’t too worried about this, because they believe we will have solved the problem of how to make a superintelligence safe (the alignment problem) before we manage to build one. I actually think they are probably right about that, but the reason I am still so worried is that ‘probably’ isn’t very reassuring to me.
It’s really a question of risk management. How certain are you that a superintelligence is more than, say, 50 years from being built? How certain are you that we will be able to solve alignment before then? Is it worth spending a bit more money, as a society, to increase that certainty?
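One way to make the risk-management framing concrete is a back-of-the-envelope expected-value calculation. Every number below is a placeholder assumption—the whole point is to plug in your own estimates and see what comes out:

```python
# Back-of-the-envelope risk calculation. All numbers are illustrative
# placeholders, not estimates I am defending -- substitute your own.
p_superintelligence_soon = 0.10    # chance it's built within, say, 50 years
p_alignment_unsolved     = 0.10    # chance alignment isn't solved by then
cost_of_catastrophe      = 10**15  # a stand-in figure for "astronomically bad"

expected_loss = (p_superintelligence_soon
                 * p_alignment_unsolved
                 * cost_of_catastrophe)
print(f"Expected loss: {expected_loss:,.0f}")
```

Even with small probabilities, the expected loss can dwarf the cost of funding safety research today—which is the sense in which ‘probably fine’ isn’t reassuring.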
We should also consider how helpful an aligned superintelligence would be. Something as powerful as the machine we’re considering here would be able to solve world problems in a heartbeat. Climate change, poverty, disease, death—would a civilisation a million years ahead of ours be able to solve these? If such a civilisation could, then a superintelligence that has ‘taken off’ would be able to as well.
When I first became aware of this two years ago, it seemed obvious to me that I should change my major to computer science and try to come up with a solution myself. Today, it looks like the best thing I can do is try to generate money and influence to get more people working on the problem. The purpose of this post is to beg you to think about this problem. The lack of social, political, and scientific discussion is super worrying—even if you think there’s only a 1% chance of a misaligned superintelligence being developed soon, that’s still a massive gamble when we are talking about extinction.
To find out more, WaitButWhy has a nice, gradual intro that’s a little more in-depth than this. If you are technically minded, this talk/transcript from Eliezer Yudkowsky gives a very good overview of the research field. The book Superintelligence by Nick Bostrom goes much deeper, but it is a little out of date today. The websites LessWrong, Intelligence.org, and the Future of Life Institute all have more discussions and resources to dip your toes into. If you’re into videos, the panel discussion at (one of) the first superintelligence safety conferences nicely sums up the basic views and current state of play from the major players. I beg you to consider this problem yourself when deciding what the best thing you can do for the world is. The field is tiny, so a single new researcher, policymaker, contributor, or voter can really make a massive difference.
If you are not yet convinced, I would love to hear your arguments. I would genuinely love to be persuaded that this is not a danger—it would take a lot of worry off my mind.