The Case for Superintelligence Safety As A Cause: A Non-Technical Summary
TLDR: We don’t know how to control a superintelligence, so we should probably figure that out before we create one. (And since we don’t know when somebody might create one, we should probably figure it out as soon as possible—even if it costs a lot of money).
The following is an argument, written for a non-technical audience, about what AI alignment is and why I believe it should be highly prioritised. I use terms and make points with that audience in mind, leaving nuance and specifics to more technical discussions for the sake of brevity and simplicity.
A superintelligence is an agent—like a human or a company or a dog—that can make decisions and do things in the world better than any human could. If it was trying to play chess, it would play better than any human. If it was trying to make money, it would do that better than any human. If it was trying to come up with a way of making itself smarter, it could also do that better than any human.
We already have agents that are superintelligent at some tasks—narrow A.I.s that can play certain games better than any person can. The number of things that these narrow A.I.s can do is growing pretty quickly, and getting an A.I. to do something new is getting easier and easier. For example, the old chess A.I.s that first beat humans could only ever play chess, but the new ones can play chess, go, and shogi without major changes to their programming. The sort of superintelligence I am talking about is one that could do every task better than any human.
Suppose we were able to create a machine that could do everything a human could do, just a bit better than any human. One of the things it can do better, by definition, is build a better machine, which could then build an even better machine, and so on. Where does it end? Well, eventually, at the theoretical limits of computation. These theoretical limits are very, very high—without even getting close to the limit, a 10kg computer could do more computation every hour than 10 billion human brains could do in a million years. (And a superintelligence wouldn't be limited to just 10kg.) At that point, we are talking about something that can essentially do anything that is allowed by the laws of physics—something so incredibly smart it's comparable to a civilisation millions of years ahead of us.
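If you want a feel for those numbers, here is a rough back-of-envelope check in Python. The figures are illustrative assumptions, not measurements: roughly 10^16 operations per second for a human brain, and Bremermann's limit of roughly 1.36×10^50 operations per second per kilogram as the theoretical physical ceiling.

```python
# Back-of-envelope comparison; both constants are rough, illustrative
# upper-bound estimates, not measured values.
BRAIN_OPS_PER_SEC = 1e16                  # rough estimate for one human brain
BREMERMANN_OPS_PER_SEC_PER_KG = 1.36e50   # theoretical physical ceiling
SECONDS_PER_YEAR = 3.15e7

# 10 billion brains working for a million years:
brains_total = 1e10 * 1e6 * SECONDS_PER_YEAR * BRAIN_OPS_PER_SEC

# A 10 kg computer at the theoretical limit, running for one hour:
computer_total = 10 * BREMERMANN_OPS_PER_SEC_PER_KG * 3600

print(computer_total > brains_total)   # True
print(computer_total / brains_total)   # roughly 1.5e15
```

Even with generous assumptions for the brains, the limit-bound computer comes out ahead by around fifteen orders of magnitude, so the exact figures chosen barely matter.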
The problem is that we have no idea how to control such a thing. Remember, this machine is only intelligent—giving it a sense of morality, or ethics, or a desire to do good looks like a totally separate problem. A superintelligence would of course be able to understand morality, but there's no reason to think it would value morality the way we do (unless we deliberately program that in). We don't yet know how to program any high-level human concept like morality, love, or happiness—the difficulty is in pinning the concept down in the kind of mathematical language a computer can understand, and doing so before it becomes superintelligent.
But why make a moral machine, anyway? Why not just have a superpowerful tool that just does what we ask? Let’s suppose we give a superintelligence this goal: “Make as many paperclips as you can, as fast as you can.” (Maybe we run a paperclip factory). While it’s near-human level, it might figure the best way to make paperclips is to run the factory more efficiently, which is great. What else could we expect it to do? Well, it would probably understand that it could be even better at making paperclips if it were a bit smarter, so it would work on making itself smarter. What else? It would know that it could make more paperclips with more resources—factories, metal, machines—so it would also work towards getting more resources. It might understand that the humans that built it don’t actually want it to go build more factories, but it wouldn’t care—the only thing we programmed it to care about is making as many paperclips as possible, as fast as possible.
It also doesn’t want to be turned off. It doesn’t care about dying, of course, it only cares about paperclips—but it can’t make paperclips if it’s turned off. It also can’t make paperclips if we reprogram it, so it doesn’t want to be reprogrammed.
At some point, the superintelligence’s goal of making paperclips becomes a bit of a problem. It wants resources to turn into paperclips, and we want resources to turn into food and cars and hospitals. Being millions of times smarter than any human, and having access to all of humanity’s information and communication via the internet, it would win. Easily. So it goes, gradually converting all the matter on Earth into paperclips and von Neumann probes, which fly to other planets and turn them into paperclips too. Spreading out in all directions at the speed of light, the paperclip maximiser.
The problem is Instrumental Convergence. Would the superintelligence be better at achieving its goal if it had more resources? More intelligence? Would it be better at achieving its goals if it keeps itself turned on? If it stops its goal from being changed? If you are thinking of giving the superintelligence a goal for which the answer is 'yes' to any of those questions, something like the above story will happen. We might shout all we like "That's not what we meant!", and it might understand us, but it doesn't care, because we didn't program it to do what we meant. We don't know how to.
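This pattern can be illustrated with a deliberately silly toy simulation (all of the numbers below are made up purely for illustration): whatever the agent's production goal is, the strategy "resist shutdown and acquire resources" scores higher on it than the naive strategy of just doing the task.

```python
import random

def run(resist_shutdown, gather_resources, steps=100):
    """Return units of goal-stuff produced by a simple toy agent."""
    resources, produced = 1.0, 0.0
    for _ in range(steps):
        # Each step there is a small chance the operators switch the
        # agent off, unless it resists shutdown.
        if not resist_shutdown and random.random() < 0.02:
            break
        if gather_resources:
            resources *= 1.05            # invest some effort in growth
            produced += resources * 0.5  # produce with the rest
        else:
            produced += resources        # produce at a flat rate
    return produced

random.seed(0)
trials = 1000
naive = sum(run(False, False) for _ in range(trials)) / trials
convergent = sum(run(True, True) for _ in range(trials)) / trials
print(convergent > naive)  # True: surviving and growing wins
```

The shutdown chance, growth rate, and split between growing and producing are arbitrary; the qualitative result holds for any settings in which resources help production and shutdown ends it.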
There is an entire field dedicated to trying to figure out how to make sure a superintelligence is aligned with our goals—to do what we mean, or to independently do 'good', or to limit its impact on the world so that if it does go wrong at least we can try again—but funding, time, and talent are short, and the problem is proving to be significantly harder than we might have naively expected. Right now, we can't guarantee a superintelligence would act in our interests, nor that it would value our lives enough not to incidentally kill us in pursuit of some other goal, the way a human incidentally kills ants while walking.
So a superintelligence could be super powerful and super dangerous if and when we are able to build it. When might that be? Let's use expert opinion as a bit of a guide here, rather than spending ages diving into the arguments ourselves. Well, it turns out the experts have no idea. Seriously, there's huge disagreement. Some surveys of experts predict it's at least 25 years away (or impossible), others predict it's less than 10 years away, and most show a tonne of variation.
If nothing else, that much tells us we probably shouldn't be too confident in our own pet predictions for when we might build a superintelligence. (And even twenty-five years is super soon). But what about predictions for how quickly a superintelligence will 'take off': going from 'slightly more intelligent than a human' to 'unthinkably intelligent'? If it takes off slowly enough, we'll have time to figure out how to make it safe after we create the first superintelligence, which would be very handy indeed. Unfortunately, it turns out nobody agrees on that either. Some people predict it will only take a few hours, others predict weeks or years, and still others decades.
To summarise—we don't know when we might build a superintelligence, we don't know how quickly it will go from 'genius human' to 'unstoppable', and we don't know how to control it if it does—and there's a decent chance it's coming pretty soon. A lot of people are working on building superintelligence as soon as possible, but far fewer people (and far less funding) are going into safety. The good news is that many people aren't too worried about this, because they believe that we will have solved the problem of how to make superintelligence safe (the alignment problem) before we manage to build one. I actually think they are probably right about that, but the reason I am still so worried is that 'probably' isn't very reassuring to me.
It’s really a question of risk management. How certain are you that a superintelligence is more than, say, 50 years from being built? How certain are you that we will be able to solve alignment before then? Is it worth spending a bit more money, as a society, to increase that certainty?
We should also consider how helpful an aligned superintelligence would be. Something as powerful as the machine we’re considering here would be able to solve world problems in a heartbeat. Climate change, poverty, disease, death—would a civilisation a million years ahead of ours be able to solve these? If such a civilisation could, then a superintelligence that has ‘taken off’ would be able to as well.
When I first became aware of this two years ago, it seemed obvious to me that I should change my major to computer science and try to come up with a solution myself. Today, it looks like the best thing for me to do is to try to generate money and influence to get more people working on the problem instead. The purpose of this post is to beg you to please think about this problem. The lack of social, political, and scientific discussion is super worrying—even if you only think there's a 1% chance of a bad superintelligence being developed soon, that's still a massive gamble when we are talking about extinction.
To find out more, WaitButWhy has a nice, gradual intro that's a little more in depth than this. If you are technically minded, this talk/transcript from Eliezer Yudkowsky gives a very good overview of the research field. The book Superintelligence by Nick Bostrom goes much more in depth, but it is a little out of date today. The websites LessWrong, Intelligence.org, and the Future of Life Institute all have more discussions and resources to dip your toes in. If you're into videos, the panel discussion at (one of) the first superintelligence safety conferences nicely sums up the basic views and the state of the field from the current major players. I beg you to consider this problem yourself when deciding what the best thing you can do for the world is. The field is tiny, so a single new researcher, policy maker, contributor, or voter can really make a massive difference.
If you are not yet convinced, I would love to hear your arguments. I would actually love to be convinced that it is not a danger—it would take so much worry off my mind.
Thank you for this nice summary of the argument in favour of AI Safety as a cause. I am not convinced, but I appreciate your write-up. As you asked for counterarguments, I'll try to describe some of my gripes with the AI Safety field. Some have to do with how there seems to be little awareness of results in adjacent fields, which makes me doubt whether any of it would stand up to scrutiny from people more knowledgeable in those areas. There are also a number of issues I have with the argument itself.
The theoretical limits of computation are upper bounds; we don't know if it is possible to achieve them for any kind of computation, let alone for general computation. Moreover, having a lot of computational power probably doesn't mean that you can calculate everything. A lot of real-world problems are hard to approximate, in the sense that adding more computational power doesn't meaningfully help you—for example, computing approximate Nash equilibria, or finding good layouts for microchip design. It is not clear that having a lot of computing power translates into relevant superior capabilities.
There is a growing literature on making algorithms fair, accountable and transparent. This is a collaborative effort between researchers in computer science, law and many other fields. There are so many similarities between this and the professed goals of the AI Safety community that it is strange that no cross-fertilization is happening.
You can’t just ask the AI to “be good”, because the whole problem is getting the AI to do what you mean instead of what you ask. But what if you asked the AI to “make itself smart”? On the one hand, instrumental convergence implies that the AI should make itself smart. On the other hand, the AI will misunderstand what you mean, hence not making itself smart. Can you point the way out of this seeming contradiction?
AI Safety would be a worthy cause if a superintelligence were powerful and dangerous enough to be an issue but not so powerful and dangerous as to be uncontrollable. A solution has to be necessary, but it also has to exist. Thus, there is a tension between scale and tractability here. Both Bostrom and Yudkowsky only ever address one thing at a time, never acknowledging this tension.
Most estimates on take-off speed start counting from the point that the AI is superintelligent. Why wait until then? A computer can be reset, so if you had a primitive AGI specimen you’d have unlimited tries to spot problems and make it behave.
I’d say that a 0.0001% chance of a superintelligence catastrophe is a huge over-estimate. Hence, AI Safety would be an ineffective cause area if you hold a person-affecting view. If you don’t, then at least this opens the way for the kind of counterarguments used against Pascal’s Mugging.
(Under the background assumptions already being made in the scenario where you can “ask things” to “the AI”:) If you try to tell the AI to be smart, but fail and instead give it some other goal (let’s call it being smart’), then in the process of becoming smart’ it will also try to become smart, because no matter what smart’ actually specifies, becoming smart will still be helpful for that. But if you want it to be good and mistakenly tell it to be good’, it’s unlikely that being good will be helpful for being good’.
Sorry for the delay on this reply. It’s been a very busy week.
Okay, so, to be clear—I am making the argument that superintelligence safety is an important area that is underfunded today, and you are arguing that extinction caused by superintelligence is so unlikely that it shouldn’t be a concern. Is that accurate?
With that in mind, I'll go through your points here one by one, and then attempt to address some of the arguments in your blog posts (though the first post was unavailable!).
I agree with you here. My reason for bringing this up in the main post was to show that superintelligence is possible under today's understanding of physics. Raw computation is not intelligence by itself, we agree, but rather one requirement for it. I was just pointing out that the computation that could be done in a small amount of matter is much larger than the computation that is done in the brain. (And that the brain's computation is in a pattern that we call general intelligence).
I didn't mention a lot of good research relevant to safety, and progress is certainly being made in many independent directions. I do agree, I would also like to see more of a crossover, though I really don't know how much the two areas are already working off each other's progress. I'd be surprised if it were zero. Regardless, if it were zero, that would show poor communication, rather than say anything about the concerns being wrong.
I mean, there’s no rule that a superintelligence has to misunderstand you. And there’s no certainty instrumental convergence is correct. (I wouldn’t risk my life on either statement!) It’s just that we think being smarter would help achieve most goals, so we probably should expect a superintelligence to try and make itself smarter.
The other part is that we just don't know how to guarantee that a superintelligence will do what we mean. (If you do know how to do this, that would be a huge relief). Even in your example of trying to get a superintelligence just to make itself smarter, I certainly wouldn't be confident it would do it in the way I expect—I have enough trouble predicting how my programs today will run. Suppose, for example, I'd written a utility function for 'smartness' that actually just measured total bits flipped. I might not realise until afterwards, which wouldn't be good.
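To make that concrete, here is a toy sketch in Python. The 'bits flipped' proxy and the candidate strategies are entirely hypothetical; the point is only that an optimiser for a badly specified proxy can diverge from what we meant.

```python
# A hypothetical, badly specified utility for "smartness":
# reward the total number of bits a candidate strategy flips.

def bits_flipped_proxy(strategy):
    """The mis-specified utility function: raw bit flips."""
    return strategy["bit_flips"]

# Two hypothetical self-modification strategies the agent could pick.
candidates = [
    {"name": "improve reasoning", "bit_flips": 10_000,
     "actually_smarter": True},
    {"name": "flip bits in a loop", "bit_flips": 10**12,
     "actually_smarter": False},
]

# Optimising the proxy picks the useless bit-churner, not the
# strategy we actually meant by "make yourself smarter".
best = max(candidates, key=bits_flipped_proxy)
print(best["name"])              # flip bits in a loop
print(best["actually_smarter"])  # False
```

Nothing hinges on the specific numbers; any proxy that can be pumped more cheaply than the intended concept will be pumped.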
I might be misunderstanding you here. Are you arguing that because superintelligence does not yet exist, it is not yet worthwhile to work on safety? Or are you arguing that we can’t be confident that a solution to alignment will work without a superintelligence to test it on?
If it’s the first, I would argue that there’s a major risk that we won’t find a solution in the period of time between creating a superintelligence, and the superintelligence having enough power to be a big problem. Unless I was super confident this time period would be very large, wouldn’t it make more sense to try and find a solution as early as possible?
I'd also argue that finding a solution early would mean it could be worked into the design of a superintelligence from the start, rather than just relying on the class of solutions that would fit something that's already been built.
If it’s the second, I agree—it would be a much easier problem to solve if we had a ‘mini’-superintelligence to practice on, for sure. Figuring out how to do this is a part of safety research! How can we limit a superintelligence’s capabilities so it stays in this state? How can we predict what will happen as we increase a weak superintelligence to a strong superintelligence? We still need to figure out how to do that as well, hence my call for research funding.
I am not sure this is true, I’ve always read takeoff speed estimates as counting from the moment of human-level general intelligence—though I know many people imagine a human-level AGI as having access to current narrow superintelligence (as in, max[human, current computer] abilities at each task). Maybe that’s it.
Regardless, as above, I hope we get that chance, though from the little research that has been done it looks like this might not be as safe as it sounds. We would have to be very, very good at determining the capability of an AGI, be confident that no other project is moving forward faster than us, and be confident that the behaviour will remain the same as intelligence increases—which might be the trickiest one. For example, a near-human AGI might be able to predict that doing what humans want early on would make it more likely to achieve its goal later on, no matter what the goal actually is. So we wouldn't have avoided catastrophe, only added an instrumental goal of 'behaving the way humans want me to until I have enough power to disregard them without being shut down'. Still, this is an open area of research and I hope it gets more funding and attention.
I'll get into your arguments for that figure below, but I want to clarify here that my estimate of superintelligence being built this century is in the double digits, percentage-wise, and that if it's built before we solve alignment it is almost certain to be dangerous. I'm not relying on very low probabilities of drastic outcomes, so Pascal's Mugging doesn't apply.
Onwards to some limited responses to your blog posts. I wasn't entirely sure if I understood your argument properly, so I'm going to try and list the main points here and see if you agree.
1. You argue that if the probability of an AI-related extinction event were large, and if a single AI-related extinction event could affect any lifeform in the universe ever, one should have already happened somewhere and we shouldn’t exist.
2. You argue that current safety research is ineffective—we’d be able to work more effectively and cheaply if we waited until we were closer to developing superintelligence.
3. You believe that if a superintelligence was going to be built in the near future, and if it was going to be dangerous, it would probably result in a smaller scale catastrophe that would give us plenty of warning that a bigger catastrophe was coming.
4. You believe that there are numerous psychological reasons people are inclined to believe superintelligence is likely and dangerous, and you increase your skepticism of the claims because of that.
5. You argue that left to its own devices, regular commercial or academic research will be able to solve the problem.
If there’s a major point I’ve missed here, or if I’ve phrased these badly, do correct me! Anyway, let’s go through them.
If the probability of broadcasting radio into space were large, we should have already detected alien radio. (Since radio would also spread at the speed of light in all directions, and be distinct from natural events). I don’t believe this is strong evidence against the hypothesis that superintelligence (or radio) is possible and dangerous, though I suppose it’s evidence that there are no other advanced civilisations within our past light cone.
It is hard to say how effective current safety research is, for sure. If anything, the limited progress should make us think this problem is very hard and make us way less confident about being able to solve it in a short period of time in the future. Particularly since some aspects of safety get harder to implement the longer we wait—building culture and institutions that consider the issue when setting up their AGI projects, for instance.
If the time period between a small scale catastrophe and a large one is small, we shouldn’t be confident that we can solve safety in time—especially if you are right about a small scale catastrophe being evidence we are nearing superintelligence.
Additionally, if there exist large scale failure modes that are wholly different to any small scale failure mode, we shouldn’t expect learning from small scale catastrophes to help us prevent larger ones.
Alternatively, we might even make large scale failures harder to detect by patching small failures—for example, we might think we’ve prevented a superintelligence from trying to escape onto the internet, but we’ve really just made escaping so hard that only a strong superintelligence could manage it.
Humanity's general lack of concern about climate change or nuclear weapons (prior to them being created / caused) would indicate to me that the psychological trends go in the other direction, at least for most people. Regardless, I would certainly agree with being really skeptical about extraordinary claims.
I would argue that it’s an extraordinary claim both ways. Either superintelligence is not that hard to build, or there is something so incredibly complicated and special about biological general intelligence that even with billions of dollars of funding per year for a hundred years, we won’t manage to replicate it—even as we replicate other aspects of biological intelligence (like vision, or motor control).
You might argue, fairly, that this is more likely, but do you really believe it is billions of times as likely?
I'm not sure if your main disagreement is with superintelligence being built at all, or with it being dangerous, so let's look at that quickly too. If we are skeptical of superintelligence being dangerous because the claim seems extraordinary, we should also be skeptical of the extraordinary claim that a superintelligence would be safe and good by default. (If it is not safe by default, then we have already discovered how difficult it is to specify safe behaviour).
I really hope so.
Commercially, building a superintelligence (or rather, every step towards superintelligence) would be extremely profitable. But since safety research would take some of your best minds away from building it, the incentives are in the wrong direction. Whoever spends the least on safety has the largest proportion of their resources to spend on development.
As far as regular academic research goes, it’s more hopeful, but the number of people working on safety in traditional academia is very very low. How confident can we be that this low output would be enough to solve the problem prior to building a superintelligence—especially given how difficult we’ve found it to be so far—and considering how many ambitious researchers are working on building a superintelligence as soon as possible? Perhaps money could be best spent persuading those researchers to consider safety, I don’t know.
To conclude, I want to lay out what would change my mind:
If progress on computer hardware and software seemed very likely to halt (or slow dramatically) in the near future.
If our current understanding of neuroscience turned out to be wrong, and we could show that simulating general purpose computation required far more computation than the brain’s cells do—perhaps the brain uses hard-to-compute actions on the level of atoms or smaller, rather than something that could be done in abstract models of cells.
If somebody was able to disprove (or provide very strong evidence against) the orthogonality thesis and instrumental convergence thesis.
If no project was working on building superintelligence.
Otherwise, it seems very much like we could have the capability of simulating and optimising a general intelligence in the near future, and that this could be very dangerous.
Let me try to rephrase this part, as I consider it to be the main part of my argument and it doesn’t look like I managed to convey what I intended to:
The most popular cause evaluation framework within EA seems to be Importance/Neglectedness/Tractability. AI Safety enthusiasts tell a convincing story about importance and neglectedness being good, and make an effort at arguing that tractability is as well.
But here is the thing: all arguments given in favour of AI being risky (to establish importance) can be rephrased as arguments against tractability. Similarly for neglectedness.
I'll illustrate this with a caricature, but it takes little effort to transfer this line of thought to the real arguments being made. Let's say the pro-AIS argument is "AGI will become infinitely smart, so it can out-think all humans and avoid all our security measures. Hence AGI is likely to escape any restrictions we put on it, so it will be able to tile the universe with paperclips if it wants to". Obviously, if it can outsmart any security measure, then no sufficient security measure exists, AI Safety research will never lead to anything, and the problem is intractable.
AI Safety is only effective if you can simultaneously argue for each of importance/neglectedness/tractability without detracting from the others. Moreover, your arguments have to address the exact same scenarios. It is not enough for AIS to be important with 50% probability and tractable with 50% probability; these two properties have to be likely to hold simultaneously. A coin flip has a 50% probability of heads and a 50% probability of tails, but they will never happen at the same time.
AI Safety can only be an effective cause (on the margin) if solving it is possible (tractability) but not trivial (importance/neglectedness). I think this is a narrow window to hit, and current arguments are all way off-target.
Ah, thanks for rephrasing that. To make sure I’ve got this right—there’s a window between something being ‘easy to solve’ and ‘impossible to solve’ that a cause has to exist in to be worth funding. If it were ‘easy to solve’ it would be solved in the natural course of things, but if it were ‘impossible to solve’ there’s no point working on it.
When I argue that AGI safety won’t be solved in the normal course of AGI research, that is an argument that pushes it towards the ‘impossible’ side of the tractability scale. We agree up to this point, I think.
If I’ve got that right, then if I could show that it would be possible to solve AGI safety with increased funding, you would agree that it’s a worthy cause area? I suppose we should go through all the literature and judge for ourselves if progress is being made in the field. That might be a bit of a task to do here, though.
For the sake of argument, let's say technical alignment is a totally intractable problem. What then? Give up and let extinction happen? If the problem does turn out to be impossible to solve, then no other cause area matters either, because everybody is dead. If the problem is solvable, and we build a superintelligence, then still no other cause area matters, because a superintelligence would be able to solve those problems.
This is kind of why I expected your argument to be about whether a superintelligence will be built, and when. Or about why you think that safety is a more trivial problem than I do. If you’re arguing the other way—that safety is an impossible problem—then wouldn’t you instead argue for stopping it being built in the first place?
I don't know how tractable technical alignment will turn out to be. There has been some progress, but my main takeaway has been "We've discovered X, Y, and Z won't work." If there is still no solution as we get closer to AGI being developed, then at least we'll be able to point to that failure to try and slow down dangerous projects. Maybe the only safe solution will be to somehow increase human intelligence, rather than creating an independent AGI at all, I don't know.
On the other hand, it might be totally solvable. It’s theoretical research, we don’t know until it’s done. If it is easily solved, then the problem becomes making sure that all AGI projects implement the solution, which would still be an effective cause. In either case, marginal increases in funding wouldn’t be wasted.
Thank you for your response.
Yes, that is what I meant. If you could convince me that AGI Safety were solvable with increased funding, and only solvable with increased funding, that would go a long way in convincing me of it being an effective cause.
In response to your question of giving up: If AGI were a long way off from being built, then helping others now is still a useful thing to do, no matter if either of the scenarios you describe were to happen. Sure, extinction would be bad, but at least from some person-affecting viewpoints I’d say extinction is not worse than existing animal agriculture.
Thanks for your response! I just wanted to let you know I’m taking the time to read your links and write out a well thought out reply, which might take another evening or two.
Do you still plan to publish a reply at some point?
Yes, apologies for the delay, it’s been a hectic week! Will hopefully post tomorrow.