What does it mean for an AGI to be ‘safe’?
(Note: This post is probably old news for most readers here, but I find myself repeating this surprisingly often in conversation, so I decided to turn it into a post.)
I don’t usually go around saying that I care about AI “safety”. I go around saying that I care about “alignment” (although that word is slowly sliding backwards on the semantic treadmill, and I may need a new one soon).
But people often describe me as an “AI safety” researcher to others. This seems like a mistake to me, since it’s treating one part of the problem (making an AGI “safe”) as though it were the whole problem, and since “AI safety” is often misunderstood as meaning “we win if we can build a useless-but-safe AGI”, or “safety means never having to take on any risks”.
Following Eliezer, I think of an AGI as “safe” if deploying it carries no more than a 50% chance of killing more than a billion people:
When I say that alignment is difficult, I mean that in practice, using the techniques we actually have, “please don’t disassemble literally everyone with probability roughly 1” is an overly large ask that we are not on course to get. [...] Practically all of the difficulty is in getting to “less than certainty of killing literally everyone”. Trolley problems are not an interesting subproblem in all of this; if there are any survivors, you solved alignment. At this point, I no longer care how it works, I don’t care how you got there, I am cause-agnostic about whatever methodology you used, all I am looking at is prospective results, all I want is that we have justifiable cause to believe of a pivotally useful AGI ‘this will not kill literally everyone’.
Notably absent from this definition is any notion of “certainty” or “proof”. I doubt we’re going to be able to prove much about the relevant AI systems, and pushing for proofs does not seem to me to be a particularly fruitful approach (and never has; the idea that this was a key part of MIRI’s strategy is a common misconception about MIRI).
On my models, making an AGI “safe” in this sense is a bit like the situation with probabilistic circuits: if some circuit gives you the right answer with 51% probability, then it’s probably not that hard to drive the success probability significantly higher than that.
If anyone can deploy an AGI that is less than 50% likely to kill more than a billion people, then they’ve probably… well, they’ve probably found a way to keep their AGI weak enough that it isn’t very useful. But if they can do that with an AGI capable of ending the acute risk period, then they’ve probably solved most of the alignment problem. Meaning that it should be easy to drive the probability of disaster dramatically lower.
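For intuition, the amplification step in the circuit analogy is just textbook majority voting; here is a minimal sketch under an independence assumption (illustrative of the analogy only, not a claim about AGI):

```python
from statistics import NormalDist

def majority_success(p: float, n: int) -> float:
    """Probability that a majority vote over n independent copies of a
    circuit, each correct with probability p, gives the right answer
    (normal approximation to the Binomial(n, p) upper tail)."""
    mu = n * p
    sigma = (n * p * (1 - p)) ** 0.5
    return 1 - NormalDist(mu, sigma).cdf(n / 2)

# A barely-better-than-chance circuit becomes highly reliable
# once you vote over enough independent copies:
for n in (1, 101, 10001, 100001):
    print(n, majority_success(0.51, n))
```

The analogy in the post is that getting from "certain doom" to "51% okay" is where nearly all the difficulty lives; once you're past that point, driving the failure probability down is comparatively routine.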
The condition that the AI actually be useful for pivotal acts is an important one. We can already build AI systems that are “safe” in the sense that they won’t destroy the world. The hard part is creating a system that is safe and relevant.
Another concern with the term “safety” (in anything like the colloquial sense) is that the sort of people who use it often endorse the “precautionary principle” or other such nonsense that advocates never taking on risks even when the benefits clearly dominate.
In ordinary engineering, we recognize that safety isn’t infinitely more important than everything else. The goal here is not “prevent all harms from AI”, the goal here is “let’s use AI to produce long-term near-optimal outcomes (without slaughtering literally everybody as a side-effect)”.
Currently, what I expect to happen is that humanity destroys itself with misaligned AGI. And I think we’re nowhere near knowing how to avoid that outcome. So the threat of “unsafe” AI indeed looms extremely large—if anything, “unsafe” rather understates the point!—and I endorse researchers doing less capabilities work and publishing less, in the hope that this gives humanity enough time to figure out how to do alignment before it’s too late.
But I view this strategic situation as part of the larger project “cause AI to produce optimal long-term outcomes”. I continue to think it’s critically important for humanity to build superintelligences eventually, because whether or not the vast resources of the universe are put towards something wonderful depends on the quality and quantity of cognition that is put to this task.
If using the label “AI safety” for this problem causes us to confuse a proxy goal (“safety”) for the actual goal “things go great in the long run”, then we should ditch the label. And likewise, we should ditch the term if it causes researchers to mistake a hard problem (“build an AGI that can safely end the acute risk period and give humanity breathing-room to make things go great in the long run”) for a far easier one (“build a safe-but-useless AI that I can argue counts as an ‘AGI’”).
I don’t think it’s a good plan to build an AI that enacts some pivotal act ensuring that nobody ever builds a misaligned AGI. See Critch here and here. When I think about building AI that is safe, I think about multiple layers of safety including monitoring, robustness, alignment, and deployment. Safety is not a single system that doesn’t destroy the world; it’s an ongoing process that prevents bad outcomes. See Hendrycks here and here.
My reply to Critch is here, and Eliezer’s is here and here.
I’d also point to Scott Alexander’s comment, Nate’s “Don’t leave your fingerprints on the future”, and my:
What, concretely, do you think humanity should do as an alternative to “build an AI that enacts some pivotal act ensuring that nobody ever builds a misaligned AGI”? If you aren’t sure, then what’s an example of an approach that seems relatively promising to you? What’s a concrete scenario where you imagine things going well in the long run?
To sharpen the question: Eventually, as compute becomes more available and AGI techniques become more efficient, we should expect that individual consumers will be able to train an AGI that destroys the world using the amount of compute on a mass-marketed personal computer. (If the world wasn’t already destroyed before that.) What’s the likeliest way you expect this outcome to be prevented, or (if you don’t think it ought to be prevented, or don’t think it’s preventable) the likeliest way you expect things to go well if this outcome isn’t prevented?
(If your answer is “I think this will never happen no matter how far human technology advances” and “in particular, the probability seems low enough to me that we should just write off those worlds and be willing to die in them, in exchange for better focusing on the more-likely world where [scenario] is true instead”, then I’d encourage saying that explicitly.)
At that level of abstraction, I’d agree! Dan defines robustness as “create models that are resilient to adversaries, unusual situations, and Black Swan events”, monitoring as “detect malicious use, monitor predictions, and discover unexpected model functionality”, alignment as “build models that represent and safely optimize hard-to-specify human values”, and systemic safety as “use ML to address broader risks to how ML systems are handled, such as cyberattacks”. All of those seem required for a successful AGI-mediated pivotal act.
If this description is meant to point at a specific alternative approach, or meant to exclude pivotal acts in some way, then I’m not sure what you have in mind.
I agree on both fronts. Not destroying the world is insufficient (you need to save the world; we already know how to build AI systems that don’t destroy the world), and a pivotal act fails if it merely delays doom, rather than indefinitely putting a pause on AGI proliferation (an “ongoing process”, albeit one initiated by a fast discrete action to ensure no one destroys the world tomorrow).
But I think you mean to gesture at some class of scenarios where the “ongoing process” doesn’t begin with a sudden discrete phase shift, and more broadly where no single actor ever uses AI to do anything sudden and important in the future. What’s a high-level description of how this might realistically play out?
You linked to the same Hendrycks paper twice; is there another one you wanted to point at? And, is there a particular part of the paper(s) you especially wanted to highlight?
Thanks for the thoughtful response. My original comment was simply to note that some people disagree with the pivotal act framing, but it didn’t really offer an alternative and I’d like to engage with the problem more.
I think we have a few worldview differences that drive disagreement on how to limit AI risk given solutions to technical alignment challenges. Maybe you’d agree with me in some of these places, but a few candidates:
Stronger AI can protect us against weaker AI. When you imagine a world where anybody can train an AGI at home, you conclude that anybody will be able to destroy the world from home. I would expect that governments and corporations will maintain a sizable lead over individuals, meaning that individuals cannot take over the world. They wouldn’t necessarily need to preempt the creation of an AGI; they could simply contain it afterwards, by denying it access to resources and exposing its plans for world destruction. This is especially true in worlds where intelligence alone cannot take over the world, and taking over instead requires resources or cooperation between entities, as argued in Section C of Katja Grace’s recent post. I could see some of these proposals overlapping with your definition of a pivotal act, though I have more of a preference for multilateral and government action.
Government AI policy can be competent. Our nuclear non-proliferation regime is strong: only nine countries have nuclear weapons. Gain-of-function research is a strong counterexample, but the Biden administration’s export controls on selling advanced semiconductors to China for national security purposes again support the idea of government competence. Strong government action seems possible given either (a) significant AI warning shots or (b) convincing mainstream ML and policy leaders of the danger of AI risk. When Critch suggested that governments build weapons to monitor and disable rogue AGI projects, Eliezer said it’s not realistic but would be incredible if accomplished. Those are the kinds of proposals I’d want to popularize early.
I have longer timelines, expect a more distributed takeoff, and have a more optimistic view of the chances of human survival than I expect you do. My plan for preventing AI x-risk is to solve the technical problems, and to convince influential people in ML and policy that the solutions must be implemented. They can then build aligned AI, and employ measures like compute controls and monitoring of large projects to ensure widespread implementation. If it turns out that my worldview is wrong and an AI lab invents a single AGI that could destroy the world relatively soon, I’d be much more open to dramatic pivotal acts that I’m not excited about in my mainline scenario.
Three more targeted replies to your comments:
Your proposed pivotal act in your reply to Critch seems much more reasonable to me than “burn all GPUs”. I’m still fuzzy on the details of how you would uncover all potential AGI projects before they get dangerous, and what you would do to stop them. Perhaps more crucially, I wouldn’t be confident that we’ll have AI that can run whole brain emulation of humans before we have AI that brings x-risk, because WBE would likely require experimental evidence from human brains that early advanced AI will not have.
I strongly agree with the need for more honest discussions about pivotal acts / how to make AI safe. I’m very concerned by the fact that people have opinions they wouldn’t share, even within the AI safety community. One benefit of more open discussion could be reduced stigma around the term — my negative association comes from the framing of a single dramatic action that forever ensures our safety, perhaps via coercion. “Burn all GPUs” exemplifies these failure modes, but I might be more open to alternatives.
I really like “don’t leave your fingerprints on the future.” If more dramatic pivotal acts are necessary, I’d endorse that mindset.
This was interesting to think about and I’d be curious to answer any other questions. In particular, I’m trying to think about how to ensure ongoing safety in Ajeya’s HFDT world. The challenge is implementation, assuming somebody has solved deceptive alignment using e.g. interpretability, adversarial training, or training strategies that exploit inductive biases. Generally speaking, I think you’d have to convince the heads of Google, Facebook, and other organizations that can build AGI that these safety procedures are technically necessary. This is a tall order but not impossible. Once the leading groups are all building aligned AGIs, maybe you can promote ongoing safety either with normal policy (e.g. compute controls) or AI-assisted monitoring (your proposal or Critch’s EMPs). I’d like to think about this more but have to run.
Unfortunately, people (and this includes AI researchers) tend to hear what they want to hear, and not what they don’t want to hear. What to call this field is extremely dependent on the nature of those misinterpretations. And the biggest misinterpretation right now does not appear to be “oh so I guess we need to build impotent systems because they’ll be safe”.
“Alignment” is already broken, in my view. You allude to this, but I want to underscore it. InstructGPT was billed as “alignment”. Maybe it is, but it doesn’t seem to do any good for reducing x-risk.
“Safety”, too, lends itself to misinterpretation. Sometimes of the form “ok, so let’s make the self-driving cars not crash”. So you’re not starting from an ideal place. But at least you’re starting from a place of AI systems behaving badly in ways you didn’t intend and causing harm. From there, it’s easier to explain existential safety as simply an extreme safety hazard, and one that’s not even unlikely.
If you tell people “produce long-term near-optimal outcomes” and they are EAs or rationalists, they probably understand what you mean. If they are random AI researchers, this is so vague as to be completely meaningless. They will fill it in with whatever they want. The ones who think this means full steam ahead toward techno-utopia will think that. The ones who think this means making AI systems not misclassify images in racist ways will think that. The ones who think it means making AI systems output fake explanations for their reasoning will think that.
Everyone wants to make AI produce good outcomes. And you do not need to convince the vast majority of researchers to work on AI capabilities. They just do it anyway. Many of them don’t even do it for ideological reasons, they do it because it’s cool!
The differential thing we need to be pushing on is AI not creating an existential catastrophe. In public messaging (and what is a name except public messaging?) we do not need to distract with other considerations at this present moment. And right now, I don’t think we have a better term than safety that points in that direction.
Is this 50% from the point of view of some hypothetical person who knows as much as is practical about this AGI’s consequences, or from your point of view, or something else?
Do you imagine that deploying two such AGIs in parallel universes with some minor random differences has only a 25% chance of them both killing more than a billion people?
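To make the question concrete, here is a toy mixture model of my own framing (not anything from the post): split the 50% into a “shared” component, fixed across both universes (epistemic uncertainty about this AGI design), and an “independent” per-deployment chance component.

```python
def single_failure(p_shared: float, p_indep: float) -> float:
    """Risk of one deployment: the shared (epistemic) component fails,
    or it holds and the independent (chance) component fails."""
    return p_shared + (1 - p_shared) * p_indep

def both_fail(p_shared: float, p_indep: float) -> float:
    """Risk that two parallel-universe deployments both fail: the shared
    component is common to both, the chance components are independent."""
    return p_shared + (1 - p_shared) * p_indep ** 2

# Purely independent risk: 50% each deployment, 25% for both.
print(both_fail(0.0, 0.5))   # 0.25
# Purely shared (epistemic) risk: both universes fail together, 50%.
print(both_fail(0.5, 0.0))   # 0.5
```

Only in the purely independent case does the 25% figure follow; if most of the 50% is epistemic, the two deployments fail or succeed together.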
(After writing this I thought of one example where the goals are in conflict: permanent surveillance that stops the development of advanced AI systems. Thought I’d still post this in case others have similar thoughts. Would also be interested in hearing other examples.)
I’m assuming a reasonable interpretation of the proxy goal of safety means roughly this: “be reasonably sure that we can prevent AI systems we expect to be built from causing harm”. Is this a good interpretation? If so, when is this proxy goal in conflict with the goal of having “things go great in the long run”?
I agree that it’s epistemically good for people to not confuse proxy goals with goals, but in practice I have trouble thinking of situations where these two are in conflict. If we’ve ever succeeded in the first goal, it seems like making progress in the second goal should be much easier, and at that point it would make more sense to advocate using-AI-to-bring-a-good-future-ism.
Focusing on the proxy goal of AI safety seems also good for the reason that it makes sense across many moral views, while people are going to have different thoughts on what it means for things to “go great in the long run”. Fleshing out those disagreements is important, but I would think there’s time to do that when we’re in a period of lower existential risk.
Economic degrowth/stagnation is another example of something that prevents AI doom but will be very bad to have in the long run.
I like the term AGI x-safety, to get across the fact that you are talking about safety from existential (x) catastrophe, and sophisticated AI. “AI Safety” can be conflated with more mundane risks from AI (e.g. isolated incidents with robots, self-driving car crashes etc). And “AI Alignment” is only part of the problem. Governance is also required to implement aligned AI and prevent unaligned AI.
A lot of people misunderstand “existential risk” as meaning something like “extinction risk”, rather than as meaning ‘anything that would make the future go way worse than it optimally could have’. Tacking on “safety” might contribute to that impression; we’re still making it sound like the goal is just to prevent bad things (be it at the level of an individual AGI project, or at the level of the world), leaving out the “cause good things” part. That said, “existential safety” seems better than “safety” to me.
(Nate’s thoughts here.)
I don’t know what you mean by “governance”. The EA Forum wiki currently defines it as:
… which makes it sound like governance ignores plans like “just build a really good company and save the world”. If I had to guess, I’d guess that the world is likeliest to be saved because an adequate organization existed, with excellent internal norms, policies, talent, and insight. Shaping external incentives, regulations, etc. can help on the margin, but it’s a sufficiently blunt instrument that it can’t carve orgs into the exact right shape required for the problem structure.
It’s possible that the adequate organization is a government, but this seems less likely to me given the small absolute number of exceptionally competent governments in history, and given that govs seem to play little role in ML progress today.
Open Phil’s definition is a bit different:
Open Phil goes out of its way to say “not just governments”, but its list (“norms, policies, laws, processes, politics, and institutions”) still makes it sound like the problem is shaped more like ‘design a nuclear nonproliferation treaty’ and less like ‘figure out how to build an adequate organization’, ‘cause there to exist such an organization’, or the various activities involved in actually running such an organization and steering it to an existential success.
Both sorts of activities seem useful to me, but dividing the problem into “alignment” and “governance” seems weird to me on the above framings—like we’re going out of our way to cut skew to reality.
On my model, the proliferation of AGI tech destroys the world, as a very strong default. We need some way to prevent this proliferation, even though AGI is easily-copied software. The strategies seem to be:
Using current tech, limit the proliferation of AGI indefinitely. (E.g., by creating a stable, long-term global ban on GPUs outside of some centralized AGI collaboration, paired with pervasive global monitoring and enforcement.)
Use early AGI tech to limit AGI proliferation.
Develop other, non-AGI tech (e.g., whole-brain emulation and/or nanotech) and use it to limit AGI proliferation.
1 sounds the most like “AGI governance” to my ear, and seems impossible to me, though there might be more modest ways to improve coordination and slow progress (thus, e.g., buying a little more time for researchers to figure out how to do 2 or 3). 2 and 3 both seem promising to me, and seem more like tech that could enable a long (or short) reflection, since e.g. they could also help ensure that humanity never blows itself up with other technologies, such as bio-weapons.
Within 2, it seems to me that there are three direct inputs to things going well:
Target selection: You’ve chosen a specific set of tasks for the AGI that will somehow (paired with a specific set of human actions) result in AGI nonproliferation.
Capabilities: The AGI is powerful enough to succeed in the target task. (E.g., if the best way to save the world is by building fast-running WBE, you have AGI capable enough to do that.)
Alignment: You are able to reliably direct the AGI’s cognition at that specific target, without any catastrophic side-effects.
There’s then a larger pool of enabling work that helps with one or more of those inputs: figuring out what sorts of organizations to build; building and running those organizations; recruiting, networking, propagating information; prioritizing and allocating resources; understanding key features of the world at large, like tech forecasting, social dynamics, and the current makeup of the field; etc.
“In addition to alignment, you also need to figure out target selection, capabilities, and (list of enabling activities)” seems clear to me. And also, you might be able to side-step alignment if 3 (or 1) is viable. “Moreover, you need a way to hand back the steering wheel and hand things off to a reasonable decision-making process” seems clear to me as well. “In addition to alignment, you also need governance” is a more opaque-to-me statement, so I’d want to hear more concrete details about what that means before saying “yeah, of course you need governance too”.
This is along the lines of what I’m thinking when I say AGI Governance. The scenario outlined by the winner of FLI’s World Building Contest is an optimistic vision of this.
This sounds like something to be done unilaterally, as per the ‘pivotal act’ that MIRI folk talk about. To me it seems like such a thing is pretty much as impossible as safely fully aligning an AGI, so working towards doing it unilaterally seems pretty dangerous. Not least for its potential role in exacerbating race dynamics. Maybe the world will be ended by a hubristic team who are convinced that their AGI is safe enough to perform such a pivotal act, and that they need to run it because another team is very close to unleashing their potentially world-ending unaligned AGI. Or by another team seeing all the GPUs starting to melt and pressing go on their (still-not-fully-aligned) AGI… It’s like MAD, but for well-intentioned would-be world-savers.
I think your view of AGI governance is idiosyncratic because of thinking in such unilateralist terms. Maybe it could be a move that leads to (the world) winning, but I think that even though effective broad global-scale governance of AGI might seem insurmountable, it’s a better shot. See also aogara’s comment and its links.
Perhaps ASI x-safety would be even better though (the SI being SuperIntelligent), if people are thinking “we win if we can build a useless-but-safe AGI”.
I’d guess not. From my perspective, humanity’s bottleneck is almost entirely that we’re clueless about alignment. If a meme adds muddle and misunderstanding, then it will be harder to get a critical mass of researchers who are extremely reasonable about alignment, and therefore harder to solve the problem.
It’s hard for muddle and misinformation to spread in exactly the right way to offset those costs; and attempting to strategically sow misinformation will tend to erode our ability to think well and to trust each other.
I’m not sure I get your point here. Surely the terms “AI Safety” and “AI Alignment” are already causing muddle and misunderstanding? I’m saying we should be more specific in our naming of the problem.
“ASI x-safety” might be a better term for other reasons (though Nate objects to it here), but by default, I don’t think we should be influenced in our terminology decisions by ‘term T will cause some alignment researchers to have falser beliefs and pursue dumb-but-harmless strategies, and maybe this will be good’. (Or, by default this should be a reason not to adopt terminology.)
Whether current terms cause muddle and misunderstanding doesn’t change my view on this. In that case, IMO we should consider changing to a new term in order to reduce muddle and misunderstanding. We shouldn’t strategically confuse and mislead people in a new direction, just because we accidentally confused or misled people in the past.
What are some better options? Or, what are your current favourites?
“AGI existential safety” seems like the most popular relatively-unambiguous term for “making the AGI transition go well”, so I’m fine with using it until we find a better term.
I think “AI alignment” is a good term for the technical side of differentially producing good outcomes from AI, though it’s an imperfect term insofar as it collides with Stuart Russell’s “value alignment” and Paul Christiano’s “intent alignment”. (The latter, at least, better subsumes a lot of the core challenges in making AI go well.)
Perhaps using “doom” more could work (doom encompasses extinction, permanent curtailment of future potential, and fates worse than extinction).
Good post, thank you.
”or other such nonsense that advocates never taking on risks even when the benefits clearly dominate”
An important point to note here—the people who suffer the risks and the people who reap the benefits are very rarely the same group. Deciding whether to use an unsafe AI system (whether presently or in the far future) via a risk/benefit analysis goes wrong so often because one man’s risk is another man’s benefit.
Example: The risk of lung damage from traditional coal mining compared to the industrial value of the coal is a very different risk/reward analysis for the miner and the mine owner. Same with AI.
If an alignment-minded person is currently doing capabilities work under the assumption that they’d be replaced by an equally (or more) capable researcher less concerned about alignment, I think that’s badly mistaken. The number of people actually pushing the frontier forward is not all that large. Researchers at that level are not fungible; the differences between the first-best and second-best available candidates for roles like that are often quite large. The framing of an arms race is mistaken; the prize for “winning” is that you die sooner. Dying later is better. If you’re in a position like that I’d be happy to talk to you, or arrange for you to talk to another member of the Lightcone team.
I do not significantly credit the possibility that Google (or equivalent) will try to make life difficult for people who manage to successfully convince the marginal capabilities researcher to switch tracks, absent evidence. I agree that historical examples of vaguely similar things exist, but the ones I’m familiar with don’t seem analogous, and we do in fact have fairly strong evidence about the kinds of antics that various megacorps get up to, which seem to be strongly predicted by their internal culture.
I’d be interested in the historical record for similar industries, could you quickly list some examples that come to your mind? No need to elaborate much.