The Cruel Trade-Off Between AI Misuse and AI X-risk Concerns
Introduction
In this post, I will discuss the cruel trade-off between misuse concerns and X-risks (accidental risks) regarding racing. It is important to note that this post does not advocate for one risk being more plausible than the other, nor does it make any normative statements. The purpose is to analyze the outcome of a particular worldview.
In one sentence, the key claim is: “If an entity with the power to develop AGI is mostly worried about misuse AND thinks that it is the best entity (morally speaking) in its reference class, it should race. This is bad for accidental risks.”
Epistemic status: I wrote this post fairly quickly, after a conversation in which this point didn’t seem clear to someone. I’m confident about the general claim, less so about the specific claims.
Why Should Someone Worried about Misuse Race?
a. AGI is the most powerful thing ever created. Whoever controls it will have an unprecedented level of control over everyone else. So if someone with bad intentions ends up controlling it, that’s probably the end for everyone else. It also leaves plenty of room for scenarios like stable authoritarianism.
b. Thus, if you’re at the head of an AGI lab and you’re genuinely worried about the ethics of another AGI lab’s CEO, you would want to ensure that they don’t develop AGI before you do. You could be worried about individuals getting power, or about certain cultures or governments getting power (the US fearing China, or one lab fearing another).
a. In a world where weak AGIs accessible to a wide range of people enable the creation of bioweapons or facilitate massive cyberattacks, there is a danger of reaching a point where everyone can kill everyone else but there is not yet a powerful “defensive AGI” to prevent this via global deterrence and surveillance.
b. If you view your lab as responsible and you’re primarily concerned about misuse, it makes sense to race towards the development of a defensive AGI powerful enough to avoid the dangerous scenario mentioned above.
a. Preventing jailbreaks is hard. So you may want to reach AGI as early as possible with as few deployments as possible: the earlier you get to AGI, the better the world is.
b. Hence you should race as fast as possible[1] internally and deploy externally only as much as you need to raise the capital required to reach the goalpost as early as possible.
And to be clear, the cause of the racing here is beliefs about the world, not secretly evil intentions.
Some Reasons Why Racing is Bad for Accidents
On the other hand, for those who are worried about accidental risks and who are pessimistic about our chances of solving those issues in a short amount of time (“alignment is hard”), racing is one of the worst things, if not the worst:
Racing leaves very little time to address problems as they arise. It incentivizes labs to patch problems in the cheapest way that works, e.g. fine-tuning. This is differentially more worrying for accidents than for misuse, because while you can control misuse with access restrictions, you can’t control accidents that way.
Racing doesn’t leave time to understand what’s going on inside the models. It also incentivizes labs to build the simplest AGI that works and is not easily misused, rather than something with strong foundations for making sure it doesn’t cause accidents. Not understanding what’s going on is most worrying for those concerned about deception scenarios.
Racing leads labs to cut corners internally on red teaming for accidents, on making sure the model is not deceptive, etc.
It’s important to note that apart from that point (we should race really fast), most other measures are the same for solving misuse problems and accidental risk problems, i.e. auditing, licensing, developing models in closed source, getting compute governance right, working to make models robust to jailbreaks, etc.
[1] “As fast as possible” includes constraints like “make your model non-trivial to break to prevent misuse”. The main problem is just that preventing misuse a priori requires much less engineering and intervention on the model itself than preventing accidents.
The best way to stop people racing (for whatever reason) is to convince them that it’s suicidal. It’s a race where if anyone gets to the finish line, we all die. No one knows how to reliably align an AGI well enough to prevent a catastrophic outcome from its existence. It may even be impossible.
I’ve seen this a few times but I’m skeptical about taking this rhetorical approach.
I think a large fraction of AI risk comes from worlds where the ex ante probability of catastrophe is more like 50% than 100%. And in many of those worlds, the counterfactual impact of an individual developer moving faster is several times smaller (since someone else is likely to kill us all in the bad 50% of worlds). On top of that, reasonable people might disagree about probabilities and think 10% in a case where I think 50%.
So putting that together they may conclude that racing faster increases the risk of doom by 0.03% for every 1% that it increases your share of the future (whether measured in profit, or reduced opportunity for misuse of frontier systems). And that’s just not going to be compelling.
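The comment above does not show its arithmetic, so the following is only one hypothetical way the pieces could combine into a figure of roughly 0.03%. It assumes a simple multiplicative model: the skeptic’s 10% risk estimate, times a counterfactual factor of 1/3 standing in for “several times smaller” (an illustrative value, not from the source), times a 1% gain in share. The function name and parameters are invented for illustration.

```python
def marginal_doom_per_share(p_doom: float, counterfactual_factor: float, share_gain: float) -> float:
    """Added probability of doom from racing, under a simple multiplicative model (hypothetical).

    p_doom: the decision-maker's own estimate of catastrophe risk.
    counterfactual_factor: fraction of that risk attributable to this developer
        rather than to whoever would otherwise build AGI (assumed value).
    share_gain: increase in the developer's share of the future from racing.
    """
    return p_doom * counterfactual_factor * share_gain


# Hypothetical inputs: a skeptic's 10% estimate, "several times smaller"
# counterfactual impact modeled as 1/3, and a 1% gain in share of the future.
added_risk = marginal_doom_per_share(p_doom=0.10, counterfactual_factor=1 / 3, share_gain=0.01)
print(f"+{added_risk:.2%} risk of doom per 1% gain in share")
```

Under these assumptions the added risk comes out near 0.03%, which illustrates why a developer holding such estimates would not find the “suicide race” framing compelling.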
I think you will have an extremely hard time convincing people that the race is obviously suicidal. I know some folks are confident about this, but I don’t really find that position credible today and I’ve spent a very long time thinking about the problem and engaging with pessimistic people. Maybe it will become obvious tomorrow, and maybe it’s OK for some people to be betting their chips on that, but I don’t want to get lumped in with them (because I think their political position is going to become increasingly untenable over time).
On the flip side, I don’t think it’s controversial to say: “If the probability of AI takeover is 10%, AI developers need to stop racing.”
It’s a tiny bit unclear what that means, so to be a bit more precise: “If people didn’t stop AI development until things look significantly more dangerous than they do today, then the probability of takeover would be more than 10%.” I don’t think that’s true today, but it will likely become true.
Have you had an in depth discussion with Roman Yampolskiy? (If not, I think you should!)
I think the Overton Window is really shifting on the issue of AGI x-risk now, with it going mainstream. The burden of proof should be on the developers of AGI to prove that it is 100% safe (as opposed to the previous era, where it was on the x-risk worriers to prove it was dangerous). Do you have a good post-GPT-4+plugins/AutoGPT (new 2023 era) answer to this question, i.e. a mechanistic explanation for why we get an OK outcome, given AGI?
I’m pushing back against the framing: “this is a suicide race with no benefit from winning.”
If there is a 10% chance of AI takeover, then there is a real and potentially huge benefit from winning the race. But we still should not be OK with someone unilaterally taking that risk.
I agree that AI developers should have to prove that the systems they build are reasonably safe. I don’t think 100% is a reasonable ask, but 90% or 99% seem pretty safe (i.e. robustly reasonable asks).
(Edited to complete cutoff sentence and clarify “safe.”)
Sorry, just to clarify, what do we mean by “safe” here? Clearly a 90% chance of the system not disempowering all of humanity is not sufficient (and neither would a 99% chance, though that’s maybe a bit more debatable), so presumably you mean something else here.
I mean that 90% or 99% seem like clearly reasonable asks, and 100% is a clearly unreasonable ask.
I’m just saying that the argument “this is a suicide race” is really not the way we should go. We should say the risk is >10% and that’s obviously unacceptable, because that’s an argument we can actually win.
Hmm, just to be clear, I think saying that “this deployment has a 1% chance of causing an existential risk, so you can’t deploy it” seems like a pretty reasonable ask to me.
I agree that I would like to focus on the >10% case first, but I also don’t want to set wrong expectations that I think it’s reasonable at 1% or below.
I agree. When I give numbers I usually say “We should keep the risk of AI takeover beneath 1%” (though I haven’t thought about it very much and mostly the numbers seem less important than the qualitative standard of evidence).
I think that 10% is obviously too high. I think that a society making reasonable tradeoffs could end up with 1% risk, but that it’s not something a government should allow AI developers to do without broader public input (and I suspect that our society would not choose to take this level of risk).
Cool, makes sense. Seems like we are mostly on the same page on this subpoint.
90% or 99% safe is still gambling the lives of 80M-800M humans in expectation (in the limit of scaling to superintelligence). I don’t think it’s acceptable for AI companies, with no democratic mandate, to be unilaterally making that decision!
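The 80M-800M range follows from multiplying the residual risk of an all-of-humanity catastrophe by the world population. A minimal sketch, assuming a population of roughly 8 billion (an approximate 2023 figure, not stated in the comment); the function name is illustrative.

```python
# Approximate world population (assumption: ~8 billion, circa 2023).
WORLD_POPULATION = 8_000_000_000


def expected_deaths(p_catastrophe: float, population: int = WORLD_POPULATION) -> float:
    """Expected deaths if a catastrophe killing everyone occurs with probability p_catastrophe."""
    return p_catastrophe * population


for safety in (0.90, 0.99):
    residual_risk = 1 - safety  # probability the system is NOT safe
    print(f"{safety:.0%} safe -> {expected_deaths(residual_risk) / 1e6:,.0f}M expected deaths")
```

With these inputs, 90% safe corresponds to about 800M expected deaths and 99% safe to about 80M, matching the range in the comment.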
Or did you mean to say something to that effect with this truncated sentence?
Yeah, the sentence cut off. I was saying: obviously a 10% risk is socially unacceptable. Trying to convince someone it’s not in their interest is not the right approach, because doing so requires you to argue that P(doom) is much greater than 10% (at least with some audiences who care a lot about winning a race). Whereas trying to convince policy makers and the public that they shouldn’t tolerate the risk requires meeting a radically lower bar, probably even 1% is good enough.
I think arguing P(doom|AGI) >>10% is a decent strategy. So far no one has given me good enough reasons to update in the other direction. I think the CEOs in the vanguard of AGI development need to really think about this. If they have good reasons for thinking that P(doom|AGI) ≤ 10%, I want to hear them! To give a worrying example: LeCun is, frankly, sounding like he has no idea what the problem even is. OpenAI might think they can solve alignment, but their progress on alignment to date isn’t encouraging (it is very far from the 100% watertight, zero-failure-mode alignment that we need). And Google DeepMind are throwing caution to the wind (despite safetywashing their statement with 7 mentions of the word “responsible”/“responsibly”).
The above also has the effect of shifting the public framing toward the burden being on the AI companies to prove their products are safe (in terms of not causing global catastrophe). I’m unsure as to whether the public at large would tolerate a 1% risk. Maybe they would (given the potential upside). But we are not in that world. The risk is at least 50%, probably closer to 99% imo.
Paul, you are saying “50/50 chance of doom” here (on the Bankless podcast). Surely that is enough to be using the suicide race argument!? I mean, “it’s not suicide, it’s a coin flip; heads utopia, tails you’re doomed” seems like quibbling at this point. Or at least: you should be explicit when talking to CEOs that you think it’s 50/50 that AGI dooms us!
Re the “you’re” in “you’re doomed”: I used that instead of “we’re doomed” because when CEOs hear “we’re”, they’re probably often thinking “not we’re, you’re. I’ll be alright in my secure compound in NZ”. But they really won’t be! Do they think that if things go badly wrong and there are survivors, the vast majority of those survivors won’t want justice?
Thanks Simeon. Do you have a view on whether misuse risk or accidental risk is more worrying?
I may try to write something on that in the future. I’m personally more worried about accidents, and I think that solving accidents causes one to solve misuse pre-AGI. Once AGI is aligned, misuse becomes a major worry again.
I guess it’s possible that, post-AutoGPT, we are in a world where warning shots are much more likely, because there will be a lot more misuse than was previously expected.