A thing I wrote on social media a few months ago, in response to someone asking if an AI warning shot might happen:
[… I]t’s a realistic possibility, but I’d guess it won’t happen before AI destroys the world, and if it happens I’m guessing the reaction will be stupid/panicky enough to just make the situation worse.
(It’s also possible it happens before AI destroys the world, but six weeks before rather than six years before, when it’s too late to make any difference.)
A lot of EAs feel confident we’ll get a “warning shot” like this, and/or are largely predicating their AI strategy on “warning-shot-ish things will happen and suddenly everyone will get serious and do much more sane things”. Which doesn’t sound like, e.g., how the world reacted to COVID or 9/11, though it sounds a bit like how the world (eventually) reacted to nukes and maybe to the recent invasion of Ukraine?
Someone then asked why I thought a warning shot might make things worse, and I said:
It might not buy time, or might buy orders of magnitude less time than would actually matter; and/or some combination of the following:
- the places that are likely to have the strictest regulations are (maybe) the most safety-conscious parts of the world. So you may end up slowing down the safety-conscious researchers much more than the reckless ones.
- more generally, it’s surprising and neat that the frontrunner (DM) is currently one of the least allergic to thinking about AI risk. I don’t think it’s anywhere near sufficient, but if we reroll the dice, we should by default expect a worse frontrunner.
- regulations and/or safety research are misdirected, because people have bad models now and are therefore likely to have bad models when the warning shot happens, and warning shots don’t instantly fix bad underlying models.
The problem is complicated, and steering in the right direction requires that people spend time (often years) setting multiple parameters to the right values in a world-model. Warning shots might at best fix a single parameter, ‘level of fear’, not transmit the whole model. And even if people afterwards start thinking more seriously and thereby end up with better models down the road, their snap reaction to the warning shot may lock in sticky bad regulations, policies, norms, culture, etc., because they don’t already have the right models before the warning shot happens.
- people tend to make worse decisions (if it’s a complicated issue like this, not just ‘run from tiger’) when they’re panicking and scared and feeling super rushed. As AGI draws visibly close / more people get scared (if either of those things ever happens), I expect more person-hours spent on the problem, but I also expect more rationalization, rushed and motivated reasoning, friendships and alliances breaking under the emotional strain, uncreative and on-rails thinking, unstrategic flailing, race dynamics, etc.
- if regulations or public backlash do happen, these are likely to sour a lot of ML researchers on the whole idea of AI safety and/or sour them on xrisk/EA ideas/people. Politicians or the public suddenly caring or getting involved can easily cause a counter-backlash that makes AI alignment progress even more slowly than it would have by default.
- software is not very regulatable; software we don’t understand well enough to define is even less regulatable; whiteboard ideas are less regulatable still; you can probably run an AGI on a not-expensive laptop eventually; etc.
So regulation is mostly relevant as a way to try to slow everything down indiscriminately, rather than as a way to specifically target AGI; and it would be hard to make regulation have a large effect even on that front, even if an indiscriminate slowdown would be net-positive.
- a warning shot could convince everyone that AI is super powerful and important and we need to invest way more in it.
- (insert random change to the world I haven’t thought of, because events like these often have big random hard-to-predict effects)
Any given big random change will tend to be bad on average, because the end-state we want requires setting multiple parameters to pretty specific values, and any randomizing effect is more likely to break a parameter we already have in approximately the right place than to coincidentally set a parameter to exactly the right value.
There are far more ways to set the world to the wrong state than the right one, so adding entropy will usually make things worse.
We may still need to make some high-variance choices like this, if we think we’re just so fucked that we need to reroll the dice and hope to have something good happen by coincidence. But this is very different from expecting the reaction to a warning shot to be a good one. (And even in a best-case scenario we’ll need to set almost all of the parameters via steering rather than via rerolling; rerolling can maybe get us one or even two values close-to-correct if we’re crazy lucky, but the other eight values will still need to be locked in by optimization, because relying on ten independent one-in-ten coincidences to happen is obviously silly; see the arithmetic sketched just after this list.)
- oh, [redacted]‘s comments remind me of a special case of ‘worse actors replace the current ones’: AI is banned or nationalized and the UK or US government builds it instead. To my eye, this seems a lot likelier to go poorly than the status quo.
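To put a number on the ‘ten independent one-in-ten coincidences’ remark in the rerolling point above (a purely illustrative calculation, using that parenthetical’s made-up figures rather than anything load-bearing): if each of ten parameters independently had a 1-in-10 chance of landing in the right place by luck, then the chance of all ten landing right by luck alone is

$$\left(\tfrac{1}{10}\right)^{10} = 10^{-10},$$

whereas deliberately setting eight of them and getting lucky on the remaining two is a comparatively sane $\left(\tfrac{1}{10}\right)^{2} = 1\%$.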
There are plenty of scenarios that I think make the world go a lot better, but I don’t think warning shots are one of them.
(One might help somewhat, if it happened; it’s mostly just hard to say, and we’ll need other major changes to happen first. Those other major changes are more the thing I’d suggest focusing on.)
Someone asked: “What are they? (I don’t think anyone has written a ‘scenarios that make the world go a lot better’ post/doc; it might be useful.)”
From another subthread:
I’m much more excited by scenarios like: ‘a new podcast comes out that has top-tier-excellent discussion of AI alignment stuff, it becomes super popular among ML researchers, and the culture, norms, and expectations of ML thereby shift such that water-cooler conversations about AGI catastrophe are more serious, substantive, informed, candid, and frequent’.
It’s rare for a big positive cultural shift like that to happen; but it does happen sometimes, and it can result in very fast changes to the Overton window. And since it’s a podcast containing many hours of content, there’s the potential to seed subsequent conversations with a lot of high-quality background thoughts.
To my eye, that seems more like the kind of change that might shift us from a current trajectory of “~definitely going to kill ourselves” to a new trajectory of “viable chance of an existential win”.
Whereas warning shots feel more unpredictable to me, and even if they do help, I expect the helpfulness to at best look like “we were almost on track to win, and then the warning shot nudged us just enough to secure a win”.
That feels to me like the kind of event that (if we get lucky and a lot of things go well) could shift us onto a winning trajectory. Obviously, another such event would be some sort of technical breakthrough that makes alignment a lot easier.
Someone else asked:
How are you operationalizing “warning shot”, and how low do you think the chances are that a warning shot, operationalized that way, happens and the world is not destroyed within e.g. the following three years?
I’m thinking of a ‘warning shot’ roughly as ‘an event where AI is widely perceived to have caused a very large amount of destruction’. Maybe loosely operationalized as ‘an event about as sudden as 9/11, and at least one-tenth as shocking, tragic, and deadly as 9/11’.
I don’t have stable or reliable probabilities here, and I expect that other MIRI people would give totally different numbers. But my current ass numbers are something like:
- 12% chance of a warning shot happening at some point.
- 6% chance of a warning shot happening more than 6 years before AGI destroys and/or saves the world.
- 10% chance of a warning shot happening 6 years or less before AGI destroys and/or saves the world.
My current unconditional odds on humanity surviving are very low (with most of my optimism coming from the fact that the future is just inherently hard to predict). Stating some other ass numbers:
Suppose that things go super well for alignment and timelines aren’t that short, such that we achieve AGI in 2050 and have a 10% chance of existential success. In that world, if we held as much as possible constant except that AGI comes in 2060 instead of 2050, then I’m guessing that would double our success odds to 20%.
If we invented AGI in 2050 and somehow impossibly had fifteen years to work with AGI systems before anyone would be able to destroy the world, and we knew as much, then I’d imagine our success odds maybe rising from 10% to 55%.
The default I expect instead is that the first AGI developer will have more than three months, and less than five years, before someone else destroys the world with AGI. (And on the mainline, I expect them to have ~zero chance of aligning their AGI, and I expect everyone else to have ~zero chance as well.)
If a warning shot had a large positive impact on our success probability, I’d expect it to look something like:
‘20 or 30 or 40 years before we’d naturally reach AGI, a huge narrow-AI disaster unrelated to AGI risk occurs. This disaster is purely accidental (not terrorism or whatever). Its effect is mainly just to bring it into the Overton window for a wider variety of serious technical people to talk about scary AI outcomes at all, and maybe it slows timelines by five years or something. Also, somehow none of this causes discourse to become even dumber; e.g., people don’t start dismissing AGI risk because “the real risk is narrow AI systems like the one we just saw”, and there isn’t a big ML backlash to regulatory/safety efforts, and so on.’
I don’t expect anything at all like that to happen, not least because I suspect we may not have 20+ years left before AGI. But that’s a scenario where I could (maybe, optimistically) imagine real, modest improvements.
If I imagine that, absent any warning shots, AGI is coming in 2050, and there’s a 1% chance of things going well in 2050, then:
If we add a warning shot in 2023, then I’d predict something like: 85% chance it has no major effect, 12% chance it makes the situation a lot worse, 3% chance it makes the situation a lot better. (I.e., an 80% chance that if it has a major effect, it makes things worse.)
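(To spell out the arithmetic behind that last parenthetical, using only the numbers above: conditional on the warning shot having a major effect at all,

$$P(\text{a lot worse} \mid \text{major effect}) = \frac{0.12}{0.12 + 0.03} = \frac{0.12}{0.15} = 0.8,$$

which is where the 80% figure comes from.)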
This still strikes me as worth thinking about some, in part because these probabilities are super unreliable. But mostly I think EAs should set aside the idea of warning shots and think more about things we might be able to cause to happen, and things that have more effects like ‘shift the culture of ML specifically’ and/or ‘transmit lots of bits of information to technical people’, rather than ‘make the world as a whole panic more’.
I’m much more excited by scenarios like: ‘a new podcast comes out that has top-tier-excellent discussion of AI alignment stuff, it becomes super popular among ML researchers, and the culture, norms, and expectations of ML thereby shift such that water-cooler conversations about AGI catastrophe are more serious, substantive, informed, candid, and frequent’.
It’s rare for a big positive cultural shift like that to happen; but it does happen sometimes, and it can result in very fast changes to the Overton window. And since it’s a podcast containing many hours of content, there’s the potential to seed subsequent conversations with a lot of high-quality background thoughts.
To my eye, that seems more like the kind of change that might shift us from a current trajectory of “~definitely going to kill ourselves” to a new trajectory of “viable chance of an existential win”.
Whereas warning shots feel more unpredictable to me, and even if they do help, I expect the helpfulness to at best look like “we were almost on track to win, and then the warning shot nudged us just enough to secure a win”.
Someone replied: “I thought this was great because it has concrete and specific details, times, and probabilities. I think it’s really hard to write these out.”
Well, it’s less effort insofar as these are very low-confidence, unstable ass numbers. I wouldn’t want to depend on a plan that assumes there will be no warning shots, or a plan that assumes there will be some.
Someone else replied: “Very interesting analysis—thank you.”