Quotes: Recent discussions of backfire risks in AI safety
Some thinkers in AI safety have recently pointed out various backfire effects that attempts to reduce AI x-risk can have. I think pretty much all of these effects were known before,[1] but it’s helpful to have them front of mind. In particular, I’m skeptical that we can weigh these effects against the upsides precisely enough to say an AI x-risk intervention is positive or negative in expectation, without making an arbitrary call. (Even if our favorite intervention doesn’t have these specific downsides, we should ask if we’re pricing in the downsides (and upsides) we haven’t yet discovered.)

(Emphasis mine, in all the quotes below.)

Holden’s Oct 2025 80K interview:
Holden: I mean, take any project. Let’s just take something that seems really nice, like alignment research. You’re trying to detect if the AI is scheming against you and make it not scheme against you. Maybe that’ll be good. But maybe the thing you’re doing is something that is going to get people excited, and then they’re going to try it instead of doing some other approach. And then it doesn’t work, and the other approach would have worked. Well, now you’ve done tremendous harm. Maybe it will work fine, but it will give people a false sense of security, make them think the problem is solved more than it is, make them move on to other things, and then you’ll have a tremendous negative impact that way.
Rob Wiblin: Maybe it’ll be used by a human group to get more control, to more reliably be able to direct an AI to do something and then do a power grab.
Holden: [...] Maybe it would have been great if the AIs took over the world. Maybe we’ll build AIs that are not exactly aligned with humans; they’re actually just much better — they’re kind of like our bright side, they’re the side we wish we were. [...]
[… M]aybe alignment is just a really… What it means is that you’re helping make sure that someone who’s intellectually unsophisticated — that’s us, that’s humans — remains forever in control of the rest of the universe and imposes whatever dumb ideas we have on it forevermore, instead of having our future evolve according to things that are much more sophisticated and better reasoners following their own values.
[...]
Holden: I just think AI is too multidimensional, and there’s too many considerations pointing in opposite directions. I’m worried about AIs taking over the world, but I’m also worried about the wrong humans taking over the world. And a lot of those things tend to offset each other, and making one better can make the other worse. [...]
[T]here’s also a lot of micro ways in which you could do harm. Just literally working in safety and being annoying, you might do net harm. You might just talk to the wrong person at the wrong time, get on their nerves. I’ve heard lots of stories of this. Just like, this person does great safety work, but they really annoyed this one person, and that might be the reason we all go extinct.
[...]
Option value in the policy world is kind of a bad concept anyway. A lot of times when you’re at a nonprofit or a company and you don’t know what to do, you try and preserve option value. But giving the government the option to go one way or the other, that’s not a neutral intervention — it’s just like you don’t know what they’re going to do with that option. Giving them the option could have been bad. … you don’t know who’s going to be in power when, and whether they’re going to have anything like the goals that you had when you put in some power that they had. I know people have been excited at various points about giving government more power and then at other points giving government less power.
And all this stuff, I mean, this one axis you’re talking about: centralisation of power versus decentralisation. Most things that touch policy at all in any way will move us along that spectrum in one direction or another, so therefore have a high chance of being negative [...]
And then most things that you can do in AI at all will have some impact on policy. Even just alignment research: policy will be shaped by what we’re seeing from alignment research, how tractable it looks, what the interventions look like.
[… I]n AI, it’s easier to annoy someone and polarise them against you, because whatever it is you’re trying to do, there’s some coalition that’s trying to do the exact opposite. In certain parts of global health and farm animal welfare, there’s certainly people who want to prioritise it less, but it doesn’t have the same directional ambiguity.
Helen Toner’s Nov 2025 80K interview:

Helen: And I think there’s a natural tension here as well among some people who are very concerned about existential risk from AI, really bad outcomes, and AI safety: there’s this sense that it’s actually helpful if there’s only a smaller number of players. Because, one, they can coordinate better — so maybe if racing leads to riskier outcomes, if you just have two top players, they can coordinate more directly than if you have three or four or 10 — and also a smaller number of players is going to be easier for an outside body to regulate, so if you just have a small number of companies, that’s going to be easier to regulate.
[...] But the problem is then the “Then what?” question of, if you do manage to avoid some of those worst-case outcomes, and then you have this incredibly powerful technology in the hands of a very small number of people, I think just historically that’s been really bad. It’s really bad when you have small groups that are very powerful, and typically it doesn’t result in good outcomes for the rest of the world and the rest of humanity.
[...]
Rob: I feel like we’re in a very difficult spot, because so many of the obvious solutions that you might have, or approaches you might take to dealing with loss of control do make the concentration of power problem worse and vice versa. So what policies you favour and disfavour depends quite sensitively on the relative risk of these two things, the relative likelihood of things going negatively in one way versus the other way.
And at least on the loss of control thing, people disagree so much on the likelihood. People who are similarly informed, know about everything there is to know about this, go all the way from thinking it’s a 1-in-1,000 chance to it’s a 1-in-2 chance — a 0.1% likelihood to 50% chance that we have some sort of catastrophic loss of control. And discussing it leads sometimes to some convergence, but people just have not converged on a common sense of how likely this outcome is.
So the people who think it’s 50% likely that we have some catastrophic loss-of-control event, it’s understandable that they think, “Well, we just have to make the best of it. Unfortunately, we have to concentrate. It’s the only way. And the concentration of power stuff is very sad and going to be a difficult issue to deal with, but we have to bear that cost.” And people who think it’s one in 1,000 are going to say, “This is a terrible move that you’re making, because we’re accepting much more risk, we’re creating much more risk than we’re actually eliminating.”
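To make concrete how sharply the sign of that tradeoff flips with one’s probability estimate, here is a toy expected-value sketch (mine, not anything from the interview; the risk-reduction and power-concentration numbers are invented purely for illustration):

```python
# Toy model (illustrative numbers only): an intervention that concentrates power
# in order to reduce loss-of-control risk. Its net expected value flips sign
# depending on how likely you think catastrophic loss of control is to begin with.

def net_value(p_loss_of_control: float,
              risk_reduction: float = 0.5,
              added_power_grab_risk: float = 0.02) -> float:
    """Expected catastrophe probability averted minus expected catastrophe
    probability created, treating both failure modes as comparably bad."""
    averted = p_loss_of_control * risk_reduction  # loss-of-control risk removed
    created = added_power_grab_risk               # concentration-of-power risk added
    return averted - created

for p in (0.001, 0.5):  # the 1-in-1,000 and 1-in-2 camps Rob describes
    print(f"P(loss of control) = {p}: net value = {net_value(p):+.4f}")
# P = 0.001 -> -0.0195  (the intervention creates more risk than it removes)
# P = 0.5   -> +0.2300  (the intervention looks clearly worth the cost)
```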
Wei Dai, “Legible vs. Illegible AI Safety Problems”:

Some AI safety problems are legible (obvious or understandable) to company leaders and government policymakers, implying they are unlikely to deploy or allow deployment of an AI while those problems remain open (i.e., appear unsolved according to the information they have access to). But some problems are illegible (obscure or hard to understand, or in a common cognitive blind spot), meaning there is a high risk that leaders and policymakers will decide to deploy or allow deployment even if they are not solved. (Of course, this is a spectrum, but I am simplifying it to a binary for ease of exposition.)
From an x-risk perspective, working on highly legible safety problems has low or even negative expected value. Similar to working on AI capabilities, it brings forward the date by which AGI/ASI will be deployed, leaving less time to solve the illegible x-safety problems.
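As a minimal sketch of that timing argument (my own framing with invented numbers, not anything from the post): suppose deployment is triggered as soon as the legible problems appear solved, and the illegible problems get solved only if enough time passes before then.

```python
# Toy model (invented numbers): deployment happens once the legible safety
# problems look solved, so speeding up legible-problem work shortens the window
# available for solving the illegible ones.

def p_illegible_solved_in_time(years_until_legible_looks_solved: float,
                               illegible_progress_per_year: float = 0.04) -> float:
    """P(illegible problems solved before deployment), assuming progress on the
    illegible problems accrues linearly with time and is capped at certainty."""
    return min(1.0, years_until_legible_looks_solved * illegible_progress_per_year)

print(p_illegible_solved_in_time(15))  # ~0.6  : slower legible progress, longer window
print(p_illegible_solved_in_time(8))   # ~0.32 : faster legible progress, shorter window
```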
[1] Among other sources, see my compilations of backfire effects here and here, and discussion of downside risks of capacity-building / aiming for “option value” or “wiser futures” here.