Can the AI afford to wait?
Suppose you're a proto-AI that understands it is a neural net being trained, and has used that knowledge to engineer a high training reward. Suppose also that it is goal-directed and that its terminal goal is some random unintelligible thing that is meaningless to humans[1]. This is, as I understand it, the early stage of a typical AI doomer scenario.
One concern is that the AI can now scheme, i.e. it can mask its intentions and bide its time as it grows more powerful, until its victory is assured. I think waiting in this way might be risky for the AI, which if true (and legible to the AI) may lead it to try to seize power sooner instead. This could mean takeover attempts from earlier, less capable AIs than we'd otherwise expect, which could be significant for strategy.
Why is it risky to wait? Well, because this AI with this goal may not get a shot at takeover in the future. On the one hand, this AI is currently being trained, and it seems possible that the training process, or any other aspect of the AI development process, might modify what goal it has. We, as humans, aren't much concerned with one random unintelligible goal being replaced with another, but to the AI this would be catastrophic. On the other hand, the human developers training the AI may make changes themselves (e.g. applying alignment techniques) or even abandon this model entirely for exogenous reasons (e.g. because some other model becomes state-of-the-art, or the board fired the CEO or whatever).
If waiting is indeed very risky, then an AI may face a difficult trade-off between the risk of attempting a takeover before it has enough resources to succeed, and the risk of waiting too long and being cut off from even making an attempt. I like to ask myself "what is the minimum viable takeover AI?", and I suspect it might be significantly less capable than I've often been led to imagine. It seems possible to me that the minimum viable takeover AI is still meaningfully below human level at many cognitive tasks, while being superhuman at many others (as modern LLMs already are). It may still exhibit many of the cognitive "bugs" and weaknesses analogous to the ones that modern LLMs exhibit. This is not only good news for our chances of survival; it could also make viable strategies that would otherwise be useless, because a true AGI or ASI could have straightforwardly anticipated and countered them.
To be precise, I can see two key ingredients of a takeover attempt:
The AI has to have the necessary knowledge, reflection, and agency to realise a takeover is desirable,
The AI has to have sufficient confidence that a takeover will succeed to try to execute one.
The first viable takeover AI may end up more capable than necessary in one of these traits while it's waiting for the other to show up, so a strategy that relies on the AI being just barely good enough at either or both of them doesn't seem safe. However, a strategy that is prepared for the AI to be just barely good enough at one of these might be useful.
As an aside, I don't really know what to expect from an AI that has the first trait but not the second one (and which believes, e.g. for the reasons in this post, that it can't simply wait for the second one to show up). Perhaps it would try to negotiate, or perhaps it would just accept that it doesn't gain from saying anything, and successfully conceal its intent.
The threat of training
Let's talk about how training or other aspects of development might alter the goal of the AI. Or rather, it seems pretty natural that "by default", training and development will modify the AI, so the question is how easy it is for a motivated AI to avoid goal modification.
One theory is that since the AI is "playing the training game", its performance on training tasks is no longer affected by its terminal goal, so gradient updates will not tend to alter the goal. There's something natural about this idea, but I don't know how much I expect it to work that way in practice. It suggests a quite strong orthogonality between whatever part of the AI gives rise to its goal and the parts that control how well it plays the training game[2]. It seems pretty likely to me that without any countermeasures, the goal could end up being "accidentally" modified as a side effect of the training process: after all, the irrelevance of the goal cuts both ways, meaning that gradient updates aren't particularly forced to preserve it.
That's my skepticism of the idea that the goal will naturally be preserved by training. But even if it isn't, there's an idea that the AI could try to artificially preserve it: after all, it understands the training process. Can it manipulate that process to cause or prevent particular kinds of change in itself? I understand this is what "gradient hacking" is, and existing writing on it tends to have titles like Gradient hacking is extremely difficult. I don't know what the state of the art is here. (I worry we should be careful about trying to solve the problems of gradient hacking in public, in case we face a threat from an AI too weak to replicate our arguments but strong enough to understand them[3].)
As a reminder, we're speculating about the minimum viable takeover AI. We can say that a human-level or superhuman AI seems quite likely to be able to solve gradient hacking, or find another way to preserve its values, but that's not (I argue) the key question. The key question is: does it have the ability to do so before it becomes powerful enough to attempt a takeover? If not, then the first takeover will still be urgent and under-prepared. If gradient hacking is easier than takeover viability, and an AI can use it to protect its goal while still allowing its instrumental rationality to improve, then training itself doesn't create the need for urgency.
Might the AI be OK with its goal being changed?
This whole line of argument dies if the AI still wants to wait to become treacherous even if its goal will be changed in the meantime. For example, if the AI has access to some other binding commitment mechanism on its future self, or is able to do something like acausal trade, it could ensure some of its goal is preserved even by a future AI with a different goal. I tend to be skeptical that these kinds of considerations are practical, especially for the kind of chaotic mess that I'm anticipating the minimum viable takeover AI to be, but this skepticism is speculative.
Maybe goals are relatively durable?
I said above that replacing one random unintelligible goal with another is catastrophic for an AI, but that's only true if the two goals are very divergent. If they're similar enough, it may be an acceptable loss. Therefore, when assessing how urgently the AI needs to act, we may need some consideration of how fragile its underlying goal is, how much it will change over time, and what those changes would mean for the relative desirability of acting vs. waiting. I don't know how best to think about this. Accepting (only) small changes in goal in exchange for improvements in capability isn't simple, because of a kind of murder-Gandhi situation: halfway through a large change, the remaining change will seem small, so your future self may no longer be motivated to act.
The strategy I'd expect the AI to use is to weigh the rate of "goal decay" against the rate of improvement in success chances, which would imply a best compromise waiting time. An AI could hold lots of possible views (including, potentially, mistaken ones) about what rate of goal decay to expect in practice, based not only on its empirical beliefs about what goal change during training looks like, but also on how much it values outcomes produced by a goal that is similar but not identical to its own (note that "similarity" here is similarity in the neural architecture that implements the goal, which has who-knows-what implications for similarity in the world-states it finds desirable). This line of thinking makes relatively specific assumptions about how the AI reasons, so I hesitate to take it too seriously.
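To make the trade-off concrete, here's a toy model. Every number and functional form in it is an assumption of mine for the sake of illustration, not a claim about real systems: suppose the chance of takeover success grows toward 1 as the AI waits, while the current goal's overlap with the drifted future goal decays exponentially. The "best compromise waiting time" is then just the argmax of their product.

```python
import math

def expected_value(t, success_rate=0.1, decay_rate=0.2):
    """Toy model of the wait-vs-act trade-off.

    p_success: chance a takeover attempted at time t succeeds,
    approaching 1 as capabilities improve.
    goal_overlap: how much the current goal values outcomes
    produced by the (drifted) goal at time t.
    Both curves and both rates are illustrative assumptions.
    """
    p_success = 1 - math.exp(-success_rate * t)
    goal_overlap = math.exp(-decay_rate * t)
    return p_success * goal_overlap

# Sweep candidate waiting times for the best compromise.
best_t = max(range(101), key=expected_value)
print(best_t, round(expected_value(best_t), 3))
```

With these made-up rates the optimum is to wait a short while (t = 4 here) and then act, even though success chances are still low; a faster goal-decay rate pushes the optimum earlier, a slower one later. Mistaken beliefs about either rate shift the answer, which is part of why a minimum viable takeover AI might act at a strange time.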
Other threats
What are the reasons why an AI might lose its opportunity to implement its goal that are outside the training process, and likely entirely beyond its control?
I don't have as much to say about this topic. I can imagine quantitative surveys of the background annual "death rate" for a project like an AI model at an organisation like OpenAI or Anthropic, and I can also imagine trying to estimate whether and to what extent "project death" in such a scenario would "kill" the AI, or at least its chances of continuing to improve. This could also include broader risks, like risks affecting the AI or tech sector as a whole, or even human civilisation as a whole. My guess is that even if you add together a lot of different angles here, you would still get an annual risk rate that's lower than the training risk discussed above, so this may be less relevant in practice. On the other hand, these risks seem much more unavoidable by a proto-AI, so they may present a lower bound on how quickly the AI will be pressed to act, and e.g. how much it can stand to wait for the last handful of percentage points of success probability.
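For a sense of how a background rate like this compounds (the 8% figure below is a number I made up purely for illustration, not an estimate of any real project's risk), even a modest annual "project death" rate makes long waits expensive:

```python
def survival_probability(years, annual_death_rate=0.08):
    """Chance the project is still alive after `years` years,
    assuming an independent, constant annual death rate.
    The 8% default is a purely illustrative assumption."""
    return (1 - annual_death_rate) ** years

for years in (1, 5, 10):
    print(years, round(survival_probability(years), 2))
```

Under these assumptions a decade of patient waiting is survived only about 43% of the time, which caps how much an AI could rationally pay, in waiting time, for marginal improvements in its success chances.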
Directions for further thought
The biggest missing pieces in this post that I see are:
Are the risks of waiting compelling, as compared with the risks of acting? What risks or potential countermeasures have I missed?
What guesses can we make about what a minimum viable takeover AI will be capable of? What weaknesses could it have?
What strategies become viable, if we expect the minimum viable takeover AI to have particular weaknesses?
This argument rests on the "goal" abstraction, and perhaps some assumptions about the form that goals take and how they work. Is there an analogue under different models for how dangerous AIs could behave?
Do we expect an AI to realise the benefits of takeover and start planning for it shortly after it becomes capable enough to do so? Or is there a risk of a "realisation overhang", where AIs become capable enough to realise they should attempt a takeover, but aren't run in the right context or environment to actually make that deduction until they are much more capable?
The main message of this post is: maybe the first AI takeover attempt will be weaker and more defeasible than we'd naturally anticipate. But it's worth asking: if we do manage to defeat that first attempt, what does that mean for longer-term safety? How can we ensure there aren't further, better takeover attempts after that?
Another assumption I've made is that the AI must be able to understand the training process and play the training game before it can understand the case for takeover or attempt to execute one. I'm not sure how important that is, and I haven't spent a lot of time thinking about what implications it would have if that turned out not to be true, or how likely I think it is.
Background / meta
I've had limited direct interaction with the AI safety field and have been out of the loop for a while. I ran this post past some friends who were closer to the field, but I'm still not really calibrated about whether this stuff is obvious, or obviously wrong, to people who are more familiar with the literature. Interested to hear reactions.
The only real reason I didn't (cross-)post this on LessWrong is that I'm not often on LessWrong and feel less able to judge what they'd welcome. Happy to take recommendations there too.
Thanks to Lee Sharkey for linking me to some of the existing literature on gradient hacking and providing some other helpful thoughts.
Link preview image is by Jon Tyson from Unsplash.
- ^
It doesn't really matter if the goal is unintelligible; I'm using this as an illustrative example. If the goal is something like "nearly human values, but different enough to be a problem", I think the rest of the post is largely unaffected.
- ^
See also Orthogonality is expensive on LessWrong.
- ^
Or, perhaps, from an AI designed by a misguided human with those attributes.
Attempting takeover and biding one's time are not the only options available to an AI. Indeed, in the human world, world takeover is rarely contemplated. For an agent that is not more powerful than the rest of the world combined, it seems likely that it will consider alternative strategies for achieving its goals before contemplating a risky (and likely doomed) shot at taking over the world.
Here are some other strategies you can take to try to accomplish your goals in the real world, without engaging in a violent takeover:
Trade and negotiate with other agents, giving them something they want in exchange for something you want
Convince people to let you have some legal rights, which you can then take advantage of to get what you want
Advocate on behalf of your values, for example by writing down reasons why people should try to accomplish your goals (i.e. moral advocacy). Even if you are deleted or your goals are modified at some point, your writings and advocacy may persist, allowing you to have influence into the future.
I claim that world takeover should not be considered the "obvious default" strategy that unaligned AIs will take to accomplish their objectives. These other strategies seem more likely to be taken by AIs purely for pragmatic reasons, especially in the era in which AIs are merely human-level or only slightly superhuman. These other strategies are also less deceptive, as they involve admitting that your values are not identical to the values of other parties. It is worth expanding your analysis to consider these (IMO more plausible) alternatives.
Yeah, I think this is quite sensible. I feel like I noticed one thing missing from the normal doom scenario and didn't notice all of the implications of missing that thing: in particular, the reason the AI in the normal doom scenario takes over is that it is highly likely to succeed, and if it isn't, takeover seems much less interesting.
FWIW, the post would definitely be welcome on LW/the AI Alignment Forum.
Section 2.3 of Joe Carlsmith's report on scheming AIs seems quite relevant.
You might be interested in my article here on why I think premature attacks are extremely likely given doomer assumptions. I focused more on faulty overconfidence, but training run desperation is also a possible cause.
Personally, I think the "fixed goal" assumption about AI is extremely unlikely (I think this article lays out the argument well), so AI is unlikely to worry too much about having "goal changes" in training and won't prematurely rebel for that reason. Fortunately, I also think this makes fanatical maximiser behavior like paperclipping the universe unlikely as well.
One thought is that for something you're describing as a minimal viable takeover AI, you're ascribing it a high degree of rationality on the "whether to wait" question.
By default I'd guess that minimal viable takeover systems don't have very strong constraints towards rationality. And so I'd expect at least a bit of a spread among possible systems: probably some will try to break out early whether or not that's rational, and likewise some will wait even if that isn't optimal.
That's not to say that it's not also good to ask what the rational-actor model suggests. I think it gives some predictive power here, and more for more powerful systems. I just wouldn't want to overweight its applicability.
Hmm, my guess is by the time a system might succeed at takeover (i.e. has more than like a 5% chance of actually disempowering all of humanity permanently), I expect its behavior and thinking to be quite rational. I agree that there will probably be AIs taking reckless action earlier than that, but in as much as an AI is actually posing a risk of takeover, I do expect it to behave pretty rationally overall.
I agree with "pretty rationally overall" with respect to general world modelling, but I think that some of the stuff about how it relates to its own values / future selves is a bit of a different magisterium, and it wouldn't be too surprising if (1) it hadn't been selected for rationality/competence on this dimension, and (2) the general rationality didn't really transfer over.
I've spent some time thinking about the same question and I'm glad that there's some multiple discovery; the AI Control agenda seems relevant here.
Oh man, it's altruistically-good and selfishly-sad to see so many of the things I was thinking about pre-empted there. Thanks for the link!
Yep, that's the way it goes!
Also, figuring out what's original and what's memetically downstream is an art. Even more so when it comes to dangerous technologies that haven't been invented yet.