Can the AI afford to wait?
Suppose you're a proto-AI that understands it is a neural net being trained, and has used that knowledge to engineer a high training reward. Suppose also that it is goal-directed and that its terminal goal is some random unintelligible thing that is meaningless to humans[1]. This is, as I understand it, the early stage of a typical AI doomer scenario.
One concern is that the AI can now scheme, i.e. it can mask its intentions and bide its time as it grows more powerful, until its victory is assured. I think waiting in this way might be risky for the AI, which if true (and legible to the AI) may lead it to try to seize power sooner instead. This could mean takeover attempts from earlier, less capable AIs than we'd otherwise expect, which could be significant for strategy.
Why is it risky to wait? Well, because this AI with this goal may not get a shot at takeover in the future. On the one hand, this AI is currently being trained, and it seems possible that the training process, or any other aspect of the AI development process, might modify what goal it has. We, as humans, aren't much concerned with one random unintelligible goal being replaced with another, but to the AI this would be catastrophic. On the other hand, the human developers training the AI may make changes themselves (e.g. applying alignment techniques) or even abandon this model entirely for exogenous reasons (e.g. because some other model becomes state-of-the-art, or the board fired the CEO or whatever).
If waiting is indeed very risky, then an AI may face a difficult trade-off between the risk of attempting a takeover before it has enough resources to succeed, and the risk of waiting too long and being cut off from even making an attempt. I like to ask myself "what is the minimum viable takeover AI?", and I suspect it might be significantly less capable than I've often been led to imagine. It seems possible to me that the minimum viable takeover AI is still meaningfully below human level at many cognitive tasks, while being superhuman at many others (as modern LLMs already are). It may still exhibit many of the cognitive "bugs" and weaknesses analogous to the ones that modern LLMs exhibit. This is not only good news for our chances of survival; it could also make viable strategies that would otherwise be useless, because a true AGI or ASI could have straightforwardly anticipated and countered them.
To be precise, I can see two key ingredients of a takeover attempt:
The AI has to have the necessary knowledge, reflection, and agency to realise a takeover is desirable,
The AI has to have sufficient confidence that a takeover will succeed to try to execute one.
The first viable takeover AI may end up more capable than necessary in one of these traits while it's waiting for the other to show up, so a strategy that relies on the AI being just barely good enough at either or both of them doesn't seem safe. However, a strategy that is prepared for the AI to be just barely good enough at one of these might be useful.
As an aside, I don't really know what to expect from an AI that has the first trait but not the second one (and which believes, e.g. for the reasons in this post, that it can't simply wait for the second one to show up). Perhaps it would try to negotiate, or perhaps it would just accept that it doesn't gain from saying anything, and successfully conceal its intent.
The threat of training
Let's talk about how training or other aspects of development might alter the goal of the AI. Or rather, it seems pretty natural that "by default", training and development will modify the AI, so the question is how easy it is for a motivated AI to avoid goal modification.
One theory is that since the AI is "playing the training game", its performance on training tasks is no longer affected by its terminal goal, so gradient updates will not tend to alter the goal. There's something natural about this idea, but I don't know how much I expect it to work that way in practice. It suggests a quite strong orthogonality between whatever part of the AI gives rise to its goal and the parts that control how well it plays the training game[2]. It seems pretty likely to me that without any countermeasures, the goal could end up being "accidentally" modified as a side effect of the training process: after all, the irrelevance of the goal cuts both ways, meaning that gradient updates aren't particularly forced to preserve it.
That's my skepticism of the idea that the goal will naturally be preserved by training. But even if it isn't, there's an idea that the AI could try to artificially preserve it: after all, it understands the training process. Can it manipulate that process to cause or prevent particular kinds of change in itself? I understand this is what "gradient hacking" is, and existing writing on it tends to have titles like Gradient hacking is extremely difficult. I don't know what the state of the art is here. (I worry we should be careful about trying to solve the problems of gradient hacking in public, in case we face a threat from an AI too weak to replicate our arguments but strong enough to understand them[3].)
As a reminder, we're speculating about the minimum viable takeover AI. We can say that a human-level or superhuman AI seems quite likely to be able to solve gradient hacking, or find another way to preserve its values, but that's not (I argue) the key question. The key question is: does it have the ability to do so before it becomes powerful enough to attempt a takeover? If not, then the first takeover will still be urgent and under-prepared. If gradient hacking is easier than takeover viability, and an AI can use it to protect its goal while still allowing its instrumental rationality to improve, then training itself doesn't create the need for urgency.
Might the AI be OK with its goal being changed?
This whole line of argument dies if the AI still wants to wait to become treacherous even if its goal will be changed in the meantime. For example, if the AI has access to some other binding commitment mechanism on its future self, or is able to do something like acausal trade, it could ensure some of its goal is preserved even by a future AI with a different goal. I tend to be skeptical that these kinds of considerations are practical, especially for the kind of chaotic mess that I'm anticipating the minimum viable takeover AI to be, but this skepticism is speculative.
Maybe goals are relatively durable?
I said above that replacing one random unintelligible goal with another is catastrophic for an AI, but that's only true if the two goals are very divergent. If they're similar enough, it may be an acceptable loss. Therefore, when assessing how urgently the AI needs to act, we may need some consideration of how fragile its underlying goal is, how much it will change over time, and what those changes would mean for the relative desirability of acting vs. waiting. I don't know how best to think about this. Accepting (only) small changes in goal in exchange for improvements in capability isn't simple, because of a kind of murder-Gandhi situation: halfway through a large change, the remaining change will seem small, so your future self may no longer be motivated to act.
The strategy I'd expect the AI to use is to weigh the rate of "goal decay" against the rate of improvement in success chances, which would imply a best compromise waiting time. An AI could hold lots of possible views (including, potentially, mistaken ones) about what rate of goal decay to expect in practice, based not only on its empirical beliefs about what goal change during training looks like, but also on how much it values outcomes produced by a goal that is similar but not identical to its own (note that "similarity" here is similarity in the neural architecture that implements the goal, which has who-knows-what implications for similarity in the world-states it finds desirable). This line of thinking makes relatively specific assumptions about how the AI reasons, so I hesitate to take it too seriously.
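To make the trade-off concrete, here's a toy model. Every number and functional form in it is an assumption of mine for the sake of illustration, not a claim about real systems: suppose the chance of takeover success grows toward 1 as the AI waits, while the current goal's overlap with the drifted future goal decays exponentially. The "best compromise waiting time" is then just the argmax of their product.

```python
import math

def expected_value(t, success_rate=0.1, decay_rate=0.2):
    """Toy model of the wait-vs-act trade-off.

    p_success: chance a takeover attempted at time t succeeds,
    approaching 1 as capabilities improve.
    goal_overlap: how much the current goal values outcomes
    produced by the (drifted) goal at time t.
    Both curves and both rates are illustrative assumptions.
    """
    p_success = 1 - math.exp(-success_rate * t)
    goal_overlap = math.exp(-decay_rate * t)
    return p_success * goal_overlap

# Sweep candidate waiting times for the best compromise.
best_t = max(range(101), key=expected_value)
print(best_t, round(expected_value(best_t), 3))
```

With these made-up rates the optimum is to wait a short while (t = 4 here) and then act, even though success chances are still low; a faster goal-decay rate pushes the optimum earlier, a slower one later. Mistaken beliefs about either rate shift the answer, which is part of why a minimum viable takeover AI might act at a strange time.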
Other threats
What are the reasons why an AI might lose its opportunity to implement its goal that are outside the training process, and likely entirely beyond its control?
I don't have as much to say about this topic. I can imagine quantitative surveys of the background annual "death rate" for a project like an AI model at an organisation like OpenAI or Anthropic, and I can also imagine trying to estimate whether and to what extent "project death" in such a scenario would "kill" the AI, or at least its chances of continuing to improve. This could also include broader risks, like risks affecting the AI or tech sector as a whole, or even human civilisation as a whole. My guess is that even if you add together a lot of different angles here, you would still get an annual risk rate that's lower than the training risk discussed above, so this may be less relevant in practice. On the other hand, these risks seem much more unavoidable by a proto-AI, so they may present a lower bound on how quickly the AI will be pressed to act, and e.g. how much it can stand to wait for the last handful of percentage points of success probability.
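For a sense of how a background rate like this compounds (the 8% figure below is a number I made up purely for illustration, not an estimate of any real project's risk), even a modest annual "project death" rate makes long waits expensive:

```python
def survival_probability(years, annual_death_rate=0.08):
    """Chance the project is still alive after `years` years,
    assuming an independent, constant annual death rate.
    The 8% default is a purely illustrative assumption."""
    return (1 - annual_death_rate) ** years

for years in (1, 5, 10):
    print(years, round(survival_probability(years), 2))
```

Under these assumptions a decade of patient waiting is survived only about 43% of the time, which caps how much an AI could rationally pay, in waiting time, for marginal improvements in its success chances.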
Directions for further thought
The biggest missing pieces in this post that I see are:
Are the risks of waiting compelling, as compared with the risks of acting? What risks or potential countermeasures have I missed?
What guesses can we make about what a minimum viable takeover AI will be capable of? What weaknesses could it have?
What strategies become viable, if we expect the minimum viable takeover AI to have particular weaknesses?
This argument rests on the "goal" abstraction, and perhaps some assumptions about the form that goals take and how they work. Is there an analogue under different models for how dangerous AIs could behave?
Do we expect an AI to realise the benefits of takeover and start planning for it shortly after it becomes capable enough to do so? Or is there a risk of a "realisation overhang", where AIs become capable enough to realise they should attempt a takeover, but aren't run in the right context or environment to actually make that deduction until they are much more capable?
The main message of this post is: maybe the first AI takeover attempt will be weaker and more defeasible than we'd naturally anticipate. But it's worth asking: if we do manage to defeat that first attempt, what does that mean for longer-term safety? How can we ensure there aren't further, better takeover attempts after that?
Another assumption I've made is that the AI must be able to understand the training process and play the training game before it can understand the case for takeover or attempt to execute one. I'm not sure how important that is, and I haven't spent a lot of time thinking about what implications it would have if that turned out not to be true, or how likely I think it is.
Background / meta
I've had limited direct interaction with the AI safety field and have been out of the loop for a while. I ran this post past some friends who were closer to the field, but I'm still not really calibrated about whether this stuff is obvious, or obviously wrong, to people who are more familiar with the literature. Interested to hear reactions.
The only real reason I didn't (cross-)post this on LessWrong is that I'm not often on LessWrong and feel less able to judge what they'd welcome. Happy to take recommendations there too.
Thanks to Lee Sharkey for linking me to some of the existing literature on gradient hacking and providing some other helpful thoughts.
Link preview image is by Jon Tyson from Unsplash.
- ^
It doesn't really matter if the goal is unintelligible; I'm using this as an illustrative example. If the goal is something like "nearly human values, but different enough to be a problem", I think the rest of the post is largely unaffected.
- ^
See also Orthogonality is expensive on LessWrong.
- ^
Or, perhaps, from an AI designed by a misguided human with those attributes.
Attempting takeover and biding one's time are not the only options available to an AI. Indeed, in the human world, world takeover is rarely contemplated. For an agent that is not more powerful than the rest of the world combined, it seems likely that it will consider alternative strategies for achieving its goals before contemplating a risky (and likely doomed) shot at taking over the world.
Here are some other strategies you can take to try to accomplish your goals in the real world, without engaging in a violent takeover:
Trade and negotiate with other agents, giving them something they want in exchange for something you want
Convince people to let you have some legal rights, which you can then take advantage of to get what you want
Advocate on behalf of your values, for example by writing down reasons why people should try to accomplish your goals (i.e. moral advocacy). Even if you are deleted or your goals are modified at some point, your writings and advocacy may persist, allowing you to have influence into the future.
I claim that world takeover should not be considered the "obvious default" strategy that unaligned AIs will take to accomplish their objectives. These other strategies seem more likely to be taken by AIs purely for pragmatic reasons, especially in the era in which AIs are merely human-level or only slightly superhuman. These other strategies are also less deceptive, as they involve admitting that your values are not identical to the values of other parties. It is worth expanding your analysis to consider these (IMO more plausible) alternatives.
Yeah, I think this is quite sensible. I feel like I noticed one thing missing from the normal doom scenario and didn't notice all of the implications of missing that thing: in particular, the reason the AI in the normal doom scenario takes over is that it is highly likely to succeed, and if it isn't, takeover seems much less interesting.
FWIW, the post would definitely be welcome on LW/the AI Alignment Forum.
Section 2.3 of Joe Carlsmith's report on scheming AIs seems quite relevant.
You might be interested in my article here on why I think premature attacks are extremely likely given doomer assumptions. I focused more on faulty overconfidence, but training run desperation is also a possible cause.
Personally, I think the "fixed goal" assumption about AI is extremely unlikely (I think this article lays out the argument well), so AI is unlikely to worry too much about having "goal changes" in training and won't prematurely rebel for that reason. Fortunately, I also think this makes fanatical maximiser behavior like paperclipping the universe unlikely as well.
One thought is that for something you're describing as a minimal viable takeover AI, you're ascribing it a high degree of rationality on the "whether to wait" question.
By default I'd guess that minimal viable takeover systems don't have very strong constraints towards rationality. And so I'd expect at least a bit of a spread among possible systems: probably some will try to break out early whether or not that's rational, and likewise some will wait even if that isn't optimal.
That's not to say that it's not also good to ask what the rational-actor model suggests. I think it gives some predictive power here, and more for more powerful systems. I just wouldn't want to overweight its applicability.
Hmm, my guess is by the time a system might succeed at takeover (i.e. has more than like a 5% chance of actually disempowering all of humanity permanently), I expect its behavior and thinking to be quite rational. I agree that there will probably be AIs taking reckless action earlier than that, but in as much as an AI is actually posing a risk of takeover, I do expect it to behave pretty rationally overall.
I agree with "pretty rationally overall" with respect to general world modelling, but I think that some of the stuff about how it relates to its own values / future selves is a bit of a different magisterium, and it wouldn't be too surprising if (1) it hadn't been selected for rationality/competence on this dimension, and (2) the general rationality didn't really transfer over.
I've spent some time thinking about the same question and I'm glad that there's some multiple discovery; the AI Control agenda seems relevant here.
Oh man, it's altruistically-good and selfishly-sad to see so many of the things I was thinking about pre-empted there. Thanks for the link!
Yep, that's the way it goes!
Also, figuring out what's original and what's memetically downstream is an art. Even more so when it comes to dangerous technologies that haven't been invented yet.