Aim for conditional pauses

TL;DR: I argue for two main theses:

  1. [Moderate-high confidence] It would be better to aim for a conditional pause, where a pause is triggered based on evaluations of model ability, rather than an unconditional pause (e.g. a blanket ban on systems more powerful than GPT-4).

  2. [Moderate confidence] It would be bad to create significant public pressure for a pause through advocacy, because this would cause relevant actors (particularly AGI labs) to spend their effort on looking good to the public, rather than doing what is actually good.

Since mine is one of the last posts of the AI Pause Debate Week, I’ve also added a section at the end with quick responses to the previous posts.

Which goals are good?

That is, ignoring tractability and just assuming that we succeed at the goal—how good would that be? There are a few options:

Full steam ahead. We try to get to AGI as fast as possible: we scale up as quickly as we can; we only spend time on safety evaluations to the extent that it doesn’t interfere with AGI-building efforts; we open source models to leverage the pool of talent not at AGI labs.

Quick take. I think this would be bad, as it would drastically increase x-risk.

Iterative deployment. We treat AGI like we would treat many other new technologies: something that could pose risks, which we should think about and mitigate, but ultimately something we should learn about through iterative deployment. The default is to deploy new AI systems, see what happens with a particular eye towards noticing harms, and then design appropriate mitigations. In addition, AI systems are deployed with rollback mechanisms, so that if a deployment causes significant harms, it can be rolled back.

Quick take. This is better than full steam ahead, because you could notice and mitigate risks before they become existential in scale, and those mitigations could continue to successfully prevent risks as capabilities improve[1].

Conditional pause. We institute regulations that say that capability improvement must pause once the AI system hits a particular threshold of riskiness, as determined by some relatively standardized evaluations, with some room for error built in. AI development can only continue once the developer has exhibited sufficient evidence that the risk will not arise.

For example, following ARC Evals, we could evaluate the ability of an org’s AI systems to autonomously replicate, and the org would be expected to pause when they reach a certain level of ability (e.g. the model can do 80% of the requisite subtasks with 80% reliability), until they can show that the associated risks won’t arise.
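As a toy illustration of what such a trigger could look like in practice (a minimal sketch; the subtask names, thresholds, and function below are hypothetical, not ARC Evals’ actual methodology):

```python
# Hypothetical sketch of a conditional pause trigger based on evaluation results.
# Subtasks and thresholds are illustrative, not ARC Evals' actual criteria.

def pause_required(subtask_success_rates: dict[str, float],
                   task_fraction_threshold: float = 0.8,
                   reliability_threshold: float = 0.8) -> bool:
    """Return True if the model reliably completes enough subtasks that
    development should pause pending a demonstration of safety."""
    passed = [rate >= reliability_threshold
              for rate in subtask_success_rates.values()]
    return sum(passed) / len(passed) >= task_fraction_threshold

# Example: measured success rates on hypothetical autonomous-replication subtasks.
rates = {
    "acquire_compute": 0.90,
    "copy_weights": 0.85,
    "earn_money": 0.40,
    "evade_monitoring": 0.95,
    "improve_own_code": 0.70,
}
print(pause_required(rates))  # False: only 3 of 5 subtasks at >= 80% reliability
```

The real work, of course, is in choosing the evaluations, thresholds, and room for error; the point is only that the condition can be made precise and auditable.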

Quick take. Of course my take would depend on the specific details of the regulations, but overall this seems much better than iterative deployment. Depending on the details, I could imagine it taking a significant bite out of overall x-risk. The main objections which I give weight to are the overhang objection (faster progress once the pause stops) and the racing objection (a pause gives other, typically less cautious actors more time to catch up and intensify or win a capabilities race), but overall these seem less bad than not stopping when a model looks like it could plausibly be very dangerous.

Unconditional temporary pause. We institute regulations that ban the development of AI models over some compute threshold (e.g. “more powerful than GPT-4”). Every year, the minimum resources necessary to destroy the world drop by 0.5 OOMs[2], and so we lower the threshold over time. Eventually AGI is built, either because we end the pause in favor of some new governance regime (that isn’t a pause), or because the compute threshold got low enough that some actor flouted the law and built AGI.
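To make the schedule concrete (a rough extrapolation that treats the 0.5 OOM/year figure as constant), a compute threshold set at C_0 today would need to fall roughly as

```latex
C(t) \approx C_0 \cdot 10^{-0.5\,t} \qquad \text{after } t \text{ years}
```

i.e. it drops by about a factor of 3 per year, and by two orders of magnitude after four years.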

Quick take. I’ll discuss this further below, but I think this is clearly inferior to a conditional pause, and it’s debatable to me whether it is better or worse than iterative deployment.

Unconditional permanent pause. At the extreme end, we could imagine a complete ban on making progress towards AGI. We aren’t just avoiding training new frontier models – we’re preventing design of new architectures or training algorithms, forecasting via scaling laws, study of existing model behaviors, improvements to chip design, etc. We’re monitoring research fields to nip any new paradigms for AGI in the bud[3].

Variant 1: This scenario starts like the unconditional temporary pause story above, but we use extreme surveillance once the compute threshold becomes very low to continue enforcing the pause.

Variant 2: In addition to this widespread pause, there is a tightly controlled and monitored government project aiming to build safe AGI.

Quick take. I don’t usually think about these scenarios, but my tentative take is that the scenario I outlined, or Variant 2 (but not Variant 1), would be the best outcome on this list. It implies a level of global coordination that would probably be sufficient to tackle other societal challenges, decreasing the importance of getting AGI soon, and we can likely obtain many of the benefits of AGI eventually by developing the relevant technologies ourselves with our human intelligence. (We could probably even get an intelligence explosion from digital people, e.g. through whole-brain emulation.) Nonetheless, I will set these scenarios aside for the rest of this post as I believe there is widespread consensus that this is not a realistic option available to us.

Why I prefer conditional over unconditional pauses

There are many significant advantages of a conditional pause, relative to an unconditional temporary pause:

  1. There is more time for safety research with access to powerful models (since you pause later, when you have more powerful models). This is a huge boost for many areas of alignment research that I’m excited about[4]. (This point has been discussed elsewhere via the “overhang” objection to an unconditional pause.)

  2. While I’m very uncertain, on balance I think it provides more serial time to do alignment research. As model capabilities improve and we get more legible evidence of AI risk, the will to pause should increase, and so the expected length of a pause should also increase[5].

  3. We get experience and practice with deploying and using more powerful models in the real world, which can be particularly helpful for handling misuse and structural risks, simply because we can observe what happens in practice rather than speculating on it from our armchairs[6].

  4. We can get much wider support for a conditional pause. Most people can get on board with the principle “if an AI system would be very dangerous, don’t build it”, and then the relevant details are about when a potential AI system should be considered plausibly very dangerous. At one end, a typical unconditional pause proposal would say “anything more powerful than GPT-4 should be considered plausibly very dangerous”. As you make the condition less restrictive and more obviously tied to harm, it provokes less and less opposition.

  5. It is easier to put in place, because it doesn’t require anyone to give up something they are already doing.

In contrast, the advantages of an unconditional pause seem relatively minor:

  1. Since it is simpler than the conditional pause, it is easier to explain.

    • This is a big advantage. While the conditional pause isn’t that much harder to explain, for public advocacy, I’d guess you get about five words, so even small differences lead to big advantages.

  2. Since it is simpler than the conditional pause, it is easier to implement correctly.

    • This doesn’t seem like a big deal: governments routinely create and enforce regulations that are way more complicated than a typical conditional pause proposal.

  3. Since there are fewer moving parts to manipulate, it is harder to game.

    • This might be a big deal in the longer run? But in the short term, I don’t expect the current leading companies to game such a regulation, because I expect they care at least a bit about safety (e.g. the Superalignment commitment is really unlikely in worlds where OpenAI doesn’t care at all, similarly for the signatures on the CAIS statement).

  4. There’s less risk that we fail to pause at the point when AI systems become x-risky.

    • We can choose a conditional pause proposal that makes this risk tiny, e.g. “pause once an agent passes 10 or more of the ARC Evals tasks”[7].

Upon reading other content from this week, it seems like maybe a crux for people is whether the pause is a stopgap or an end state. My position stays the same regardless; in either case I’d aim for a conditional pause, for the same reasons.

What advocacy is good?

I think about three main types of advocacy to achieve these goals:

  1. Targeted advocacy within AGI labs and to the AI research community

  2. Targeted advocacy to policymakers

  3. Broad advocacy to the public

    • Note that I’m only considering advocacy with the aim of producing public pressure for a pause – I’m not considering e.g. advocacy that’s meant to produce new alignment researchers, or other such theories of change.

I think there is a fairly widespread consensus that (1) and (2) are good. I’d also highlight that (1) and (2) are working. For (1) look at the list of people who signed the CAIS statement. For (2), note the existence of the UK Frontier AI Taskforce and the people on it, as well as the intent bill SB 294 in California about “responsible scaling” (a term from ARC that I think refers to a set of policies that would include a conditional pause).

I think (3) is much less clear. I’d first note that the default result of advocacy to the public is failure; it’s really hard to reach a sizable fraction of the public (which is what’s needed to produce meaningful public pressure). But let’s consider the cases in which it succeeds. In these cases, I expect a mix of advantages and disadvantages. The advantages are:

  1. Will: There would be more pressure to get things done, in a variety of places (government, AGI labs, AI research community). So, more things get done.

(I’ve only put down one advantage, but this is a huge advantage! Don’t judge the proposal solely on the number of points I wrote down.)

The disadvantages are:

  1. Lack of nuance: Popular discussion of AI x-risk would not be nuanced. For example, I expect that the ask from the public would be an unconditional pause, because a conditional one would be too nuanced.

    • Note that it is possible to have a good cop /​ bad cop routine: build public pressure with a “bad cop” that pushes goals that lack nuance, and have a “good cop” use that pressure to achieve a better, more nuanced goal[8].

    • This is also discussed as “inside game” /​ “outside game” in Holly’s post.

  2. Goodhart’s Law (labs): Companies would face strong pressure to do safety work that looks good to the public, which is different from doing work that is actually good[9]. At the extreme, existing safety teams could be replaced by compliance and PR departments, which do the minimum required to comply with regulations and keep the public sufficiently happy (see also Why Not Slow AI Progress?).

    • This is already happening a little: public communication from labs is decided in part based on the expected response from AI x-risk worriers. It will get much worse when it expands to the full public, since the public is a much more important stakeholder, and has much less nuance.

    • This seems especially bad for AI alignment research, where the best work is technical, detail-oriented, and relatively hard to explain.

  3. Goodhart’s Law (govts): Similarly, policymakers would be under a lot of pressure to create policy that looks good to the public, rather than policies that are actually good.

    • I’m very uncertain about how bad this effect is, but currently I think it could be pretty bad. It’s not hard to think of potential examples from environmentalism[10], e.g. moving from plastic bags to paper bags, banning plastic straws, emphasizing recycling, opposing nuclear power.

    • As an illustrative hypothetical example, recently there have been a lot of critiques of RLHF from the x-risk community. In a world where the public is anti-RLHF, perhaps policymakers impose very burdensome regulations on RLHF, and AGI labs turn to automated recursive self-improvement instead (which I expect would be much worse for safety).

  4. Controversy: AI x-risk would be controversial, since it’s very hard to reach a significant fraction of the public without being controversial[11] (see toxoplasma of rage, or just name major current issues with pressure from a large fraction of the public, and notice most are controversial). This seems bad[12].

    • People sometimes say controversy is inevitable, but I don’t see why this is true[13]. I haven’t seen compelling arguments for it, just raw assertions[14].

    • While I’m uncertain how negative this effect is overall, this brief analysis of cryonics and molecular nanotechnology is quite worrying; it almost reads like a forecast about the field of AI alignment.

  5. Deadlock: By making the issue highly salient and controversial, we may make governments unable to act when they otherwise would have acted (see Secret Congress).

I think that the best work on AI alignment happens at the AGI labs[15], for a variety of structural reasons[16]. As a result, I think disadvantage 2 (AGI labs Goodharting on public perception) is a really big deal and is sufficient by itself to put me overall against public advocacy with the aim of producing pressure for a pause. I am uncertain about the magnitudes of the other disadvantages, but they all appear to me to be potentially very large negative effects.

Current work on unconditional pauses

I previously intended to have a section critiquing the existing pause efforts, but recent events have made me more optimistic on this front. In particular, (1) I’m glad to see a protest against Meta (previous protests were at the most safety-conscious labs, differentially disadvantaging them), (2) I’m glad to see Holly acknowledge objections like overhangs (many of my previous experiences with unconditional pause advocates involved much more soldier mindset). I still don’t trust existing unconditional pause advocates to choose good strategies, or to try to come to accurate beliefs, and this still makes me meaningfully more uncomfortable about public advocacy than the prior section alone would, but I feel more optimistic about it than before.

Miscellaneous points

A few points that I think are less important than the others in this post:

  1. I have heard anecdotal stories (e.g. FINRA) that good regulation often comes from codifying existing practices at private companies. I’m currently more excited about this path, though I have not done even a shallow dive into the topic and so am very uncertain.

  2. I’m particularly worried about differentially disadvantaging safety-conscious AGI labs (including e.g. through national regulations that fail to constrain international labs).

  3. I think it’s not very tractable to get an unconditional pause, but I’m not particularly confident in that and it is not a crux for me[17].

  4. I suspect that, relative to most of my audience, I’m pretty optimistic that enforcing significant regulations is feasible, and that it is mostly a question of whether we have the political will to do so.

Having now read the other posts this week, I wish I had written a slightly different post. In particular, I was not expecting nearly this much anti-moratorium sentiment; I would have also focused on arguing for a conditional pause rather than no pause. I also didn’t expect alignment optimism to be such a common theme. Rather than rewrite the post (which takes a lot of time), I decided to instead add some commentary on each of the other posts.

What’s in a Pause? and How could a moratorium fail?: I think the idea here is that we should eventually aim for an international regulatory regime that is not a pause, but a significant chunk of x-risk from misaligned AI is in the near future, so we should enact an unconditional pause right now. If so, my main disagreement is that I think the pause we enact right now should be conditional: specifically, I think it’s important that you evaluate the safety of a model after you train it, not before[18]. I may also disagree (perhaps controversially) that a significant chunk of x-risk from misaligned AI is in the near future, depending on what “near future” means.

AI Pause Will Likely Backfire: I weakly agree with the part of this post that argues that an unconditional pause would be bad, due to (1) overhangs /​ fast takeoff and (2) advantaging less cautious actors (given that a realistic pause would not be perfect). I think these don’t bite nearly as hard for conditional pauses, since they occur in the future when progress will be slower[19], they are meant to be temporary to give us time to improve our mitigations, and they are easier to build a broad base of support for (and so are easier to enforce across all relevant actors).

I have several disagreements with the rest, which argues for optimism about alignment, but I also don’t understand why this matters. My opinions on pausing wouldn’t change if I became significantly more optimistic or pessimistic about alignment, because the decision-relevant quantity is whether pausing lets you do better than you otherwise would have done[20].

Policy ideas for mitigating AI risk: I mostly agree with the outline of the strategic landscape, establishing the need to prevent catastrophe-capable AI systems from being built. Thomas then provides some concrete policy proposals, involving visibility into AI development and brakes to slow it down. These seem fine, but personally I think we should be more ambitious and aim to have a conditional pause[21].

How to think about slowing AI: I agree with this post; I think it says very similar things to what I say in the “Which goals are good?” section.

Comments on Manheim’s “What’s in a Pause?”: There isn’t a central point in this post, so I will mostly not respond to it. The one thing I’ll note is that a lot of the post is dependent on the premise that if smarter-than-human AI is developed in the near future, then we almost surely die, regardless of who builds it. I strongly disagree with this premise, and so disagree with many of the implications drawn (e.g. that it approximately doesn’t matter if you differentially advantage the least responsible actors).

The possibility of an indefinite AI pause: I had two main comments on this post:

  1. I agree that you eventually need a “global police state” to keep an AI pause going indefinitely if you allow AI research and hardware improvements to continue. But you could instead ban those things as well, which seems a lot easier to do than to build a global police state.

  2. I don’t think it’s reasonable to worry that an indefinite AI pause would be an x-risk by stalling technological progress, given that there are many other non-AI technologies that can enable continued progress (human augmentation and whole brain emulation come to mind).

I did agree with Matthew’s point that the existing evidence for an AI catastrophe isn’t sufficient to justify creating a global police state.

The Case for AI Safety Advocacy to the Public: I agree with this post that advocacy can work for AI x-risk, and that the resulting public pressure could lead to more things being done[22]. I agree that conditional on advocacy to the public, you likely want the message to be some flavor of pause, since other messages would require too much nuance. I’m on board with shifting the burden of proof onto companies to show that their product is safe (but I’m not convinced that this implies a pause).

I agree that the inside-outside game dynamic (or good cop /​ bad cop as I call it) has worked in previous cases of advocacy to the public. However, I expect it only works in cases where the desirable policy is relatively clear (and not nuanced), unlike AI alignment, so I’m overall against.

A couple of points where I had strong disagreements:

  1. ““Pause AI” is a simple and clear ask that is hard to misinterpret in a harmful way” – this is drastically underestimating either people’s ability to misinterpret messages, or how much harm misinterpretations can cause.

  2. “I predict that advocacy activities could be a big morale boost” – I experience existing advocacy efforts as very demoralizing: people make wildly overconfident contrarian claims that I think are incorrect, they get lots of attention because of how contrarian the takes are, and they use this attention to talk about how the work I do isn’t helping.

AI is centralizing by default; let’s not make it worse: Half of this post is about how AI is easier to control than humans. I agree with the positive arguments outlining advantages we have in controlling AIs. But when addressing counterarguments, Quintin fails to bring up the one that seems most obvious and important to me, that AIs will become much more capable and intelligent than humans, which could make controlling them difficult. Personally, I think it’s still unclear whether AI will be easier or harder to control than humans.

The other half of the post is about centralization effects of AI, warning against extreme levels of centralization. I didn’t get a good picture about what extreme levels of centralization look like and why they are x-risky (are we talking totalitarianism?). I’m not sure whether the post is recommending against pauses because of centralizing effects, but if it is I probably disagree because I expect the other effects of a pause would be significantly more important.

We are not alone: many communities want to stop Big Tech from scaling unsafe AI: I agree with the title, but I think the mitigations that other communities would want would be quite different from the ones that would make sense for alignment.

  1. ^

    One might say that iterative deployment is no better than full steam ahead because mitigations will fail to generalize to higher capability levels, either because x-risks are qualitatively different or because mitigations tend to break once you exceed human-level. I disagree, but I won’t defend that here.

  2. ^

    Epoch’s trends dashboard currently estimates that the required compute drops by 2.5x = 0.4 OOM (order of magnitude) per year due to algorithmic progress, and the required money per unit compute drops by 1.32x = 0.1 OOM per year due to hardware progress, for a combined estimate of 3.3x = 0.5 OOM per year.
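    As a quick sanity check on the arithmetic (an OOM is a factor of 10, so factors multiply while OOMs add):

    ```latex
    \log_{10}(2.5) \approx 0.40, \qquad
    \log_{10}(1.32) \approx 0.12 \approx 0.1, \qquad
    \log_{10}(2.5 \times 1.32) = \log_{10}(3.3) \approx 0.5
    ```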

  3. ^

    Some people think of this as requiring extreme surveillance, but I don’t think that’s necessary: ultimately you need to prevent algorithmic progress, compute progress, and usage of large clusters of compute for big training runs; I expect you can get really far on (1) and (2) by banning research on AI and on compute efficiency, enforced by monitoring large groups of researchers (academia, industry, nonprofits, etc), as well as monitoring published research (in particular, we do not enforce it by perpetually spying on all individuals), and on (3) by monitoring large compute clusters. This doesn’t defend against situations where a small group of individuals can build AGI in their basement – but I assign very low probability to that scenario. It’s only once you can build AGI with small amounts of compute and publicly available techniques that you need extreme surveillance to significantly reduce the chance that AGI is built.

  4. ^

    I initially had an exception here for mechanistic interpretability: most current work isn’t done on the largest models, since the core problems to be solved (e.g. feature discovery, superposition, circuit identification) can be easily exhibited with small models. However, on reflection I think even mechanistic interpretability gets significant benefits. In particular, there is a decent chance that there are significant architecture changes on the way to powerful models; it is much better for mechanistic interpretability to know that earlier rather than later. See for example the fairly big differences in mechanistic interpretability for Transformers vs. CNNs.

  5. ^

    This is a bit subtle. I expect pauses to be longer in worlds where we have unconditional pauses at GPT-4 than in worlds where we have conditional pauses, because having an unconditional pause at GPT-4 is evidence of the world at large caring a lot about AI risk, and/​or being more conservative by default. However, these are evidential effects, and for decision-making we should ignore these. The more relevant point is: if you take a world where we “could have” implemented an unconditional pause at GPT-4, hold fixed the level of motivation to reduce AI risk, and instead implement a conditional pause that kicks in later, probably the conditional pause will last longer because the potential harms for GPT-5 will galvanize even more support than existed at the time of GPT-4.

  6. ^

    There are also safety techniques like anomaly detection on model usage that benefit from real-world usage data.

  7. ^

    This isn’t a great conditional pause proposal; we can and should choose better evaluations. My point is just that this is a concrete proposal that is more closely tied to danger than the unconditional pause proposal, while still having a really low chance of failing to pause before AI systems become x-risky, unless you believe in really fast takeoff.

  8. ^

    However, typically the “good cops” are still, well, cops: they are separate organizations with a mission to hold a responsible party to account. The Humane League is a separate organization that negotiates with Big Ag, not an animal welfare department within a Big Ag company. In contrast, with AI x-risk, many of the AGI labs have x-risk focused departments. It’s unclear whether this would continue in a good cop /​ bad cop routine.

  9. ^

    Both because the public understanding of AI x-risk will lack nuance and so will call for work that is less good, and because companies will want to protect their intellectual property (IP) and so would push for safety work on topics that can be more easily talked about.

  10. ^

    Note I don’t know much about environmentalism and haven’t vetted these examples; I wouldn’t be surprised if a couple turned out to be misleading or wrong.

  11. ^

    You can also get public pressure from small but vocal and active special interest groups. I expect this looks more like using the members of such groups to carry out targeted advocacy to policymakers, so I’d categorize this as (2) in my list of types of advocacy, and I’m broadly in support of it.

  12. ^

    See for example How valuable is movement growth?, though it is unfortunately not AI-specific.

  13. ^

    A few caveats: (a) Certainly some aspects of AI will be controversial, simply because AI has applications to already controversial areas, e.g. military uses. I’m saying that AI x-risk in particular doesn’t seem like it has to be controversial. (b) My prediction is that AI x-risk will be controversial, but the main reason is that parts of the AI x-risk community are intent on making strong, poorly-backed, controversial claims and then calling for huge changes, instead of limiting themselves to the things that most people would find reasonable. In this piece, since I’m writing to that audience to influence their actions, I’m imagining that they stop doing those things, as I wish they would – if that were to happen, it seems quite plausible to me that AI x-risk wouldn’t become controversial.

  14. ^

    For example, Holly’s post: “AI is going to become politicized whether we get involved in it or not”.

  15. ^

    This is a controversial view, but I’d guess it’s a majority opinion amongst AI alignment researchers.

  16. ^

    Reasons include: access to the best alignment talent, availability of state of the art models, ability to work with AI systems deployed at scale, access to compute, ability to leverage existing engineering efforts, access to company plans and other confidential IP, access to advice from AI researchers at the frontier, job security, mature organizational processes, etc. Most of these reasons are fundamentally tied to the AGI labs and can’t easily be ported to EA nonprofits if the AGI labs start to Goodhart on public perception. Note there are countervailing considerations as well, e.g. the AGI labs have more organizational politics.

  17. ^

    As one piece of evidence, even a successful case like NEPA for environmentalism may have been the result of luck.

  18. ^

    If you require safety to be shown before you train the model, you are either imposing an unconditional pause, or you are getting the AGI companies to lie to you, both of which seem bad.

  19. ^

    We’re currently able to scale up investment in AI massively, because it was so low to begin with, but eventually we’ll run out of room to scale up investment.

  20. ^

    This does break down at extremes – if you are very confident that alignment will work out fine, then you might care more about getting the benefits of AGI sooner; if you are extremely confident that alignment won’t work, then you may want to aim for an unconditional permanent pause even though it isn’t tractable, and view an unconditional temporary pause as good progress.

  21. ^

    The key difference is that a conditional pause proposal would not involve “emergency” powers: the pause would be a default, expected result of capabilities scaling as models gain dangerous capabilities – until the lab can also demonstrate mitigations that ensure the models will not use those dangerous capabilities.

  22. ^

    Though I don’t trust polls as much as Holly seems to. For example, 32% of Americans say that animals should be given the same rights as people – but even self-reported veg*nism rates are much lower than that.