Thoughts on the OpenAI alignment plan: will AI research assistants be net-positive for AI existential risk?

Note: I really appreciate the work that the OpenAI alignment team put into their alignment plan writeup and related posts, especially Jan Leike, the leader of that team. I believe open discussions about alignment approaches make it more likely that the whole community will be able to find flaws in their own plans and unappreciated insights, resulting in better alignment plans over time.

Summary: OpenAI’s alignment plan acknowledges several key challenges of aligning powerful AGI systems, and proposes several good ideas. However, the plan fails to sufficiently address:

The dual-use nature of AI research assistants and the high risk that such assistants will improve capabilities more than alignment research in ways that net-increase AI existential risk.
The likely challenges involved in both generating and evaluating AI alignment research using AI research assistants. It seems plausible that generating key insights about the alignment problem will not be possible before the development of dangerously powerful AGI systems.
The nature and difficulty of the alignment problem. There are substantial reasons why AI systems that pass all tests in development may not stay safe once able to act in the world. There are substantial risks from goal misgeneralization, including deceptive misalignment, made worse by potential rapid increases in capabilities that are hard to predict. Any good alignment plan should address these problems, especially since many of them may not be visible until an AI system already has dangerous capabilities.

The dual-use nature of AI research assistants and whether these systems will differentially improve capabilities and net-increase existential risk

There has been disagreement in the past about whether “alignment” and “capabilities” research are a dichotomy. Jan Leike has claimed that they are not always dichotomous, and that this matters because many capabilities insights will be useful for alignment, so the picture is not as worrisome as a dichotomous picture might make it seem.

I agree with Jan that alignment and capabilities research are not dichotomous, but in a way that I think actually makes the problem worse, not better. Yes, it’s probable that some AI capabilities could help solve the alignment problem. However, the general problem is that unaligned AGI systems are far easier to build: they’re a far more natural thing to emerge from a powerful deep learning system than an aligned AGI system. So even though there may be deep learning capabilities that can help solve the alignment problem, most of these capabilities are still more easily applied to making *any* AGI system, and most such systems are likely to be unaligned even when we’re trying really hard.^[1]

Let’s look at AI research assistants in particular. I say “AI research assistant” rather than “alignment research assistant” because I think it’s highly unlikely that we will find a way to build an assistant that is useful for alignment research but not useful for AI research in general. Let’s say OpenAI is able to train an AI research assistant that can help the alignment team tackle some difficult problems in interpretability. That’s great! However, the question is: can that model also help speed up AGI development at the rest of the company? If so, by how much? And will it be used to do so?

Given that building an aligned AGI is likely much harder than building an unaligned AGI system, it would be quite surprising if an AI research assistant helped AGI safety research more than it helped AGI development research more broadly. Of course, it’s possible that a research tool that sped up capabilities research more than alignment research could still be net positive. It depends on the nature and difficulties of the subproblems of alignment. From where I stand, I have a hard time imagining how this tool could come out positive. Nate Soares makes the point that there may be necessary pieces of alignment research that are very hard to speed up in terms of calendar time, even if other necessary parts can be sped up.

If it’s true both that AI research assistants would substantially help with alignment and that they would, by default, push capabilities into unsafe domains before the alignment problem is solved, there are two potential paths:

Try to build an AI system that’s *differentially* better for alignment research compared to general capabilities research
Build one that’s differentially better for capabilities research, but only use it for alignment research. This could be good if executed well, but it would be the kind of plan that would take significant security and operational competence. If that is explicitly the plan, I’d like that to be stated. Otherwise, it seems likely that product incentives will push companies to use their powerful research tools for speeding up everything, not just alignment work.

Jan is not unaware of the dual use / differential-speeding-up-capabilities problem, but his response is confusing. In a March 2022 post, An MVP for alignment, Jan writes:

One of the main downsides of this approach is that it’s plausible that nearby in design space to an alignment MVP is a system that accelerates AI progress faster than alignment progress. In practice, most of the time spent on empirical alignment research is similar to time spent on ML research. This could mean that by the time our system makes significant alignment research contributions, ML research itself is already getting automated. It seems very likely to me that the future of ML research will end up looking that way anyway, and that this is mostly bottlenecked by the capabilities of our models. If that’s true, then working on an alignment MVP wouldn’t impact the overall rate of AI progress.

His answer to “But won’t an AI research assistant speed up AI progress more than alignment progress?” seems to be “yes it might, but that’s going to happen anyway so it’s fine”, without addressing what makes this fine at all. Sure, if we already have AI research assistants that are greatly pushing forward AI progress, we might as well try to use them for alignment. I don’t disagree there, but this is a strange response to the concern that the very tool OpenAI plans to use for alignment may hurt us more than help us.

I suppose Jan probably believes that even though AI research assistants will speed up AGI progress, they’ll somehow help us solve the alignment problem before anyone deploys unaligned AGI. If his views are consistent, then he must believe that the alignment tax is very small or that OpenAI will have a large lead, plus the operational adequacy to prevent the org from developing AGI until an adequate alignment approach is discovered and implemented. This seems surprisingly optimistic.

The challenges involved in both generating and evaluating AI alignment research using AI research assistants

I broadly agree that using AI research assistants for alignment research is a good idea, so long as the risks from the development and use of such assistants are carefully managed. However, relying on AI research assistants as the primary driver of alignment research seems insufficient. In the plan, Jan claims that “evaluating alignment research is substantially easier than producing it, especially when provided with evaluation assistance.” This might be true, but at present, evaluating alignment research is difficult.

Experts in the field do not seem to agree on which alignment approaches proposed so far, or which research undertaken so far, have proved most useful or might prove most useful in the future.^[2] There’s been lots of disagreement on which approaches are promising to date, e.g. on the value of RLHF to alignment, or whether ELK or similar alignment approaches are trying to tackle the right parts of the problem. I claim these disagreements are very common. Below, I list some examples of alignment research that some people consider to be significant. I expect different alignment researchers reading this list will have pretty different ideas about which research is promising (as you read them, consider checking which you think are promising, how confident you are, and how you predict others might agree or disagree):

Theoretical research

Corrigibility—https://intelligence.org/files/Corrigibility.pdf
Risks from learned optimization—https://arxiv.org/abs/1906.01820
Iterated Amplification—https://www.alignmentforum.org/s/EmDuGeRw749sD3GKd
Logical induction—https://arxiv.org/abs/1609.03543
The shard theory of human values—https://www.alignmentforum.org/posts/iCfdcxiyr2Kj8m8mT/the-shard-theory-of-human-values

Empirical research

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback—https://arxiv.org/abs/2204.05862
Improving alignment of dialogue agents via targeted human judgements—https://arxiv.org/abs/2209.14375
Aligning language models to follow instructions—https://openai.com/research/instruction-following
AI-written critiques help humans notice flaws—https://openai.com/research/critiques
Solving math word problems with process- and outcome-based feedback—https://arxiv.org/abs/2211.14275

I think one way that the OpenAI alignment team could improve their plan is to discuss which alignment research directions they expect to be broadly useful and amenable to help from AI research assistants, and how they might evaluate the resulting research. (I liked this write-up by John Wentworth on which parts of this problem might be challenging.) For example, I expect mechanistic interpretability research to be important, and it seems plausible to me that AI research assistants could help make big advances in interpretability. Advances here seem easier to evaluate than advances studying corrigibility, since many of the problems with the corrigibility of AI systems are likely to arise once these systems already possess fairly dangerous capabilities.

Another reason I am concerned about a plan that relies heavily on AI research assistant tools is that I expect there will not be much time between “AI systems are powerful enough to help with pre-paradigmatic fields of science” and “Superhuman AI systems exist”.

I think there is a pretty strong argument for the above claim. Solving the AI alignment problem likely involves reasoning well about the nature of human goals and values, understanding complex / frontier ML engineering domains, and generally learning how to navigate an adversarial optimization process. While some subproblems can probably be understood without all this general understanding, it seems like each of these pieces would be necessary to actually build aligned AGI systems. An AI research assistant that can help with all of these pieces seems very close to an AGI system itself, since these domains are quite rich and different from one another.

The nature and difficulty of the alignment problem

I think that the OpenAI alignment team is familiar with the core arguments about AI alignment failures, including deceptive misalignment concerns. Given this, I do not understand why they do not spend more time in their alignment plan laying out how they seek to address many of the extremely thorny problems they will inevitably face.

This worries me. I think it’s quite plausible that OpenAI will be able to solve most of the “alignment” problems that crop up in language models in a few years, such that systems appear aligned. I worry that in such a situation, OpenAI researchers would declare mission accomplished and scale up their systems to AGI, at which point we all die from a system that appeared aligned but was in fact not. This is an obvious failure mode, because aligning present systems likely does not require solving extremely thorny problems like detecting whether your system is deceptively misaligned.

Basically, the plan does not explain how their empirical alignment approach will avoid lethal failures. The plan says: “Our main goal is to push current alignment ideas as far as possible, and to understand and document precisely how they can succeed or why they will fail.”

But it doesn’t get into how the team will evaluate when plans have failed. Crucially, the plan doesn’t discuss most of the key hypothesized lethal failure modes,^[3] where apparently-but-not-actually-aligned models seize power once able to. This seems important to include, especially if the alignment plan is based on an empirical approach aimed at detecting where alignment approaches succeed or fail, because it’s quite plausible that empirical detection of these failures will be extremely difficult!

I think such failures are empirically detectable with the right kind of interpretability tools, tools we have not yet developed. I would feel a lot better about the OpenAI plan if they said “hey, we realize that we currently don’t have tools to detect failures like these, but one of the main things we’re planning to do with the help of our AI research assistants is to develop interpretability tools that will help detect deceptive misalignment and in general help us understand goal misgeneralization much better.”

The OpenAI alignment plan includes the claim that the Natural Language API “is a very useful environment for [their] alignment research” because “It provides [them] with a rich feedback loop about how well [their] alignment techniques actually work in the real world”.

The problem here is that observing and training out the behavioral failure modes of language models today seems likely to have little relevance to aligning superhuman systems. The relevance is not zero, but this is probably not going to help with many of the core problems.^[4] We already know we can train models to exhibit various kinds of honest or helpful behavior. There is some value in showing how hard or easy it is, and how robust these behavioral patterns are, but it doesn’t solve the underlying hard problem of “how do you know that an AI system is actually aligned with you.”

What I like about OpenAI’s alignment plan

This post started with my high level take, which included a lot of criticisms. There are also a bunch of things about the alignment plan post that I quite liked. So in no particular order:

I liked that the plan describes the high level approach to aligning AGI (not just current models). Namely, “building and aligning a system that can make faster and better alignment research progress than humans can.” Even though I don’t think this is a good plan, I like that they spelled out what their high-level approach is!

In general, I think the plan did a good job pointing at a bunch of true things. I like that they explicitly named existential risks from AI systems as a concern. Any alignment plan needs to start here, and they started here. In the same vein, the plan did a good job recognizing that current AI systems like InstructGPT are far from aligned. They also recognized that OpenAI is currently lacking in interpretability researchers, and that this might be important for their plan.

As part of their core alignment approach, they mention: “evaluating alignment research is substantially easier than producing it”, which seems true, even if evaluation is still quite difficult (as I argue above). It still seems true that if you could produce an AI system that did nothing but output many alignment plans, some of which were truly good, that would be a huge step forward in alignment. And assuming it wasn’t already vastly smarter than humans (if it were, it could probably just convince you to release it right away), this would probably be largely safety positive.

They also acknowledged limitations of their plan & anticipated ways that people would disagree with them, e.g. they acknowledged that there are fundamental limitations in how well humans can provide training signals for trying to train aligned AI systems. They also acknowledged that major discontinuities between aligning current systems and AGI systems would make the approach substantially less likely to work. They also acknowledged that “the least capable models that can help with alignment research might already be too dangerous if not properly aligned”.

I also want to acknowledge that I didn’t address every counterpoint to the common arguments against the OpenAI approach that Jan includes here. I appreciated that Jan wrote these out and I would like to respond to them, but I ran out of time this week. I might add comments on these as follow-ups in the comments later.

The OpenAI alignment plan is outlined in this post, and Jan Leike adds more context to his thinking about this plan and related ideas on his blog. If you want to dive deep into the plan, I recommend reading not just the original post but also the rest of these. Jan addresses a bunch of points not present in the alignment plan post and also responds to common criticisms:

Thanks AW for helping me review this post! Mistakes are all his.

^
An additional and likely even worse problem is that there are likely many necessary alignment insights / discoveries you would need to find before building aligned AGI that would also speed up or ease general AGI engineering research. Some kinds of alignment research can therefore make us less safe if these insights spread. Thus both capabilities-insights-that-help-with-alignment and alignment-insights-that-help-us-with-capabilities would differentially help create unsafe AGI compared to safe AGI.
^
Jan acknowledges that this is true; see this quote from this post: “One of the problems that conceptual alignment work has is that it’s unclear when progress is being made and by how much. The best proxy is “do other researchers think progress is being made” and that’s pretty flawed: the alignment research community largely disagrees about whether any conceptual piece constitutes real progress.”
But his answer, “iterate on real systems”, fails to address many of the core concerns that conceptual alignment is trying to address! You can’t just say “well conceptual research is hard to evaluate so we’ll just skip that part and do the easy thing” if the easy thing won’t actually solve the problems you need to solve!
^
Relevant piece from Yudkowsky’s List of Lethalities: “#3 We can gather all sorts of information beforehand from less powerful systems that will not kill us if we screw up operating them; but once we are running more powerful systems, we can no longer update on sufficiently catastrophic errors. This is where practically all of the real lethality comes from, that we have to get things right on the first sufficiently-critical try.”
^
In particular, List of Lethalities points 10, 17, and 20 seem especially lethal for OpenAI’s current approach:

“10 You can’t train alignment by running lethally dangerous cognitions, observing whether the outputs kill or deceive or corrupt the operators, assigning a loss, and doing supervised learning. On anything like the standard ML paradigm, you would need to somehow generalize optimization-for-alignment you did in safe conditions, across a big distributional shift to dangerous conditions… Powerful AGIs doing dangerous things that will kill you if misaligned, must have an alignment property that generalized far out-of-distribution from safer building/training operations that didn’t kill you. This is where a huge amount of lethality comes from on anything remotely resembling the present paradigm. Unaligned operation at a dangerous level of intelligence*capability will kill you; so, if you’re starting with an unaligned system and labeling outputs in order to get it to learn alignment, the training regime or building regime must be operating at some lower level of intelligence*capability that is passively safe, where its currently-unaligned operation does not pose any threat. (Note that anything substantially smarter than you poses a threat given any realistic level of capability. Eg, “being able to produce outputs that humans look at” is probably sufficient for a generally much-smarter-than-human AGI to navigate its way out of the causal systems that are humans, especially in the real world where somebody trained the system on terabytes of Internet text, rather than somehow keeping it ignorant of the latent causes of its source code and training environments.)”
“17 More generally, a superproblem of ‘outer optimization doesn’t produce inner alignment’ is that on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they’re there, rather than just observable outer ones you can run a loss function over. This is a problem when you’re trying to generalize out of the original training distribution, because, eg, the outer behaviors you see could have been produced by an inner-misaligned system that is deliberately producing outer behaviors that will fool you. We don’t know how to get any bits of information into the inner system rather than the outer behaviors, in any systematic or general way, on the current optimization paradigm.”
“20. Human operators are fallible, breakable, and manipulable. Human raters make systematic errors—regular, compactly describable, predictable errors. To faithfully learn a function from ‘human feedback’ is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we’d hoped to transfer). If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them. It’s a fact about the territory, not the map—about the environment, not the optimizer—that the best predictive explanation for human answers is one that predicts the systematic errors in our responses, and therefore is a psychological concept that correctly predicts the higher scores that would be assigned to human-error-producing cases.”