Defining alignment research
I think that the concept of “alignment research” (and the distinction between that and “capabilities research”) is currently a fairly confused one. In this post I’ll describe some of the problems with how people typically think about these terms, and offer replacement definitions.
“Alignment” and “capabilities” are primarily properties of AIs, not of AI research
The first thing to highlight is that the distinction between alignment and capabilities is primarily doing useful work when we think of them as properties of AIs. This distinction is still under-appreciated by the wider machine learning community. ML researchers have historically thought about performance of models almost entirely with respect to the tasks they were specifically trained on. However, the rise of LLMs has vindicated the alignment community’s focus on general capabilities, and now it’s much more common to assume that performance on many tasks (including out-of-distribution tasks) will improve roughly in parallel. This is a crucial assumption for thinking about risks from AGI.
Insofar as the ML community has thought about alignment, it has mostly focused on aligning models’ behavior to their training objectives. The possibility of neural networks aiming to achieve internally-represented goals is still not very widely understood, making it hard to discuss and study the reasons those goals might or might not be aligned with the values of (any given set of) humans.
However, extending “alignment” and “capabilities” from properties of AIs to properties of different types of research is a fraught endeavor. It’s tempting to categorize work as alignment research to the extent that it can be used to make AIs more aligned (to many possible targets), and as capabilities research to the extent that it can be used to make AIs more capable. But this approach runs into (at least) three major problems.
Firstly, in general it’s very difficult to categorize research by its impacts. Great research often links together ideas from many different subfields, typically in ways that only become apparent throughout the course of the research. We see this in many historical breakthroughs which shed light on a range of different domains. For example, early physicists studying the motions of the stars eventually derived laws governing all earthly objects. Meanwhile Darwin’s study of barnacles and finches led him to principles governing the evolution of all life. Analogously, we should expect that big breakthroughs in our understanding of neural networks and deep learning would be useful in many different ways.
More concretely, there are many cases where research done under the banner of alignment has advanced, or plausibly will advance, AI capabilities to a significant extent. This undermines our ability to categorize research by its impacts. Central examples include:
RLHF makes language models more obedient, but also more capable of coherently carrying out tasks.
Scalable oversight techniques can catch misbehavior, but will likely become important for generating high-quality synthetic training data, as it becomes more and more difficult for unassisted humans to label AI outputs correctly. E.g. this paper finds that “LLM critics can successfully identify hundreds of errors in ChatGPT training data rated as ‘flawless’”.
Interpretability techniques will allow us both to inspect AIs’ cognition and to extract more capable behavior from them (e.g. via activation steering).
Techniques for Eliciting Latent Knowledge will plausibly be important for allowing AIs to make better use of implicit knowledge (e.g. knowledge about protein folding that’s currently hidden inside AlphaFold’s weights).
MIRI thought that their agent foundations research could potentially advance AI capabilities, which motivated their secrecy about it.
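To make the dual-use point concrete, here is a minimal toy sketch of activation steering, one of the interpretability-adjacent techniques mentioned above. Everything in it is a stand-in: the two-layer “model”, the intervention point, and the “concept direction” are illustrative assumptions, not any real model or library API. The same handle that lets us inspect internal activations also lets us change behavior without retraining.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy two-layer "model": input -> hidden activations -> logits.
# (Hypothetical stand-in for a real network.)
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))

def forward(x, steering=None):
    """Run the toy model, optionally adding a steering vector to the hidden layer."""
    hidden = np.tanh(x @ W1)
    if steering is not None:
        hidden = hidden + steering  # intervene directly on internal activations
    return hidden @ W2

x = rng.normal(size=(4,))

# A hypothetical "concept direction" in hidden space; in practice such
# directions are estimated from contrasts between activations on different inputs.
direction = rng.normal(size=(8,))
direction /= np.linalg.norm(direction)

baseline = forward(x)
steered = forward(x, steering=2.0 * direction)

# Inspecting hidden states (interpretability) and shifting them (steering)
# use the same access to the model's internals.
print(baseline, steered)
```

The point of the sketch is only that read access and write access to internal activations come bundled together, which is why interpretability progress can cash out as capability gains.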
Secondly, not only is it difficult to predict the effects of any given piece of research, it’s also difficult for alignment researchers to agree on which effects are good or bad. This is because there are deep disagreements in the field about the likelihoods of different threat models. The more difficult you think the alignment problem is, the more likely you are to consider most existing research useless or actively harmful (as Yudkowsky does). By contrast, Christiano has written in defense of research developing RLHF and language model agents; and many alignment researchers who are more closely linked to mainstream ML have even broader views of what research is valuable from an alignment perspective.
Thirdly, for people concerned about AGI, most ML research should count as neither alignment nor capabilities—because it focuses on improving model performance in relatively narrow domains, in a way that is unlikely to generalize very far. I’ll call this type of research applications research. The ubiquity of applications research (e.g. across the many companies that are primarily focused on building ML products) makes some statistics that have been thrown around about the relative numbers of alignment researchers versus capabilities researchers (e.g. here, here) very misleading.
What types of research are valuable for preventing misalignment?
If we’d like to help prevent existential risk from misaligned AGI, but can’t categorize research on a case-by-case basis, we’ll need to fall back on higher-level principles about which research is beneficial. Specifically, I’ll defend two traits which I think should be our main criteria for prioritizing research from an alignment-focused perspective:
Valuable property 1: worst-case focus
Most ML research focuses on improving the average performance of models (whether on a narrow set of tasks or a broad range of tasks). By contrast, alignment researchers are primarily interested in preventing models’ worst-case misbehavior, which may arise very rarely (and primarily in situations where models expect it won’t be detected). While there’s sometimes overlap between work on improving average-case behavior and work on improving worst-case behavior, in general we should expect them to look fairly different.
We see a similar dynamic play out in cybersecurity (as highlighted by Yudkowsky’s writing on security mindset). In an ideal world, we could classify most software engineering work as cybersecurity work, because security would be built into the design by default. But in practice, creating highly secure systems requires a different skillset from regular software engineering, and typically doesn’t happen unless it’s some team’s main priority. Similarly, even if highly principled capabilities research would in theory help address alignment problems, in practice there’s a lot of pressure to trade off worst-case performance for average-case performance.
These pressures are exacerbated by the difficulty of addressing worst-case misbehavior even in current models. Its rarity makes it hard to characterize or study. Adversarial methods (like red-teaming) can find some examples, but these methods are bottlenecked by the capabilities of the adversary. A more principled approach would involve formally verifying safety properties, but formally specifying or verifying non-trivial properties would require significant research breakthroughs. The extent to which techniques for eliciting and addressing worst-case misbehavior of existing models will be helpful for more capable models is an open question.
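The gap between the two evaluation targets can be made concrete. Below is a minimal sketch in which the scoring function and the random-search “adversary” are purely illustrative assumptions, not a real red-teaming pipeline: average-case evaluation samples typical inputs and reports the mean, while worst-case evaluation actively searches for the lowest-scoring input.

```python
import numpy as np

rng = np.random.default_rng(1)

def model_score(x):
    """Toy stand-in for 'how well the model behaves on input x' (higher is better).
    It behaves well on average but has a rare, narrow failure region near x = 3."""
    return 1.0 - 2.0 * np.exp(-((x - 3.0) ** 2) / 0.01)

# Average-case evaluation: sample typical inputs, report the mean.
typical_inputs = rng.normal(loc=0.0, scale=1.0, size=1000)
average_case = model_score(typical_inputs).mean()

# Worst-case evaluation: adversarially search for the lowest-scoring input.
# (Random search here; real red-teaming uses stronger adversaries, and is
# bottlenecked by how strong that adversary is.)
candidates = rng.uniform(-5.0, 5.0, size=100_000)
worst_case = model_score(candidates).min()

print(f"average-case score: {average_case:.3f}")
print(f"worst-case score:   {worst_case:.3f}")
```

On this toy example the average-case score looks near-perfect while the worst case is catastrophic, which is exactly the regime where optimizing the mean tells you almost nothing about the failures an adversary (or a deployed model in a rare situation) can find.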
Valuable property 2: scientific approach
Here’s another framing: a core barrier to aligning AGI is that we don’t understand neural networks well enough to say many meaningful things about how they function. So we should support research that helps us understand deep learning in a principled way. We can view this as a distinction between science and engineering: engineering aims primarily to make things work, science aims primarily to understand how they work. (This is related to Nate Soares’ point in this post.)
Thinking of AGI alignment as being driven by fundamental science highlights that big breakthroughs are likely to be relatively simple and easy to recognize—more like new theories of physics that make precise, powerful predictions than a big complicated codebase that we need to scrutinize line-by-line. This makes me optimistic about automating alignment research in a way that humans can verify.
However, “trying to scientifically understand deep learning” is too broad a criterion to serve as a proxy for whether research will be valuable from an alignment perspective. For example, I expect that most work on scientifically understanding optimizers will primarily be useful for designing better optimizers, rather than understanding the models that result from the optimization process. So can we be more precise about what aspect of a “scientific approach” is valuable for alignment? My contention: a good proxy is the extent to which the research focuses on understanding cognition rather than behavior—i.e. the extent to which it takes a cognitivist approach rather than a behaviorist approach.
Some background on this terminology: the distinction between behaviorism and cognitivism comes from the history of psychology. In the mid-1900s, behaviorists held that the internal mental states of humans and animals couldn’t be studied scientifically, and therefore that the only scientifically meaningful approach was to focus on describing patterns of behavior. The influential behaviorist B. F. Skinner experimented with rewarding and punishing animal behavior, which eventually led to the modern field of reinforcement learning. However, the philosophical commitments of behaviorism became increasingly untenable. In the field of ethology, which studies animal behavior, researchers like Jane Goodall and Frans de Waal uncovered sophisticated behaviors inconsistent with viewing animals as pure reinforcement learners. In linguistics, Chomsky wrote a scathing critique of Skinner’s 1957 book Verbal Behavior. Skinner characterized language as a set of stimulus-response patterns, but Chomsky argued that this couldn’t account for human generalization to a very wide range of novel sentences. Eventually, psychology moved towards a synthesis in which study of behavior was paired with study of cognition.
ML today is analogous to psychology in the 1950s. Most ML researchers are behaviorists with respect to studying AIs. They focus on how training data determines behavior, and assume that AI behavior is driven by “bundles of heuristics” except in the cases where it’s demonstrated otherwise. This makes sense on narrow tasks, where it’s possible to categorize and study different types of behavior. But when models display consistent types of behavior across many different tasks, it becomes increasingly difficult to predict that behavior without reference to the underlying cognition going on inside the models. (Indeed, this observation can be seen as the core of the alignment problem: we can’t deduce internal motivations from external behavior.)
ML researchers often shy away from studying model cognition, because the methodology involved is often less transparent and less reproducible than simply studying behavior. This is analogous to how early ethologists who studied animals “in the wild” were disparaged for using unrigorous qualitative methodologies. However, they gradually collected many examples of sophisticated behavior (including tool use, power struggles, and cultural transmission) which eventually provided much more insight than narrow, controlled experiments performed in labs.
Similarly, I expect that the study of the internal representations of neural networks will gradually accumulate more and more interesting data points, spark more and more concrete hypotheses, and eventually provide us with principles for understanding neural networks’ real-world behavior that are powerful enough to generalize even to very intelligent agents in very novel situations.
A better definition of alignment research
We can combine the two criteria above to give us a two-dimensional categorization of different types of AI research. In the table below, I give central examples of each type (with the pessimistic-case category being an intermediate step between average-case and worst-case):
|                     | Average-case                   | Pessimistic-case             | Worst-case             |
|---------------------|--------------------------------|------------------------------|------------------------|
| Engineering         | Scaling                        | RLHF                         | Adversarial robustness |
| Behaviorist science | Optimization science           | Scalable oversight           | AI control             |
| Cognitivist science | Concept-based interpretability | Mechanistic interpretability | Agent foundations      |
There’s obviously a lot of research that I’ve skipped over, and exactly where each subfield should be placed is inherently subjective. (For example, there’s been a lot of rigorous scientific research on adversarial examples, but in practice it seems like the best mitigations are fairly hacky and unprincipled, which is why I put adversarial robustness in the “engineering” row.) But I nevertheless think that these are valuable categories for organizing our thinking.
We could just leave it here. But in practice, I don’t think people will stop talking about “alignment research” and “capabilities research”, and I’d like to have some definition of each that doesn’t feel misguided. So, going forward, I’ll define alignment research and capabilities research as research that’s close to the bottom-right and top-left corners respectively. This defines a spectrum between them; I’d like more researchers to move towards the alignment end of the spectrum. (Though recall, as I noted above, that most ML research is neither alignment nor capabilities research, but instead applications research.)
Lastly, I expect all of this to change over time. For example, the central examples of adversarial attacks used to be cases where image models gave bad classifications. Now we also have many examples of jailbreaks which make language models ignore developers’ instructions. In the future, central examples of adversarial attacks will be ones which make models actively try to cause harmful outcomes. So I hope that eventually different research directions near the bottom-right corner of the table above will unify into a rigorous science studying artificial values. And further down the line, perhaps all parts of the table will unify into a rigorous science of cognition more generally, encompassing not just artificial but also biological minds. For now, though, when I promote alignment research, I mean that I’m trying to advance worst-case and/or cognitivist AI research.