Disclaimer: I recently started as an interpretability researcher at Anthropic, but I wrote this doc before starting, and it entirely represents my personal views, not those of my employer.
Intended audience: People who understand why you might think that AI Alignment is important, but want to understand what AI researchers actually do and why.
Epistemic status: My best guess.
Epistemic effort: About 70 hours into the full sequence, and feedback from over 30 people
Special thanks to Sydney von Arx and Ben Laurense for getting me to actually finish this, and to all of the many, many people who gave me feedback. This began as my capstone project in the first run of the AGI Safety Fellowship, organised by Richard Ngo and facilitated by Evan Hubinger—thanks a lot to them both!
Meta: This is a heavily abridged overview (4K words) of a longer doc (25K words) I’m writing, giving my birds-eye conceptualisation of the field of Alignment. This post should work as a standalone and accessible overview, without needing to read the full thing. I’ve been polishing and re-polishing the full doc for far too long, so I’m converting it into a sequence and I’m publishing this short summary now as an introductory post, and trying to get the rest done over Christmas. Each bolded and underlined section header is expanded into a full section in the full thing, and will be posted to the Alignment Forum. I find detailed feedback super motivating, so please let me know what you think works well and doesn’t!
Terminology note: There is a lot of disagreement about what “intelligence”, “human-level”, “transformative” or “AGI” even mean. For simplicity, I will use AGI as a catch-all term for ‘the kind of powerful AI that we care about’. If you find this unsatisfyingly vague, OpenPhil’s definition of Transformative AI is my favourite precise definition.
What needs to be done to make the development of AGI safe? This is the fundamental question of AI Alignment research, and there are many possible answers.
I’ve spent the past year trying to get into AI Alignment work, and at first I found it pretty confusing to get my head around what’s going on. Anecdotally, this is a common experience. The best way I’ve found of understanding the field is by understanding the different approaches to this question. In this post, I try to write up the most common schools of thought on this question, and break down the research that goes on according to which perspective it best fits.
There are already some excellent overviews of the field: I particularly like Paul Christiano’s Breakdown and Rohin Shah’s literature review and interview. The thing I’m trying to do differently here is focus on the motivations behind the work. AI Alignment work is challenging and confusing because it involves reasoning about future risks from a technology we haven’t invented yet. Different researchers have a range of views on how to motivate their work, and this results in a wide range of work, from writing papers on decision theory to training large language models to summarise text. I find it easiest to understand this range of work by framing it as different ways to answer the same fundamental question.
My goal is for this post to be a good introductory resource for people who want to understand what Alignment researchers are actually doing today. I assume familiarity with a good introductory resource, eg Superintelligence, Human Compatible or Richard Ngo’s AGI Safety from First Principles, and that readers have a sense for what the problem is and why you might care about it. I begin with an overview of the most prominent research motivations and agendas. I then dig into each approach, and the work that stems from that view. I especially focus on the different threat models for how AGI leads to existential risk, and the different agendas for actually building safe AGI. In each section, I link to my favourite examples of work in each area, and the best places to read more. Finally, as another way to understand the high-level differences in research motivations, I discuss the different underlying beliefs about how AGI will go, which I’ll refer to as crucial considerations.
I broadly see five main types of approach to Alignment research, and I break this piece into five main sections, one analysing each approach.
Note: The space of Alignment research is quite messy, and it’s hard to find a categorisation that carves reality at the joints. As such, lots of work will fit into multiple parts of my categorisation.
Addressing threat models: We keep a specific threat model in mind for how AGI causes an existential catastrophe, and focus our work on things that we expect will help address the threat model.
Agendas to build safe AGI: Let’s make specific plans for how to actually build safe AGI, and then try to test, implement, and understand the limitations of these plans. The emphasis is on understanding how to build AGI safely, rather than trying to do it as fast as possible.
Robustly good approaches: In the long-run AGI will clearly be important, but we’re highly uncertain about how we’ll get there and what, exactly, could go wrong. So let’s do work that seems good in many possible scenarios, and doesn’t rely on having a specific story in mind. Interpretability work is a good example of this.
De-confusion: Reasoning about how to align AGI involves reasoning about complex concepts, such as intelligence, alignment and values, and we’re pretty confused about what these even mean. This means any work we do now is plausibly not helpful and definitely not reliable. As such, our priority should be to do some conceptual work on how to think about these concepts and what we’re aiming for, and trying to become less confused.
I consider the process of coming up with each of the research motivations outlined in this post to be an example of good de-confusion work.
Field-building: One of the biggest factors in how much Alignment work gets done is how many researchers are working on it, so a major priority is building the field. This is especially valuable if you think we’re confused about what work needs to be done now, but will eventually have a clearer idea once we’re within a few years of AGI. When this happens, we want a large community of capable, influential and thoughtful people doing Alignment work.
This is less relevant to technical work than the previous sections. I include it because I both think that technical researchers are often best placed to do outreach and grow the field, and because an excellent way to grow the field is by doing high-quality work that other researchers are excited to build upon.
Within this framework, I find the addressing threat models and agendas to build safe AGI sections the most interesting and think they contain the most diversity of views, so I expand these into several specific models and agendas.
Addressing threat models
There are a range of different concrete threat models. Within this section, I focus on three threat models that I consider most prominent, and which most current research addresses.
Treacherous turns: We create an AGI that is pursuing large-scale end goals that differ from ours. This results in convergent instrumental goals: the agent is incentivised to do things such as preserve itself and gain power, because these will help it achieve its end goals. In particular, this incentivises the AGI to deceive us into thinking it is aligned until it has enough power to decisively take over and pursue its own arbitrary end goals, known as a treacherous turn. This is the classic case outlined in Nick Bostrom’s Superintelligence, and Eliezer Yudkowsky’s early writing.
Sub-Threat model: Inner Misalignment. A particularly compelling way this could happen is inner misalignment: the system is itself pursuing a goal, which may not be the goal that we gave it. This is notoriously confusing, so I’ll spend more time explaining this concept than the others. See Rob Miles’ video for a more in-depth summary.
A motivating analogy: Evolution is an optimization process that produced humans, but from the perspective of evolution, humans are misaligned. Evolution is an optimization process which selects for organisms that are good at reproducing themselves. This produced humans, who were themselves optimizers pursuing goals such as food, status, and pleasure. In the ancestral environment pursuing these goals meant humans were good at reproducing, but in the modern world these goals do not optimize for reproduction, eg we use birth control.
The core problem is that evolution was optimizing organisms for the objective of ‘how well do they survive and reproduce’, but was selecting them according to their performance in the ancestral environment. Reproduction is a hard problem, so it eventually produced organisms that were themselves optimizers pursuing goals. But because these goals just needed to lead to reproduction in the ancestral environment, they didn’t need to be the same as evolution’s objective. Now that humans are in a different environment, the difference is clear, and this is an alignment failure.
Analogously, we train neural networks with an objective in mind, but just select them according to their performance on the training data. For a sufficiently hard problem, the resulting network may be an optimizer pursuing a goal, but all we know is that the network’s goal has good performance on the training data, according to our goal. We have no guarantee that the network’s goal is the objective we had in mind, and so cannot resolve treacherous turns by setting the right training objective. The problem of aligning the network’s goal with the training objective is the inner alignment problem.
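To make the point above concrete, here is a minimal toy sketch of my own (not taken from any of the linked work). The features, datasets and goal functions are all made up for illustration. It shows two candidate goals that score identically on the training data, so selecting on training performance alone cannot tell them apart, even though they diverge completely once the environment changes.

```python
# Toy sketch (all names hypothetical): two candidate goals a trained system
# might have ended up with. Each input is (reached_exit, touched_green_tile).
# In training the two features always coincide, so the data cannot tell the
# intended goal apart from the proxy goal.
TRAIN = [((1, 1), 1), ((0, 0), 0), ((1, 1), 1), ((0, 0), 0)]
# In deployment the features come apart.
DEPLOY = [((1, 0), 1), ((0, 1), 0)]


def intended_goal(x):
    """The objective we had in mind: did the agent reach the exit?"""
    return x[0]


def proxy_goal(x):
    """A different goal that happens to agree with ours on TRAIN."""
    return x[1]


def accuracy(rule, data):
    """Fraction of examples where the rule's output matches the label."""
    return sum(rule(x) == y for x, y in data) / len(data)


for name, rule in [("intended", intended_goal), ("proxy", proxy_goal)]:
    print(name, "train:", accuracy(rule, TRAIN), "deploy:", accuracy(rule, DEPLOY))
```

Both rules are perfect on TRAIN, and only the deployment data separates them. Real networks are of course far messier than a pair of hand-written rules, so treat this as an intuition pump rather than a model of how inner misalignment would actually arise.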
You get what you measure: The case given by Paul Christiano in What Failure Looks Like (Part 1):
To train current AI systems we need to give them simple and easy-to-measure reward functions. So, to achieve complex tasks, such as winning a video game, we often need to give them simple proxies, such as optimising score (which can go wrong…).
Extrapolating into the future, as AI systems become increasingly influential and are trained to solve complex tasks in the real world, we will need to give them easy-to-measure proxies to optimize. This is analogous to telling them to optimize GDP in order to maximise human prosperity.
By definition, these proxies will not capture everything we value and will need to be adjusted over time. But in the long-run they may be locked-in, as AI systems become increasingly influential and an indispensable part of the global economic system. An example of partial lock-in is climate change: though the hidden costs of fossil fuels are now clear, they’re so ubiquitous and influential that society is struggling to transition away from them.
The phenomenon of ‘you get what you measure’ is already common today, but may be much more concerning for AGI for a range of reasons. For example: AI systems are black boxes that are incomprehensible to humans, meaning it’s hard to notice problems with how they understand their proxies; and AI capabilities may progress very rapidly, making it far harder to regulate the systems, notice problems, or adjust the proxies.
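The gap between a proxy and what we actually value can be sketched with a toy optimisation problem. Everything here is an assumption of mine for illustration: a fixed budget of effort, a proxy that only sees one activity, and a made-up "true value" function with diminishing returns on both activities.

```python
import math

BUDGET = 10  # hypothetical units of effort to allocate


def proxy(measured_effort):
    """What we can measure: only the effort spent on the measured activity."""
    return measured_effort


def true_value(measured_effort):
    """What we actually care about: both activities matter, with diminishing returns."""
    unmeasured_effort = BUDGET - measured_effort
    return math.sqrt(measured_effort) + math.sqrt(unmeasured_effort)


allocations = range(BUDGET + 1)
best_by_proxy = max(allocations, key=proxy)
best_by_true_value = max(allocations, key=true_value)

print("optimising the proxy picks:", best_by_proxy, "true value:", true_value(best_by_proxy))
print("optimising true value picks:", best_by_true_value, "true value:", true_value(best_by_true_value))
```

An optimiser pointed at the proxy pours the whole budget into the measured activity and starves the unmeasured one, ending up with less of what we care about than a balanced allocation. The stronger the optimiser, the more reliably it finds that corner.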
AI Influenced Coordination Failures: The case put forward by Andrew Critch, eg in What multipolar failure looks like. Many players get AGI around the same time. They now need to coordinate and cooperate with each other and the AGIs, but coordination is an extremely hard problem. We currently deal with this with a range of existing international norms and institutions, but a world with AGI will be sufficiently different that many of these will no longer apply, and we will leave our current stable equilibrium. This is such a different and complex world that things go wrong, and humans are caught in the cross-fire.
This is of relevance to technical researchers because there is research that may make cooperation in a world with many AGIs easier, eg interpretability work.
Further, the alignment problem is mostly conceived of as ensuring AGI will cooperate with its operator, rather than ensuring that a world with many operators and AGIs can all cooperate with each other; the latter is a big conceptual shift.
Note that this decomposition is entirely my personal take, and one I find useful for understanding existing research. For an alternate perspective and decomposition, see this recent survey of AI researcher threat models. They asked about five threat models (only half of which I cover here), and found that while opinions were often polarised, on average, the five models were rated as equally plausible.
Agendas to build safe AGI
There are a range of agendas proposed for how we might build safe AGI, though note that each agenda is far from a complete and concrete plan. I think of them more as a series of confusions to explore and assumptions to test, with the eventual goal of making a concrete plan. I focus on three agendas I consider most prominent—see Evan Hubinger’s Overview of 11 proposals for building safe advanced AI for more.
Iterated Distillation and Amplification (IDA): We start with a weak system, and repeatedly amplify it to a more capable but expensive to run system, and distill that amplified version down to one that’s cheaper to run.
This is a notoriously hard idea to explain well, so I spend more words on it than most other sections. Feel free to skip if you’re already familiar.
Motivation 1: We distinguish between narrow learning, where a system learns how to take certain actions, eg imitating another system, and ambitious learning, where a system is given a goal but may take arbitrary actions to achieve that goal. Narrow learning seems much easier to align because it won’t give us surprising ways to achieve a goal, but this also inherently limits the capabilities of our system. Can we achieve arbitrary capabilities only with narrow techniques?
Motivation 2: If a system is less capable than humans, we may be able to look at what it’s doing and understand it, and verify whether it is aligned. But it is much harder to scale this oversight to systems far more capable than us, as we lose the ability to understand what they’re doing. How can we verify the alignment of systems far more capable than us?
The core idea of IDA:
We want to build a system to perform a task, eg being a superhuman personal assistant.
We start with some baseline below human level, which we can ensure is aligned, eg imitating a human personal assistant.
We then Amplify this baseline, meaning we make a system that’s more expensive to run, but more capable. Eg, we give a human personal assistant many copies of this baseline, and the human can break tasks down into subtasks, and use copies of the system to solve them. Crucially in this example, as we have amplified the baseline by just making copies and giving them to a human, we should expect this to remain aligned.
We then Distill this amplified system, using a narrow technique to compress it down to a system that’s cheaper to run, though may not be as capable. Eg, we train a system to imitate the amplified baseline. As we are using a narrow technique, we expect this distilled system to be easy to align. And as the amplified baseline is more capable than the distilled system, we can use that to help ensure alignment, achieving scalable oversight.
We repeatedly amplify then distill. Each time we amplify, our capabilities increase, and each time we distill they decrease, but overall they improve: we take two steps forward, then one step back. This means that by repeatedly applying narrow techniques, we may be able to achieve far higher capabilities.
Caveat: The idea I’ve described is a fairly specific form of IDA. The term is sometimes used to vaguely describe a large family of approaches that recursively break down a complex problem, using some analogue of Amplification and Distillation, and which ensure alignment at each step.
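The shape of the amplify-then-distill loop can be sketched as follows. This is only a cartoon: I collapse "capability" into a single made-up number, and the gain and retention factors are arbitrary assumptions of mine, not estimates from any real system. The point is just the two-steps-forward, one-step-back structure.

```python
AMPLIFY_GAIN = 2.0        # assumed: a human with many copies is 2x as capable
DISTILL_RETENTION = 0.75  # assumed: imitation keeps 75% of that capability


def amplify(capability):
    """More capable but more expensive: a human coordinating copies of the system."""
    return capability * AMPLIFY_GAIN


def distill(capability):
    """Cheaper but somewhat weaker: a narrow imitation of the amplified system."""
    return capability * DISTILL_RETENTION


def run_ida(baseline, rounds):
    """Return the capability of the distilled system after each round."""
    history = [baseline]
    capability = baseline
    for _ in range(rounds):
        capability = distill(amplify(capability))
        history.append(capability)
    return history


history = run_ida(baseline=1.0, rounds=3)
print(history)
```

Capability only grows if the amplification gain outweighs the distillation loss (here 2.0 × 0.75 > 1). Whether that holds in practice, and whether alignment really is preserved at each step, are exactly the open questions of the agenda.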
AI Safety via Debate: Our goal is to produce AI systems that will truthfully answer questions. To do this, we need to reward the system when it says true things during training. This is hard, because if the system is much smarter than us, we cannot distinguish between true answers and sophisticated deception. AI Safety via Debate solves this problem by having two AI systems debate each other, with a third (possibly human) system judging the debate. Assuming that the two debaters are evenly matched, and assuming that it is easier to argue for true propositions than false ones, we can expect the winning system to give us the true answer, even if both debaters are far more capable than the judge.
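Here is a toy setting of my own construction in which the key assumption holds, ie where lying is harder to defend than the truth. Two debaters disagree about the sum of a long list, and the judge is only able to check a single number. The debaters repeatedly narrow the disagreement to half of the current range, and the judge checks the one element that remains. The function names and the bisection protocol are illustrative assumptions, not the protocol from the debate literature.

```python
def honest(xs, lo, hi):
    """Always reports the true sum of xs[lo:hi]."""
    return sum(xs[lo:hi])


def make_liar(lie_index, offset):
    """A debater whose claims are self-consistent but hide an error at lie_index."""
    def liar(xs, lo, hi):
        hidden = offset if lo <= lie_index < hi else 0
        return sum(xs[lo:hi]) + hidden
    return liar


def debate(xs, debater_a, debater_b):
    """Bisect towards the disagreement; the judge checks only one element."""
    lo, hi = 0, len(xs)
    rounds = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if debater_a(xs, lo, mid) != debater_b(xs, lo, mid):
            hi = mid  # they disagree about the left half, so look there
        else:
            lo = mid  # otherwise the disagreement must be in the right half
        rounds += 1
    truth = xs[lo]  # the single fact the weak judge verifies directly
    a_correct = debater_a(xs, lo, hi) == truth
    b_correct = debater_b(xs, lo, hi) == truth
    if a_correct == b_correct:
        return "tie", rounds
    return ("A" if a_correct else "B"), rounds


numbers = list(range(1024))
winner, rounds = debate(numbers, honest, make_liar(lie_index=700, offset=5))
print(winner, rounds)
```

The judge never sums the list, yet the honest debater wins after about log2(n) rounds, because a consistent lie has to live somewhere and the opponent can point at it. Whether real questions decompose this cleanly is the crux of the agenda.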
Solving Assistance Games: This is Stuart Russell’s agenda, which argues for a perspective shift in AI towards a more human-centric approach.
This views the fundamental problem of alignment as learning human values. These values are in the mind of the human operator, and need to be loaded into the agent. So the key thing to focus on is how the operator and agent interact during training.
In the current paradigm, the only interaction is the operator writing a reward function to capture their values. This is an incredibly limited approach, and the field needs a perspective shift to have training processes with much more human-agent interaction. Russell calls these new training processes assistance games.
Russell argues for a paradigm with 3 key features: we judge systems according to how well they optimise our goals, the systems are uncertain about what these goals are, and these are inferred from our behaviour.
The focus is on changing the perspective and ways of thinking in the field, rather than on specific technical details, but Russell has also worked on some specific implementations of these ideas, such as Cooperative Inverse Reinforcement Learning.
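The three features Russell argues for (optimising the human’s goals, being uncertain about them, and inferring them from behaviour) can be sketched with a tiny Bayesian example. This is a simplification of mine, much closer to plain Bayesian goal inference than to full Cooperative Inverse Reinforcement Learning. The candidate reward functions, the noisy-choice model and the `beta` parameter are all illustrative assumptions.

```python
import math

ITEMS = ["coffee", "tea", "water"]

# Hypothetical candidate reward functions the human might have.
CANDIDATE_REWARDS = {
    "likes_coffee": {"coffee": 1.0, "tea": 0.0, "water": 0.0},
    "likes_tea": {"coffee": 0.0, "tea": 1.0, "water": 0.0},
}


def choice_likelihood(choice, reward, beta=2.0):
    """Assumed model: the human picks items noisily, favouring higher reward."""
    weights = {item: math.exp(beta * reward[item]) for item in ITEMS}
    return weights[choice] / sum(weights.values())


def update(posterior, choice):
    """Bayes update of the agent's beliefs after watching one human choice."""
    unnormalised = {
        h: p * choice_likelihood(choice, CANDIDATE_REWARDS[h])
        for h, p in posterior.items()
    }
    total = sum(unnormalised.values())
    return {h: v / total for h, v in unnormalised.items()}


def best_action(posterior):
    """Act to maximise expected reward under the current uncertainty."""
    def expected_reward(item):
        return sum(p * CANDIDATE_REWARDS[h][item] for h, p in posterior.items())
    return max(ITEMS, key=expected_reward)


posterior = {"likes_coffee": 0.5, "likes_tea": 0.5}  # agent starts out uncertain
for observed_choice in ["tea", "tea", "coffee"]:
    posterior = update(posterior, observed_choice)

print(posterior, best_action(posterior))
```

The agent never receives a hand-written reward function. It keeps a distribution over what the human wants, and a single out-of-character choice (the coffee) shifts its beliefs without overturning them.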
Robustly good approaches
Rather than the careful sequence of logical thought underlying the two above categories, robustly good approaches are backed more by a deep and robust-feeling intuition. They are the cluster thinking to the earlier motivations’ sequence thinking. This means that the motivations tend to be less rigorous and harder to clearly analyse, but are less vulnerable to a single weak point in a crucial underlying belief. Instead there are lots of rough arguments all pointing in the direction of the area being useful. Often multiple researchers may agree on how to push forwards on these approaches, while having wildly different motivations. I focus on the 3 key areas of interpretability, robustness and forecasting.
Note that robustly good does not mean that ‘there is no way this agenda is unhelpful’, it’s just a rough heuristic that there are lots of arguments for the approach being net good. It’s entirely possible that the downsides in fact outweigh the upsides.
(Conflict of interest: Note that I recently started work on interpretability under Chris Olah, and many of the researchers behind scaling laws are now at Anthropic. I formed the views in this section before I started work there, and they entirely represent my personal opinion, not those of my employer or colleagues.)
Interpretability: The key idea of interpretability is to demystify the black box of a neural network and better understand what’s going on inside. This often rests on the implicit assumption that a network can be understood. I focus on mechanistic interpretability, which focuses on finding the right tools and conceptual frameworks to interpret a network’s parameters.
I consider Chris Olah’s Circuits Agenda to be one of the most ambitious and exciting efforts here. It seeks to break a network down into understandable pieces, connected together via human-comprehensible algorithms implemented by the parameters. This has produced insights such as neurons in image networks often encoding comprehensible features, or reverse engineering the network’s parameters to extract the algorithm used to detect curves.
The key intuition for why to care about this is that many risks are downstream of us not fully understanding the capabilities and limitations of our systems, and this leading to unwise and hasty deployment. Particular reasons I find striking:
This may allow a line of attack on inner alignment: training a network is essentially searching for parameters with good performance. If many sets of parameters have good performance, then the only way to notice subtler differences between them is via interpretability.
Understanding systems better may allow better coordination and cooperation between actors with different powerful AIs
It may allow a saving throw to detect misalignment before deploying a dangerous system in the real world
We may better understand concrete examples of misaligned systems, gaining insight that can be used to align them and to better understand the problem.
This case is laid out more fully in Chris Olah’s Views on AGI Safety.
Robustness: The study of systems that generalise well beyond the distribution of data they were trained on, without catastrophically breaking. Adversarial examples are a classic illustration: image networks detect subtle textures in an image that are imperceptible to a human. By changing these subtle textures, we can make a network highly confident that an image is, eg, a gibbon, while a human still thinks it looks like a panda. More generally, robustness is a large subfield of modern Machine Learning, focused on ensuring that systems fail gracefully on unfamiliar data, give appropriate confidences and uncertainties on difficult data points, are robust to adversaries, etc.
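To give a feel for how a tiny per-pixel change can flip a prediction, here is a minimal sketch using a hand-built linear classifier rather than a real image network. The weights, the "image" and the labels are all made up. The perturbation nudges every input against the sign of the gradient of the score, in the style of the fast gradient sign method.

```python
DIM = 100  # number of "pixels" in the toy image

# Hypothetical linear classifier: positive score means "panda".
WEIGHTS = [0.1 if i % 2 == 0 else -0.1 for i in range(DIM)]
image = [0.52 if i % 2 == 0 else 0.48 for i in range(DIM)]


def score(x):
    """Linear score; for this model its gradient with respect to x is WEIGHTS."""
    return sum(w * xi for w, xi in zip(WEIGHTS, x))


def predict(x):
    return "panda" if score(x) > 0 else "gibbon"


def sign(v):
    return (v > 0) - (v < 0)


def adversarial(x, epsilon):
    """Move every pixel by at most epsilon in the direction that lowers the score."""
    return [xi - epsilon * sign(w) for w, xi in zip(WEIGHTS, x)]


EPSILON = 0.05
perturbed = adversarial(image, EPSILON)

print(predict(image), round(score(image), 3))
print(predict(perturbed), round(score(perturbed), 3))
```

No single pixel moves by more than 0.05, but because all 100 small changes push the score the same way, their effect adds up and the label flips. High-dimensional inputs make this kind of attack cheap.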
Why care? Fundamentally, many concerns about AI misalignment are forms of accident risk. The operators are not malicious, so if a disaster happens it is likely because the system did well in training but acted unexpectedly badly when deployed in the real world. The operators aren’t trying to cause extinction! The real world is a different distribution of data than training data, and so this is fundamentally a failure of generalisation. And better understanding these failures seems valuable.
Eg, deception during training that is stopped once the AI is no longer under our control is an example of (very) poor generalisation.
Eg, systemic risks such as all self-driving cars in a city failing all at once.
Eg, systems failing to act sensibly during unprecedented world events, eg a self-driving car not coping with snow in Texas, or a personal assistant AI scheduling in-person appointments during a pandemic.
Dan Hendrycks makes the case for the importance of robustness (and other subfields of ML).
Forecasting: A key question when thinking about AI Alignment is timelines—how long until we produce human-level AI? If the answer is, say, over 50 years, the problem is far less urgent and high-priority than if it’s 20. On a more granular level, with forecasting we might seek to understand what capabilities to expect when, which approaches might scale to AGI and which will hit a wall, which capabilities will see continuous growth vs a discontinuous jump, etc.
In my opinion, some of the most exciting work here is scaling laws, which take a high-energy physics style approach to systematically studying large language models. These have found that scale is a major driver of model performance, and further that this follows smooth and predictable laws, as we might expect from a natural process in physics.
The loss can be smoothly extrapolated from our current models, and seems to be driven by power laws in the available data, compute and model size.
These extrapolations have been confirmed by later models such as GPT-3, and so have made genuine predictions rather than overfitting to existing data.
Ajeya Cotra has extended this research to estimate timelines until our models scale to the capabilities of the human brain.
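The basic move behind extrapolating a power law can be sketched in a few lines. A law of the form loss = a × size^(−b) is a straight line in log-log space, so it can be fitted with ordinary least squares and then extended to larger sizes. The data below is synthetic, generated from constants I picked arbitrarily. It is not taken from any real scaling-laws paper.

```python
import math

# Synthetic data from a made-up law: loss = 10 * size ** -0.1
TRUE_A, TRUE_B = 10.0, 0.1
model_sizes = [1e6, 1e7, 1e8, 1e9]
losses = [TRUE_A * n ** -TRUE_B for n in model_sizes]


def fit_power_law(sizes, values):
    """Least-squares line in log-log space; returns (a, b) for a * size ** -b."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


a, b = fit_power_law(model_sizes, losses)
predicted = a * 1e10 ** -b  # extrapolate one order of magnitude past the data

print(a, b, predicted)
```

On noiseless synthetic data the fit recovers the constants almost exactly, which says nothing about the real world. What makes the real scaling-laws results interesting is that measured losses stay close to such a line across many orders of magnitude.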
The case for this is fairly simple—if we better understand how long we have until AGI and what the path there might look like, we are far better placed to tackle the ambitious task of doing useful work now to influence a future technology.
This may be decision relevant, eg a 10 year plan to go into academia and become a professor makes far more sense with long timelines, while doing directly useful work in industry now may make more sense with short timelines.
If we understand which methods will and will not scale to AGI, we may better prioritise our efforts towards aligning the most relevant current systems.
The point of this post is to help you gain traction on what different alignment researchers are doing and what they believe. Beyond focusing on research motivations, another valuable way I’ve found to get insight is to focus on crucial considerations: underlying beliefs about AI that often generate the high-level differences in motivation and agendas. So in the sixth and final section I focus on these. There are many possible crucial considerations, but I discuss four that seem to be the biggest generators of action-relevant disagreement:
Prosaic AI Alignment: To build AGI, we will need to have a bunch of key insights. But perhaps we have already found all the key insights. If so, AGI will likely look like our current systems but better and more powerful. And we should focus on aligning our current most powerful systems and other empirical work. Alternately, maybe we’re missing some fundamental insights and paradigm shifts. If so, we should focus more on robustly good approaches, field-building, conceptual work, etc.
Sharpness of takeoff: Will the capabilities of our systems smoothly increase between now and AGI? Or will we be taken by surprise, by a discontinuous jump in capabilities? The alignment problem seems much harder in the second case, and we are much less likely to get warning shots—examples of major alignment failures in systems that are too weak to successfully cause a catastrophe.
Timelines: How long will it be until we have AGI? Work such as de-confusion and field-building looks much better on longer timelines, empirical work may look better on shorter timelines, and if your timelines are long enough you probably don’t prioritise AI Alignment work at all.
How hard is alignment?: How difficult is it going to be to align AGI? Is there a good chance that we’re safe by default, and just need to make that more robustly likely? Or are most outcomes pretty terrible, and we likely need to slow down and radically rethink our approaches?
Regarding future posts in the sequence:
The hope is that this introduction will serve as an accessible and standalone overview of the field, and allow me to get feedback on my breakdown, while providing more urgency on publishing the full sequence. I expect to work on the full thing over Christmas, and expect to publish each section as it’s ready as a further post in a sequence on the Alignment Forum. Each section header that is bolded and underlined will be significantly expanded, and I will link from here to posts in the sequence when they’re done. Note: The sequence is not linear, so you can read the posts in any order, according to your interests.
My main intended contribution is to break down the field of alignment into different research agendas, and to analyse the motivations and theories of change behind them, and to give a lens to analyse the field for someone new and overwhelmed by what’s going on. Please give any feedback you have on ways I do and do not succeed at this, and ways this could have been more useful to you!
You can read a draft of the full sequence here.