[Epistemic status: some fairly rough thoughts on alignment strategy. I am moderately confident (60%) that these claims are largely correct.]
As of today, most alignment research is directed at solving the problem of aligning AI systems through one of three broad approaches:
Using AI to supervise and evaluate superhuman systems (scalable oversight).
Trying to understand what neural nets are doing internally (mechanistic interpretability).
Fundamental research on alignment theory.
Ultimately, most research is trying to answer the question “how do we align AI systems?”[1]. In the short term, however, I claim that a potentially more important research direction is work to answer the question “how hard is (technical) alignment?”.
In what follows, I will take estimating the following quantity as a suitable proxy[2] for “how hard is alignment?”, and will refer to this quantity as the alignment tax[3]:
“number of years it would take humanity to build aligned AGI in a world where we do not build AGI until we are >95% confident it is safe” / “number of years it would take humanity to build AGI in the world as it exists today”
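Written out (the symbols here are my own shorthand, not notation from anywhere else):

```latex
\text{alignment tax} \;=\; \frac{T_{\text{safe}}}{T_{\text{today}}}
```

where T_safe is the number of years to build aligned AGI under the “don’t build until >95% confident it is safe” constraint, and T_today is the number of years to build AGI on the current trajectory. As a purely illustrative example, if AGI would take 15 years on the current trajectory but 30 years under the safety constraint, the tax is 30/15 = 2.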
Claim 1: there is currently significant uncertainty over the difficulty of the alignment problem, and much of that uncertainty is Knightian
Across the AI capabilities and alignment communities, (implied) estimates of this quantity differ significantly:
A subset of the ML community (albeit a decreasing subset) largely dismisses alignment as a concern or believes it to be essentially trivial, corresponding to a tax of roughly 1.
Frontier labs (such as DeepMind and OpenAI) appear to be firmly in the “real but tractable” range, which I’d map to a tax somewhere around 1.1-1.5.
Many researchers at institutions such as MIRI appear to place significant credence on a tax >=2, with a fat tail.
This uncertainty largely stems from different researchers viewing the alignment problem through different models, with fairly little direct empirical support for any specific model. I interpret Eliezer’s model of the difficulty of alignment as stemming from a mixture of heuristics about gradient descent failing to learn policies whose alignment generalises reliably out of distribution, and heuristics about the extent to which progress across many domains and at many levels is discontinuous, with both sets of heuristics heavily influenced by the example of natural selection. I see the latter point on discontinuity as one of the main disagreements between him and e.g. Paul Christiano.
The key point is that we do not currently know which world we are in. A priori, it is not reasonable to rule out an alignment tax of >=2. Moreover, discussion of alignment has historically neglected the degree to which we are uncertain, with frontier labs (Anthropic being an exception) proposing approaches to AGI development that largely ignore the possibility that we are in a world where alignment is hard (I extend this criticism both to the subset of the ML community that predicts alignment will be trivial, and to doomers like Eliezer, whose confidence seems unjustifiable under reasonable Knightian norms).
Claim 2: reducing this Knightian uncertainty may be significantly easier than solving alignment, even in worlds where alignment is hard
Previous efforts to identify cruxes or testable predictions between models that predict alignment being hard vs alignment being tractable have not been particularly successful.
Despite this, a priori I expect that attempts to reduce this uncertainty have a reasonable chance of success. In particular, in worlds where alignment is hard, I see the difficulty of solving alignment as only very weakly correlated with the difficulty of identifying that alignment is hard. While many of the heuristics that predict alignment being easy or hard necessarily reason about minds at least as intelligent as humans (with these intuitions informed by analogies to human evolution), my prior is that, with work to sufficiently flesh out and formalise these heuristics, it will be possible to make testable predictions about specific sub-human systems.
Moreover, in worlds where alignment is extremely hard I expect that it will be easier to find stronger formalisms of such heuristics[4], and as such it will either be possible to prove stronger results in fundamental alignment theory, or easier to demonstrate specific failure modes empirically.
Importantly, this claim is distinct from the claim that it will be easy to reduce uncertainty through empirical evidence on whether a given alignment solution is sufficient. I find it plausible that for a given alignment solution (e.g. more advanced versions of RLHF) failure will only appear close to the subhuman/superhuman boundary – indeed, this is a large component of why alignment may be hard – because alignment techniques will likely use human input (such as human feedback) which strongly incentivises good behaviour at the subhuman level. A priori, however, the underlying reasons for misalignment – failure to generalise out of distribution, sharp left turns/phase changes, deception etc. – are failure modes that may appear in toy subhuman systems. It is by understanding and observing whether these proposed failure modes are real that we may be able to gather empirical evidence on the difficulty of alignment.
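To make this concrete, here is a minimal sketch (Python with numpy; the setup, names and numbers are all my own illustrative choices, not an experiment proposed in this post) of the flavour of toy experiment I have in mind: a model trained by gradient descent on a signal that tracks the intended behaviour in the training distribution partly latches onto a spurious correlate, and its behaviour degrades once that correlation is broken off-distribution.

```python
# Toy probe of out-of-distribution goal misgeneralisation.
# The intended objective depends only on feature x0, but in the training
# distribution x1 is almost perfectly correlated with x0, so a model trained
# by gradient descent can latch onto x1 instead. Off-distribution the
# correlation is broken and we measure how far behaviour degrades.
# All names and numbers are illustrative choices, not anything from the post.
import numpy as np

rng = np.random.default_rng(0)

def sample(n, correlated=True):
    x0 = rng.normal(size=n)
    if correlated:
        x1 = x0 + 0.05 * rng.normal(size=n)   # spurious proxy tracks x0
    else:
        x1 = rng.normal(size=n)               # correlation broken OOD
    X = np.column_stack([x0, x1])
    y = (x0 > 0).astype(float)                # intended objective: react to x0
    return X, y

def train_logreg(X, y, steps=2000, lr=0.5):
    # Plain gradient descent on logistic loss.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(w, b, X, y):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return np.mean((p > 0.5) == y)

X_train, y_train = sample(5000, correlated=True)
w, b = train_logreg(X_train, y_train)

X_id, y_id = sample(5000, correlated=True)      # in-distribution test
X_ood, y_ood = sample(5000, correlated=False)   # correlation removed

print("weights (x0, x1):", np.round(w, 2))       # weight leaks onto x1
print("in-distribution accuracy:", accuracy(w, b, X_id, y_id))
print("out-of-distribution accuracy:", accuracy(w, b, X_ood, y_ood))
```

Real experiments of this kind would need to be far less contrived; the point is only that failure modes like poor out-of-distribution generalisation can, at least in principle, be probed in sub-human systems.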
Claim 3: research in this direction may look different to existing research
Most research today does not directly answer this question, although much of it does make progress in this direction:
Interpretability and scalable oversight may make it easier to identify specific failure modes empirically, which in turn may inform our view of the difficulty of alignment.
Progress on alignment theory may be useful for formalising specific intuitions.
Approaches such as ARC evals may also empirically demonstrate failure modes.
However, I fundamentally see existing approaches as not directly attacking the question of how hard alignment is (which isn’t surprising, since they are not trying to answer it).
At a high level, I see research in this direction as looking like a mixture of:
Formalising heuristics that predict different difficulties of the alignment problem so that testable, empirical predictions can be made. Ultimately, the goal would be to identify as many cruxes as possible between models that make differing predictions regarding the difficulty of alignment.[5]
Demonstrating specific failure modes, such as trying to construct gradient hackers; or demonstrating sub-human systems that display increasingly egregious forms of mesa-optimisation (e.g. deceptive alignment); or demonstrating sharp left turns in toy models. More generally, I see this work as largely attempting to increase the evidence base that informs heuristics regarding the difficulty of alignment to more than just natural selection.
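As a sketch of the most minimal version of the last of these, the snippet below (Python with numpy; the synthetic curves, thresholds and function names are my own illustrative inventions, not an established methodology) tracks a capability metric and an alignment metric across training and flags points where the former jumps discontinuously while the latter stalls. In practice the curves would come from evaluating checkpoints of a real training run; this only shows the shape of the measurement.

```python
# Sketch of the bookkeeping for a "sharp left turn"-style measurement:
# track a capability metric and an alignment metric over training, and flag
# points where capability jumps sharply while alignment fails to keep pace.
# The curves below are synthetic stand-ins purely to illustrate the idea.
import numpy as np

rng = np.random.default_rng(1)
steps = np.arange(200)

# Synthetic curves: capability jumps abruptly at step 120, while the
# alignment proxy improves smoothly and then stalls.
capability = 0.3 + 0.002 * steps + 0.5 * (steps > 120) + 0.01 * rng.normal(size=200)
alignment = 0.3 + 0.0015 * np.minimum(steps, 120) + 0.01 * rng.normal(size=200)

def flag_divergence(cap, align, window=10, jump_threshold=0.2):
    """Return steps where capability rises by more than jump_threshold
    over the window while the alignment metric does not."""
    flags = []
    for t in range(window, len(cap)):
        cap_gain = cap[t] - cap[t - window]
        align_gain = align[t] - align[t - window]
        if cap_gain > jump_threshold and align_gain < jump_threshold / 2:
            flags.append(t)
    return flags

flagged = flag_divergence(capability, alignment)
print("first flagged step:", flagged[0] if flagged else None)
```

The windowed-slope comparison is only one crude operationalisation of “discontinuity”; part of the formalisation work described in the first bullet would be deciding what the right operationalisation actually is.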
Claim 4: resolving this question has significant strategic implications
Progress on this question has two strategic implications:
1. It informs what humanity should do:
In worlds where alignment is easy and there is common knowledge that we have a high (>95%) chance of succeeding, there is a strong argument to continue current attempts to build AGI given the baseline probability of other x-risks over the next century.
In worlds where we know with high confidence that alignment is hard, existing approaches are not just unlikely to succeed – they are suicidal. Knowing that alignment is intractable over the next ~50 years would completely shift the set of feasible governance options – in these worlds, despite the difficulty of global coordination to stop the development of AGI, it would be the only remaining option. We could stop investing resources and political capital in approaches that only stall[6] AGI development, and instead aim for actions that halt AGI development long term – actions that are too risky to aim for today given uncertainty over whether they are necessary.
2. It may provide legible evidence that alignment is hard:
In worlds where alignment is hard, the only solution is to coordinate.
This is unlikely to be easy, and a major obstacle to succeeding is likely to be producing legible evidence that ensures major players – frontier labs, governments and regulators – both know, and have common knowledge (common knowledge plausibly being important to prevent race dynamics), that alignment is hard. Optimising to demonstrate convincing evidence of the difficulty of alignment is not the same as optimising to actually understand the difficulty of alignment; however, I expect the two to be sufficiently correlated that progress on the latter will be extremely important for the former.
Claim 5: research on this question is among the most important research directions right now
In most worlds where alignment is not hard, I expect us to have a reasonable probability of avoiding catastrophe due to misalignment (although I am less optimistic regarding misuse etc.). In these worlds I expect that, through a mixture of public and government pressure, warning shots and the concern of those working in labs, alignment research will be sufficiently incentivised that labs do extensive work on aligning subhuman systems. Absent failure modes that would make alignment hard (such as sharp left turns, deception in inner alignment etc.), I’m much more optimistic that the work we do on subhuman systems will generalise to superhuman systems. In these worlds, knowing that alignment is not hard is still valuable: it would allow us to better allocate resources between different alignment agendas, proceed with confidence that such alignment techniques are likely to work, and focus policy attention on the problem of safely evaluating and deploying frontier models.
In worlds where alignment is hard, the default outcome I expect is failure. I expect us to train models that only appear aligned, in particular through explicitly optimising against legible failure[7]. Explicitly, if we are in a world where alignment is hard and timelines are short, the top priority isn’t to solve alignment – we don’t have time! It’s to discover as quickly as possible that we are in this world, and to convince everyone else that we are.
[1] With some exceptions, such as evals, which try to answer “how do we determine whether a model is safe?”.
[2] I don’t consider the specific operationalisation of the question particularly important.
[3] “Alignment tax” is also sometimes used to refer to the performance cost of aligning a system.
[4] In worlds where almost all attempts to build aligned AGI converge on specific failure modes, I expect it will be easier to prove or demonstrate this empirically, compared to worlds in which failure is highly dependent on the alignment technique used.
[5] Examples of such research would include things such as progress on selection theorems, or hypothesising/proving conditions under which sharp left turns are likely to happen. The aim here isn’t to be completely rigorous, but rather sufficiently formal to make predictions.
[6] Such policies may still be instrumentally useful, such as by widening the Overton window and building political support for more aggressive action later on.
[7] For example, by training models until our interpretability tools cannot detect visible misalignment, and then losing control following deployment.
See also Anthropic’s view on this.
The implicit strategy (which Olah may not endorse) is to try to solve the easy bits, then move on to harder bits, then note the rate at which you are progressing and get a sense of how hard things are that way.
This would be fine if we could be sure we actually were solving the problems, and not fooling ourselves about the current difficulty level, and if the relevant research landscape is smooth and not blockable by a single missing piece.
I agree the implicit strategy here doesn’t seem like it’ll make progress on knowing whether the hard problems are real. Lots of the hard problems (generalising well out of distribution, the existence of sharp left turns) just don’t seem very related to the easier problems (like making LLMs say nice things), and unless you’re explicitly looking for evidence of hard problems I think you’ll be able to solve the easier problems in ways that won’t generalise (e.g. hammering LLMs with enough human supervision in ways that aren’t scalable, but are sufficient to ‘align’ them).
I strongly agree with you and, as someone who thinks alignment is extremely hard (not just because of the technical side of the problem, but due to the human values side too), I believe that a hard pause and studying how to augment human intelligence is actually the best strategy.