The case for more Alignment Target Analysis (ATA)

Summary

  • We don’t have good proposals for alignment targets: The most recently published version of Coherent Extrapolated Volition (CEV), a fairly prominent alignment target, is Parliamentarian CEV (PCEV). PCEV gives a lot of extra influence to anyone who intrinsically values hurting other individuals (search the CEV Arbital page for “ADDED 2023” for Yudkowsky’s description of the issue). This feature went unnoticed for many years and would make a successfully implemented PCEV very dangerous.

  • Bad alignment target proposals are dangerous: There is no particular reason to think that discovery of this problem was inevitable. It went undetected for many years. There are also plausible paths along which PCEV (or a proposal with a similar issue) might have ended up being implemented. In other words: PCEV posed a serious risk. That risk has probably been mostly removed by the Arbital update. (It seems unlikely that someone would implement a proposed alignment target without at least reading the basic texts describing the proposal.) PCEV is, however, not the only dangerous alignment target, and risks from scenarios where someone successfully hits some other bad alignment target remain.

  • Alignment Target Analysis (ATA) can reduce these risks. We will argue that more ATA is needed and urgent. ATA can informally be described as analyzing and critiquing Sovereign AI proposals, for example along the lines of CEV. By Sovereign AI we mean a clever and powerful AI that will act autonomously in the world (as opposed to tool AIs or a pivotal act AI of the type that follows human orders and that can be used to shut down competing AI projects). ATA asks what would happen if a Sovereign AI project were to succeed at aligning their AI to a given alignment target.

  • ATA is urgent. The majority of this post will focus on arguing that ATA cannot be deferred. A potential Pivotal Act AI (PAAI) might fail to buy enough calendar time for ATA since it seems plausible that a PAAI wouldn’t be able to sufficiently reduce internal time pressure. Augmented humans and AI assistants might fail to buy enough subjective time for ATA. Augmented humans that are good at hitting alignment targets will not necessarily be good at analyzing alignment targets. Creating sufficiently helpful AI assistants might be hard or impossible without already accidentally locking in an alignment target to some extent.

  • ATA is extremely neglected. The field of ATA is at a very early stage, and currently there does not exist any research project dedicated to ATA. The present post argues that this lack of progress is dangerous and that this neglect is a serious mistake.

A note on authorship

This text largely builds on previous posts by Thomas Cederborg. Chi contributed mostly by trying to make the text more coherent. While the post writes a lot in the “we” perspective, Chi hasn’t thought deeply about many of the points in this post yet, isn’t sure what she would endorse on reflection, and disagrees with some of the details. Her main motivation was to make Thomas’ ideas more accessible to people.

Alignment Target Analysis is important

This post is concerned with proposals for what to align a powerful and autonomous AI to, for example proposals along the lines of Coherent Extrapolated Volition (CEV). By powerful and autonomous we mean AIs that are not directly controlled by a human or a group of humans, as opposed to the types of proposed AI systems that some group of humans might use for limited tasks, such as shutting down competing AI projects. We will refer to this type of AI as Sovereign AI throughout this post. Many people both consider it possible that such an AI will exist at some point and further think that it matters what goal such an AI would have. This post is addressed to such an audience. (The imagined reader does not necessarily think that creating a Sovereign AI is a good idea, just that it is possible, and that if it happens, then it matters what goal such an AI has.)

A natural conclusion from this is that we need Alignment Target Analysis (ATA) at some point. A straightforward way of doing ATA is to take a proposal for something we should align an AI to (for example: the CEV of a particular set of people) and then ask what would happen if someone were to successfully hit this alignment target. We think this kind of work is very important. Let’s illustrate this with an example.

The most recently published version of CEV is based on extrapolated delegates negotiating in a Parliament. Let’s refer to this version of CEV as Parliamentarian CEV (PCEV). It turns out that the proposed negotiation rules of the Parliament give a very large advantage to individuals that intrinsically value hurting other individuals. People that want to inflict serious harm get a bigger advantage than people that want to inflict less serious harm. The largest possible advantage goes to any group that wants PCEV to hurt everyone else as much as possible. This feature makes PCEV very dangerous. However, it went unnoticed for many years, despite PCEV being a fairly prominent proposal. This illustrates three things:

  • First, it shows that noticing problems with proposed alignment targets is difficult.

  • Second, it shows that successfully implementing a bad alignment target can result in a very bad outcome.

  • Third, it shows that reducing the probability of such scenarios is feasible (the fact that this feature has been noticed makes it a lot less likely that PCEV will end up getting implemented).

This example shows that getting the alignment target right is extremely important and that even reasonable-seeming targets can be catastrophically bad. The flaws in PCEV’s negotiation rules are also not unique to PCEV. An AI proposal from 2023 uses similar rules and hence suffers from related problems. More ATA is needed because finding the right target is surprisingly difficult, noticing flaws is surprisingly difficult, and targets that look reasonable enough might lead to catastrophic outcomes.

The present post argues that making progress on ATA is urgent. As shown, the risks associated with scenarios where someone successfully hits a bad alignment target are serious. Our main thesis is that there might not be time to do ATA later. If one considers it possible that a Sovereign AI might be built, then the position that doing ATA now is not needed must rest on some form of positive argument. One class of such arguments is based on an assertion that ATA has already been solved. We already argued that this is not the case.

Another class of arguments is based on an assertion that all realistic futures fall into one of two possible categories, (i): scenarios with misaligned AI (in which case ATA is irrelevant), or (ii): scenarios where there will be plenty of time to do ATA later, so we should defer it to future, potentially enhanced, humans and their AI assistants. The present post will focus on countering arguments along these lines. We will further argue that these risks can be reduced by doing ATA. The conclusion is that it is important that ATA work starts now. However, there does not appear to exist any research project dedicated to ATA. This seems like a mistake to us.

Alignment Target Analysis is urgent

Let’s start by briefly looking at one common class of AI safety plans that does not feature ATA until a much later point. It goes something like this: Let’s make AI controllable, i.e. obedient, helpful, not deceptive, ideally without long-term goals of its own, just there to follow our instructions. We don’t align those AIs to anything more ambitious or object-level. Once we succeed at that, we can use those AIs to help us figure out how to build a more powerful AI sovereign safely and with the right kind of values. We’ll be much smarter with the help of those controllable AI systems, so we’ll also be in a better position to think about what to align sovereign AIs to. We can also use these controllable AI systems to buy more time for safety research, including ATA, perhaps by doing a pivotal act (in other words: use some form of instruction-following AI to take actions that shut down competing AI projects). So, we don’t have to worry about more ambitious alignment targets yet. In summary: We don’t actually need to worry about anything right now other than getting to the point where we have controllable AI systems that are strong enough to either speed up our thinking or slow down AI progress or, ideally, both.

One issue with such proposals is that it seems very difficult to us to make a controllable AI system that

  1. is able to substantially assist you with ATA or to buy you a lot of time,

  2. does not already implicitly constitute a substantial choice of alignment target, i.e. does not risk accidental lock-in.

If this is true, ATA is time-sensitive because it needs to happen before and alongside us developing controllable AI systems.

Why we don’t think the idea of a Pivotal Act AI (PAAI) obsoletes doing ATA now

Now, some argue that we can defer ATA by building a Pivotal Act AI (PAAI) that can stop all competing AI projects and hence buy us unlimited time. There are two issues with this: First, PAAI proposals need to balance buying time against avoiding accidental lock-in. The more effective an AI is at implementing a pivotal act that reliably prevents bad outcomes, the higher the risk that something has already been locked in.

For an extreme example, if your pivotal act is to have your AI autonomously shut down all “bad AI projects”, then some values have almost certainly already been locked in. A similar issue also makes it difficult for an AI assistant to find a good alignment target without many decisions having already been made (more below). If a system reliably shuts down all bad AIs, then the system must necessarily be built on top of some set of assumptions regarding what counts as a bad AI. This would mean that many decisions regarding the eventual alignment target have already been made (which in turn means that ATA would have to happen before any such AI is built). And if the AI does not reliably shut down all bad AI projects, then decisions will be in the hands of humans who might make mistakes.

Second, and more importantly, we haven’t yet seen enough evidence that a good pivotal act is actually feasible and that people will pursue it. In particular, current discussions of pivotal act AI seem to neglect internal time pressure. For example, we might end up in a situation where early AI is in the hands of a messy coalition of governments that are normally adversaries. Such a coalition is unlikely to pursue a unified, optimized strategy. Some members of the coalition will probably be under internal political pressure to train and deploy the next generation of AIs. Even rational, well-informed, and well-intentioned governments might decide to take a calculated risk and act decisively before the coalition collapses.

If using the PAAI requires consensus, then the coalition might decide to take decisive action before an election in one of the countries involved. Even if everyone involved is aware that this is risky, the option of ending up in a situation where the PAAI can no longer be used to prevent competing AI projects might be seen as even more risky. An obvious such action would be to launch a Sovereign AI aimed at whatever happens to be the state of the art alignment target at the time (in other words: build an autonomous AI with whatever goal is the current state of the art proposed AI goal at the time). Hence, even if we assume that the PAAI in question could be used to give them infinite time, it is not certain that a messy coalition would use it this way, due to internal conflicts.

Besides issues related to reasonable people trying to do the right thing by taking calculated risks, another issue is that the leaders of some countries might prefer that all important decisions are made before their term of office expires (for example by giving the go-ahead to a Sovereign AI project that is aiming at their favorite alignment target).

An alternative to a coalition of powerful countries would be to have the PAAI be under the control of a global electorate. In this case, a large but shrinking majority might decide to act before existing trends turn their values into a minority position. Political positions changing in fairly predictable ways is an old phenomenon. Having a PAAI that can stop outside actors from advancing unauthorized AI projects wouldn’t change that.

In addition, if we are really unlucky, corrigibility of weak systems can make things worse. Consider the case where a corrigibility method (or whatever method you use to control your AIs) turns out to work for an AI that is used to shut down competing AI projects, but does not work for Sovereign AIs. A group with such a partially functional corrigibility technique might take the calculated risk of launching a Sovereign AI that they hope is also corrigible (thinking that this is likely, because the method worked on a non-sovereign AI). Thus, if the state of the art alignment target has a flaw, then discovering this flaw is urgent. See also this post.

To summarize: Even if someone can successfully prevent outside actors from making AI progress, i.e. if we assume the existence of a PAAI that could, in principle, be used to give humanity infinite time for reflection, that doesn’t guarantee a good outcome. Some group of humans would still be in control (since it is not possible to build a PAAI that prevents them from aiming at a bad alignment target without locking in important decisions). That group might still find itself in a time crunch due to internal power struggles and other internal dynamics. In this case, the humans might decide to take a calculated risk and aim at the best alignment target they know of (which, at the current level of ATA progress, would be exceptionally dangerous).

However, this group of humans might be open to clear explanations of why their favorite alignment target contains a flaw that would lead to a catastrophic outcome. An argument of the form “the alignment target that you are advocating for would have led to this specific horrific outcome, for these specific reasons” might be enough to make part of a shrinking majority hesitate, even if they would strongly prefer that all important decisions are finalized before they lose power. First, however, the field of ATA would need to advance to the point where it is possible to notice the problem in question.

Why we don’t think human augmentation and AI assistance obsolete doing ATA now

Some people might argue that we can defer ATA to the future not because we will have virtually unlimited calendar time but because we will have augmented humans or good AI assistants that will allow us to do ATA much more effectively in the future. This might not buy us much time in calendar months but a lot of time in subjective months to work on ATA.

Why we don’t think the idea of augmenting humans obsoletes doing ATA now

If one is able to somehow create smarter augmented humans, then it is possible that everything works out even without any non-augmented human ever making any ATA progress at all. In order to conclude that this idea obsoletes doing ATA now, however, one needs to make a lot of assumptions. It is not sufficient to assume that a project will succeed in creating augmented humans that are both very smart and well intentioned.

For example, the augmented humans might be very good at figuring out how to hit a specified alignment target while not being very good at ATA, since these are two different skills. One issue is that making people better at hitting alignment targets might simply be much easier than making them better at ATA. A distinct issue is that (regardless of relative difficulty levels) the first project that succeeds at creating augments that are good at hitting alignment targets might not have spent much effort on ensuring that these augments are also good at ATA. In other words: augmented humans might not be good at ATA simply because the first successful project never even tried to select for this.

It is important to note that ATA can still help prepare us for scenarios with augmented humans who aren’t better than non-augmented humans at ATA, even if it does not result in any good alignment target. To be useful, ATA only needs to find the flaw in alignment targets (before the augmented humans respond to some time crunch by taking the calculated risk of launching a Sovereign AI aimed at this alignment target). If the flaw is found in time, then the augmented humans would have no choice but to keep trying different augmentation methods until this process results in some mind that is able to make genuine progress on ATA (because they do not have access to any nice-seeming alignment targets).

Accidental value lock-in vs. competence tension for AI assistants

When it comes to deferring to future AI assistants, we have additional issues to consider: We want a relatively weak, controllable AI assistant that can help a lot with ATA, and we don’t want this AI to effectively lock in a set of choices. However, there is a tension. The more helpful an AI system is with ATA, the greater the risk that some values have already been accidentally locked in.

Consider an AI that is just trying to help us achieve “what we want to achieve”. Once we give it larger and larger tasks, the AI has to do a lot of interpretation to understand what that means. For an AI to be able to help us achieve “what we want to achieve”, and prevent us from deviating from this, it must have a definition of what that means. Finding a good definition of “what we want to achieve” likely requires value judgments that we don’t want to hand over to AIs. If the system has a definition of “what we want to achieve”, then some choices are effectively already made.

To illustrate: For “help us achieve what we want to achieve” to mean something, one must specify how to deal with disagreements amongst individuals that disagree on how to deal with disagreements. Without specifying this, one cannot refer to “we”. There are many different ways of dealing with such disagreements, and they imply importantly different outcomes. One example of how one can deal with such disagreements is the negotiation rules of PCEV, mentioned above. In other words: if an AI does not know what is meant by “what we want to achieve”, then it will have difficulties helping us solve ATA. But if it does know what “what we want to achieve” means, then important choices have already been made. And if the choice had been made to use the PCEV way of dealing with disagreements, then we would have locked in everything that is implied by this choice. This includes locking in the fact that individuals who intrinsically value hurting other individuals will have far more power over the AI than individuals that do not have such values.
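To make the point concrete, here is a minimal toy sketch (all utility numbers are invented for illustration, and none of these rules is anyone’s actual proposal): three standard ways of aggregating individual preferences, applied to the same profile, each select a different outcome. Whichever aggregation rule a definition of “what we want to achieve” embodies, a substantive value-laden choice has already been made.

```python
# Toy illustration: three aggregation rules, one shared utility profile,
# three different winners. All numbers are invented for this sketch.
utilities = {
    # outcome: (utility for agent 1, agent 2, agent 3)
    "A": (12, 1, 2),
    "B": (6, 6, 2),
    "C": (4, 4, 4),
}

def utilitarian(u):
    # maximize the sum of utilities
    return sum(u)

def nash_product(u):
    # maximize the product of utilities (Nash-bargaining flavour)
    result = 1
    for x in u:
        result *= x
    return result

def egalitarian(u):
    # maximize the utility of the worst-off agent
    return min(u)

for name, rule in [("utilitarian", utilitarian),
                   ("nash product", nash_product),
                   ("egalitarian", egalitarian)]:
    winner = max(utilities, key=lambda o: rule(utilities[o]))
    print(f"{name}: {winner}")
# utilitarian picks A, the Nash product picks B, egalitarian picks C
```

The disagreement between the rules is the point: picking one of them (or PCEV’s negotiation rules, or anything else) is not a neutral implementation detail.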

If we consider scenarios with less powerful AI that just don’t have any lock-in risk by default, then they might not be able to provide substantial help with ATA: They currently seem better at tasks that have lots of data, short horizons, and aren’t very conceptual. These things don’t seem to apply to ATA.

None of this is meant to suggest that AI assistants cannot help with ATA. It is entirely possible that some form of carefully constructed AI assistant will speed up ATA progress to some degree, without locking in any choices (one could for example build an assistant that has a unique perspective but is not smarter than humans. Such an AI might provide meaningful help with conceptual work, without its definitions automatically dominating the outcome). But even if this does happen, it is unlikely to speed you up enough to obsolete current work.

Spuriously opinionated AI assistants

AI systems might also just genuinely be somewhat opinionated for reasons that are not related to anyone making a carefully considered tradeoff. If the AI is opinionated in an unintended way and its opinions matter for what humans choose to do, we run the risk of already having accidentally chosen some alignment target by the time we design the helpful, controllable AI assistant. We just don’t know what alignment target we have chosen.

If we look at current AI systems, this scenario seems fairly plausible. Current AIs aren’t actually trained purely for “what the user wants” but instead are directly trained to comply with certain moral ideas. It seems very plausible that these moral ideas (alongside whatever random default views the AI has about, say, epistemics) will make quite a difference for ATA. It also seems plausible that current AIs are already quite influential on people’s attitudes and will increasingly become so. This problem exists even if careful efforts are directed towards avoiding it.

Will we actually have purely corrigible AI assistants?

There exists a third issue, separate from both issues mentioned above: even if some people do plan to take great care when building AI assistants, there is no guarantee that such people will be the first to succeed. It does not seem to us that everyone is in fact paying careful attention to what kinds of values and personalities we are currently training into our AIs. As a fourth separate issue, despite all the talk about corrigibility and intent alignment, it doesn’t seem obvious at all that most current AI safety efforts differentially push towards worlds where AIs are obedient, controllable, etc., as opposed to having specific contentful properties.

Relationship between ATA and other disciplines

There are many disciplines that seem relevant to ATA such as: voting theory, moral uncertainty, axiology, bargaining, political science, moral philosophy, etc. Studying solutions in these fields is an important part of ATA work. But it is necessary to remember that the lessons learned by studying these different contexts might not be valid in the AI context. Since concepts can behave in new ways in the AI context, studying these other fields cannot replace ATA. This implies that in order to build up good intuitions about how various concepts will behave in the AI context, it will be necessary to actually explore these concepts in the AI context. In other words: it will be necessary to do ATA. This is another reason for thinking that the current lack of any serious research effort dedicated to ATA is problematic.

Let’s illustrate the problem of transferring proposals from different contexts to AI with PCEV as an example. The problem that PCEV suffers from as an alignment target is not an issue in the original proposal. The original proposal made by Bostrom is a mapping from a set of weighted ethical theories and a situation to a set of actions (that an individual can use to find a set of actions that can be given the label “morally permissible”). It is unlikely that a given person will put credence in a set of ethical theories that specifically refer to each other and specifically demand that other theories must be hurt as much as possible. In other words: ethical theories that want to hurt other theories do get a negotiation advantage in the original proposal, but this advantage is not a problem in the original context.

In a population of billions however, some individuals will want to hurt other individuals. So here the negotiation advantage is a very big problem. One can describe this as the concept behaving very differently when it is transferred to the AI context. There is nothing particularly unusual about this. It is fairly common for ideas to stop working when they are used in a completely novel context. But it is still worth making this explicit, and important to keep this in mind when thinking about alignment target proposals that were originally designed for a different context. Because there are many aspects of the AI context that are quite unusual.

To illustrate this with another example, this time with a concept from ordinary politics transferred to the AI context: let’s write Condorcet AI (CAI) for any AI that picks outcomes using a rule that conforms to the Condorcet Criterion or Garrabrant’s Lottery Condorcet Criterion. If a barely caring 51% solid majority (who agree about everything) would sort of prefer that a 49% minority be hurt as much as possible, then any CAI will hurt the 49% minority as much as it can. (It follows directly from the two linked definitions that a 51% solid majority always gets their highest ranked option implemented without compromise.) Ordinary politics does have issues with minorities being oppressed. But in ordinary politics there does not exist any entity that can suddenly start massively oppressing a 49% minority without any risk or cost. And without extrapolation, solid majorities are a lot less important as a concept. Therefore, ordinary politics does not really contain anything corresponding to the above scenario. In other words: the Condorcet Criterion behaves differently when it is transferred to the AI context.
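The solid-majority behavior can be checked directly with a small sketch (a hypothetical electorate and invented option names, not a model of any real proposal): once a 51% bloc ranks the same option first, that option beats every alternative in pairwise strict-majority comparisons, so any Condorcet-consistent rule must select it, no matter how weakly the majority prefers it or how strongly the minority objects.

```python
def condorcet_winner(ballots):
    """Return the option that beats every other option in pairwise
    strict-majority comparisons, or None if no such option exists."""
    options = list(ballots[0])
    n = len(ballots)
    for cand in options:
        if all(
            # count ballots ranking cand above other; require a strict majority
            sum(b.index(cand) < b.index(other) for b in ballots) * 2 > n
            for other in options if other != cand
        ):
            return cand
    return None

# Hypothetical electorate of 100: a barely-caring 51% solid majority
# ranks "harm_minority" first; the 49% minority ranks it last.
majority = ["harm_minority", "compromise", "protect_minority"]
minority = ["protect_minority", "compromise", "harm_minority"]
ballots = [majority] * 51 + [minority] * 49

print(condorcet_winner(ballots))  # -> harm_minority
```

The compromise option never even enters the picture: with a solid majority there is a Condorcet winner by construction, and every Condorcet-consistent rule is required to output it.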

Alignment Target Analysis is tractable

We’ve argued that alignment to the wrong alignment target can be both catastrophic and non-obvious. We also argued that people might need to, want to, or simply will align their AIs to a specific target relatively soon, without sufficient help from AI assistants or the ability to stall for time. This makes ATA time-sensitive and important. It is also tractable. One way that such research could move forward would be to iterate through alignment targets in the usual scientific way: Propose them. Wait until someone finds a critical flaw. Propose an adjustment. Wait until someone finds a critical flaw in the new state of the art. Repeat. Hopefully, this will help us identify necessary features of good alignment targets. While it seems really hard to tell whether an alignment target is good, this helps us at least tell when an alignment target is bad. And noticing that a bad alignment target is in fact bad reduces the danger of it being implemented.

A more ambitious branch of ATA could try to find a good alignment target instead of purely analyzing existing proposals. Coming up with a good alignment target and showing that it is good seems much, much harder than finding flaws in existing proposals. However, the example with PCEV showed that it is possible to reduce these dangers without finding any good alignment target. In other words: an ATA project does not have to attempt to find a solution to be valuable, because it can still reduce the probability of worst-case outcomes.

It is also true in general that looking ahead, and seeing what is waiting for us down the road, might be useful in hard-to-predict ways. It all depends on what one finds. Perhaps, to the extent that waiting for humanity to become more capable before committing to an alignment target, or stalling for time, are possible (just not guaranteed), ATA can help motivate doing so. It’s possible that, after some amount of ATA, we will conclude that humans, as we currently exist, should never try to align an AI to an alignment target we came up with. In such a scenario we might have no choice but to hope that enhanced humans will be able to handle this (even though there is no guarantee that enhancing the ability to hit an alignment target will reliably enhance the ability to analyze alignment targets).

Limitations of the present post, and possible ways forwards

There is a limit to how much can be achieved by arguing against a wide range of unstated arguments, implicit in the non-existence of any current ATA research project. Many people both consider it possible that a powerful autonomous AI will exist at some point and also think that it matters what goal such an AI would have. So the common implicit position that ATA is not needed now must rest on positive argument(s). These arguments will be different for different people, and it is difficult to counter all possible arguments in a single post. Each such argument is best treated separately (for example along the lines of these three posts, which each deal with a specific class of arguments).

The status quo is that not much ATA is being done, so we made a positive case for it. However, the situation to us looks as follows:

  1. We will need an alignment target eventually,

  2. Alignment targets that intuitively sound good might be extremely bad, maybe worse than extinction bad, in ways that aren’t obvious.

This seems like a very bad and dangerous situation to be in. To argue that we should stay in this situation, without at least making a serious effort to improve things, requires a positive argument. In our opinion, the current discourse arguing for focusing exclusively on corrigibility, intent alignment, human augmentation, and buying time (because they help with ATA in the long run) does not succeed at providing such an argument. Concluding that some specific idea should be pursued does not imply that the idea in question obsoletes doing ATA now. But the position that doing ATA now is not needed is sort of implicit in the current lack of any research project dedicated to ATA. ATA seems extremely neglected, with, as far as we can tell, 0 people working on it full time.

We conclude this post by urging people who feel confident that doing ATA now is not needed to make an explicit case for this. The fact that there currently does not exist any research project dedicated to ATA indicates that plenty of people consider this state of affairs reasonable (probably for a variety of different reasons). Hopefully the present text will lead to the various arguments in favor of this position, which people find convincing, being made explicit and public. A natural next step would then be to engage with those arguments individually.

Acknowledgements

We would like to thank Max Dalton, Oscar Delaney, Rose Hadshar, John Halstead, William MacAskill, Fin Moorhouse, Alejandro Ortega, Johanna Salu, Carl Shulman, Bruce Tsai, and Lizka Vaintrob, for helpful comments on an earlier draft of this post. This does not imply endorsement.

Crossposted from LessWrong