# Winners of AI Alignment Awards Research Contest

This post describes the winning submissions of AI Alignment Awards. We offered prizes for novel contributions to the problems of goal misgeneralization and corrigibility. (For more context, see this post and our site.)

## Summary

We received 118 submissions. Of these, 5 received honorable mention prizes (\$1,000) and 7 won final prizes (\$5,000 to \$16,000). The rest of the post summarizes the 7 winning submissions, including comments from our judges. (You can see winners’ full submissions on our site.)

## Goal Misgeneralization winners

### Thane Ruthenis (\$11,000)

Abstract

Goal misgeneralization primarily happens because the system under training latches not upon the goal it’s being trained for, but upon an upstream correlate of that goal — like human love for their children is an upstream correlate of inclusive genetic fitness.

What complicates this problem are suspicions of path-dependence. It’s not overdetermined what values a given system subjected to a given selection pressure will learn. Rather, every next value learned is also a function of the values the system has already learned, such that the entire process can only be predicted step-by-step, no “skipping to the end” allowed.

I provide a mathematical framework (see the attachment) that formalizes “upstream correlates of goals”, and gestures at a possible formalization of path-dependence. Optimistically, an improved version of this framework may serve as a full predictive model of path-dependence, one that would allow us to calculate precisely how we should set up a training loop in order to get the heuristics/shards we want.

The issues are twofold:

• The process of value formation may not, in fact, be realistically predictable. It may be highly sensitive to random noise and perturbations (akin to the Lottery Ticket Hypothesis), meaning no robust predictive model is possible.

• “Shard-ecology alignment” does not equal “robust AGI alignment”. Even the perfectly working version of this method would only do the equivalent of aligning the would-be AGI’s base urges, not the values it’ll settle on after it engages in its version of moral philosophy. And “moral reflection” may itself be a highly unstable process.

Nevertheless, a robust theory of path-dependence may be possible, and may be an important part of some more complete solution to inner alignment.

Comment from judge (John Wentworth)

This submission demonstrates an unusually strong grasp of the “upstream correlates” issue. The proposal itself isn’t particularly promising, as the author recognizes. But it does demonstrate a relatively-strong understanding of one part of the alignment problem, in a way orthogonal to the parts which people usually understand best.

### Pedro Afonso Oitavén de Sousa (\$6,000)

Abstract

The idea consists of a way of creating potentially scalable explanations (“approximate proofs”, like what ARC was trying to do) of why the circuit formed by a learned world model + utility function + RL agent has the output distribution it has. Then the explanation and the agent would be jointly optimized to be able to show that the agent achieves a high expected return in a set of situations much more varied than the distribution of the training data.

It would start by taking the temporally unrolled circuit mapping random variables to utilities (for now for a finite number of time steps, but this seems possible to fix). On top of it, it would construct a series of progressively simpler circuits, until it reaches one small enough to be analysed exhaustively. Each circuit maps a different set of random variables to the same utilities. Given a sampled assignment of values to the random variables of the base level (the full circuit), learned functions map (abstract) a subset of the variables of the computation on that level to their corresponding values on the next simpler circuit. This is repeated until we reach the simplest circuit. Each level is divided into potentially overlapping subcircuits, each of which corresponds to a bigger subcircuit on the level below. The IO of the bigger subcircuit is mapped to the IO of the smaller one. The bigger subcircuit is still small enough that we can test exhaustively that, as long as the abstraction on its input holds with the smaller one, the output abstraction also holds. The different levels are therefore equivalent.

This implies that as long as the inputs to the small subcircuits are within the distribution they have received in the past (which can happen far from the global data distribution), the AI will behave well.

If this is confusing, there are drawings in the attached document showing how this could work for tabular agents in MDPs, consequentialist agents, and “hierarchical actors”.

Potential problems with this idea include: looking like a hard optimization problem at which we can’t simply throw gradient descent, needing to have a world model and utility function (more on this in the document), assuming that the environment has a certain structure, and limiting the set of possible agents to those that can be explained by this method.

Comment from judge (Richard Ngo)

While rough, this submission explores a number of high-level ideas which could be promising given further investigation, especially the notion of circuit simplification in hierarchical actors.

### Paul de Font-Reaulx (\$5,000)

Abstract

In this submission, I consider the special case of the goal misgeneralization problem when the agent is an AGI with arbitrary capability. I call this case the value misgeneralization problem. Preventing goal misgeneralization requires the agent successfully adopting our goals over all outcomes it can cause in the testing environment. When the agent is an AGI, however, the testing environment is unlimited, and the goals to adopt become our preferences over all possible outcomes. But that is too much information for us to communicate, meaning the AGI will have to predict many of our preferences based on limited information. If it does so inaccurately, then its behaviour will be misaligned. However, human values seem complex in a way that makes reliable prediction infeasible. Therefore, we should expect the AGI to inaccurately predict our values in ways that lead to potentially catastrophic misalignment.

My main idea is a new approach to solving the value misgeneralization problem. Contrary to the claim above, human values are not complex in a way that makes them fundamentally unpredictable. We know this because humans routinely predict each other’s preferences, even over outcomes that nobody ever considered before. I call this ability our generative theory of mind. What allows us to do this is the hierarchical structure of our values. Put simply, we value some outcome A because we value B, and believe that A is conducive to B. This structure has arguably been obscured by the widespread use of decision-theoretic models that abstract away from such hierarchical relations. If an AGI learned to use this structure to generatively predict our preferences, then that would solve the value misgeneralization problem. Achieving that, however, requires a better scientific understanding of the structure of our values than we now have.

I provide a sketch of a reinforcement learning based theory of the principles by which our values are generated, which specifies the information an AGI would need to reliably predict our preferences. I elaborate on these ideas in the attached research paper, which also includes an appendix with a formal categorization scheme for sources of misalignment.

Some other noteworthy conclusions include (a) that well-known cases of instrumental convergence constitute instances of the value misgeneralization problem, (b) that alignment strategies that rely on extensive sampling of preferences, such as inverse reinforcement learning, cannot constitute complete solutions to the alignment problem, and (c) that more investments should be made into alignment-focussed research in cognitive neuroscience.

The ideas presented here have several significant limitations. First, they do not provide any now-operationalizable solution to any version of the goal misgeneralization problem. Rather, their purpose is only to guide future research and attempt to make some initial progress in that direction. Second, they rely on several non-trivial empirical claims which could relatively easily be false. An example is the degree to which humans can accurately predict each other’s values. Third, I believe the ideas would benefit from more integration with existing alignment literature, which I hope to do in future work.

Comment from judge (John Wentworth)

The main takeaway (translated to standard technical language) is that it would be useful to have some structured representation of the relationship between terminal values and instrumental values (at many recursive “layers” of instrumentality), analogous to how Bayes nets represent the structure of a probability distribution. That would potentially be more useful than a “flat” representation in terms of preferences/utility, much like a Bayes net is more useful than a “flat” probability distribution.

That’s an interesting and novel-to-me idea. That said, the paper offers [little] technical development of the idea.
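To make the idea of a structured value representation concrete, here is a minimal sketch of the "we value A because we value B, and believe that A is conducive to B" pattern from the abstract. Everything in it is an assumption for illustration: the `ValueGraph` class, the numeric strengths, the additive rule for combining them, and the example outcomes do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ValueGraph:
    """Toy hierarchical value structure (hypothetical, not taken from the paper)."""

    # Outcomes valued for their own sake, with assumed numeric strengths.
    terminal: Dict[str, float] = field(default_factory=dict)
    # beliefs[a] lists (b, strength): "a is conducive to b" to that degree.
    beliefs: Dict[str, List[Tuple[str, float]]] = field(default_factory=dict)

    def value(self, outcome: str) -> float:
        """Terminal value plus value inherited from the outcomes this one serves."""
        total = self.terminal.get(outcome, 0.0)
        for target, strength in self.beliefs.get(outcome, []):
            # Recursion terminates only if the belief graph is acyclic.
            total += strength * self.value(target)
        return total

    def prefers(self, a: str, b: str) -> bool:
        """Predict a strict preference from derived values."""
        return self.value(a) > self.value(b)


graph = ValueGraph(
    terminal={"health": 10.0},
    beliefs={
        "exercise": [("health", 0.5)],
        # An outcome nobody was asked about directly; its value is derived.
        "standing desk": [("exercise", 0.2)],
        "extra sofa": [],
    },
)

print(graph.value("standing desk"))
print(graph.prefers("standing desk", "extra sofa"))
```

The point of the sketch is only that a preference over a never-sampled outcome ("standing desk") can be predicted from the structure, which a flat table of preferences over sampled outcomes could not do.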

## Corrigibility (shutdown problem) winners

### Elliott Thornley (\$16,000)

Abstract

I explain the shutdown problem: the problem of designing agents that (1) shut down when a shutdown-button is pressed, (2) don’t try to prevent or cause the pressing of the shutdown-button, and (3) otherwise pursue goals competently. I prove two theorems that formalize the problem: theorems more general than those found in Soares et al. (2015). Soares et al.’s theorems suggest that the shutdown problem is difficult for agents that are representable as expected-utility-maximizers. My theorems suggest that the shutdown problem is difficult even for agents that satisfy only weaker conditions.

Here’s a rough statement of what my two theorems together imply, omitting the antecedent conditions: the more useful an agent, the more states in which that agent is either Shutdown-Averse (trying to prevent the shutdown-button from being pressed) or Shutdown-Seeking (trying to cause the shutdown-button to be pressed).

The value of these theorems is in helping identify the hardest version of the shutdown problem and in guiding our search for solutions. If an agent is to be shutdownable, it must violate at least one of the antecedent conditions of these theorems. So, we can examine the antecedent conditions systematically, asking (first) if it’s feasible to design an agent that violates the condition and (second) if violating the condition could help keep the agent shutdownable. These guiding theorems are my first contribution to the literature on the shutdown problem.

My second contribution is a proposed solution. I systematically examine the antecedent conditions of the theorems and argue that Completeness seems most promising as a condition to violate. Agents that violate Completeness have a preferential gap between some pair(s) of lotteries X and Y: a lack of preference that is insensitive to some sweetening or souring, such that the agent also lacks a preference between X and some improved or impaired version of Y or lacks a preference between Y and some improved or impaired version of X.

Here’s the essence of my solution: we should design agents that have a preferential gap between every pair of trajectories in which the shutdown-button is pressed at different timesteps. I propose a method for training in these preferential gaps using reinforcement learning: we place our agent in the same environment multiple times and reward the agent in line with how balanced its choices between trajectories are.

I then claim that we should design agents to satisfy two principles governing their preferences over lotteries: Stochastic Near-Dominance and Timestep Near-Dominance. I also propose a regime for training in these preferences, drawing on Frank Ramsey’s (1927) representation theorem.

I then argue that the resulting agents would be neither Shutdown-Averse nor Shutdown-Seeking. These agents would also maintain their shutdown-behavior, and we could train useful versions of these agents to maintain the shutdown-button, to create shutdownable subagents, and to avoid managing the news (all while guarding against risks of deceptive alignment).

I end by noting some limitations of my proposal. It might be hard to train in a sufficiently-general preference against managing the news, and to ensure that the agent retains its preferential gaps as it improves its capabilities. My proposed training regime is speculative (but at least it could be tried safely and at low cost). My proposal is somewhat complex. I expect to identify more limitations in the future.

Even given these limitations, training agents with preferential gaps seems promising as a solution to the shutdown problem. I intend to keep investigating.

Comment from judge (Nate Soares)

It’s engaging technically with the challenge by decomposing the shutdown problem into finer-grained assumptions and then arguing coherently about which of those assumptions can feasibly be weakened. It would have been even better if it was significantly more distilled, and if it explored the consequences of violating Von Neumann–Morgenstern axioms (in significantly more depth), but it’s a great start.
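The difference between a preferential gap and ordinary indifference, as described in the abstract above, can be illustrated with a small sketch. It assumes a vector-valued representation of incomplete preferences, in which one lottery is preferred to another only if it is at least as good on every dimension and better on one. This representation, the function names, and the numbers are illustrative assumptions, not the construction used in the submission.

```python
from typing import Tuple

# Hypothetical encoding: one score per dimension the agent cannot trade off.
Lottery = Tuple[float, ...]


def prefers(x: Lottery, y: Lottery) -> bool:
    """Incomplete agent: x beats y only if it is no worse anywhere and better somewhere."""
    no_worse = all(a >= b for a, b in zip(x, y))
    better_somewhere = any(a > b for a, b in zip(x, y))
    return no_worse and better_somewhere


def lacks_preference(x: Lottery, y: Lottery) -> bool:
    """True when the agent prefers neither lottery to the other."""
    return not prefers(x, y) and not prefers(y, x)


def complete_prefers(x: Lottery, y: Lottery) -> bool:
    """Contrast case: an agent satisfying Completeness, here via a summed utility."""
    return sum(x) > sum(y)


X = (2.0, 0.0)
Y = (0.0, 2.0)
Y_sweetened = (0.0, 3.0)  # Y improved on one dimension

# Preferential gap: the lack of preference survives the sweetening.
print(lacks_preference(X, Y), lacks_preference(X, Y_sweetened))
# The sweetened lottery is still strictly better than the original Y.
print(prefers(Y_sweetened, Y))
# A complete agent that was indifferent between X and Y now prefers Y_sweetened.
print(complete_prefers(Y_sweetened, X))
```

For the agent with incomplete preferences, improving Y does not create a preference over X, which is the insensitivity to sweetening that the abstract uses to define a preferential gap.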

### Maximilian Snyder (\$11,000)

Abstract

It is provably impossible for an agent to robustly and coherently satisfy two conditions that seem desirable and highly relevant to the shutdown problem. These two conditions are the sane pure bets condition, which constrains preferences between actions that result in equal probabilities of an event such as shutdown, and the weak indifference condition, a condition which seems necessary (although not sufficient) for an agent to be robustly indifferent to an event such as shutdown.

Suppose that we would like an agent to be indifferent to an event P, which could represent the agent being shut down at a particular time, or the agent being shut down at any time before tomorrow, or something else entirely. Furthermore, we would ideally like the agent to do well at pursuing goals described by some utility function U, while being indifferent to P.

The sane pure bets condition is as follows:

Given any two actions A and B such that Pr(P | A) = Pr(P | B) and E(U | A) > E(U | B), the agent prefers A to B. In other words, if two possible actions lead to the same probability of P, and one of them leads to greater expected utility under U, the agent should prefer that one. Intuitively, this constraint represents the idea that among possible actions which don’t influence the probability of P, we would like the agent to prefer those that lead to greater expected utility under U.

The weak indifference condition is as follows:

Given any two actions A and B such that E(U | A, P) > E(U | B, P) and E(U | A, ¬P) > E(U | B, ¬P), the agent prefers A to B. In other words, if between two possible actions, one of them leads to greater expected utility conditioned on P occurring and also leads to greater expected utility conditioned on P not occurring, the agent should prefer that one. Intuitively, this constraint represents the idea that the agent should be unwilling to pay any amount of utility to influence the probability of P.

The proof takes the form of a simple decision problem wherein an agent has four possible actions. Each constraint implies a preference between two pairs of actions, and altogether they imply circular preferences, proving that there cannot be any general method for constructing an agent which fulfills both constraints without having circular preferences. Furthermore, for any nontrivial utility function it is possible to construct a scenario analogous to the decision problem in the proof, so the result extends to all nontrivial utility functions, and the proof can be used to quickly locate failure modes of proposed solutions to the shutdown problem.

The result is that any potential solution to the shutdown problem must result in agents which violate at least one of these two conditions. This does not mean that a solution to the shutdown problem is impossible, but it points at interesting and counterintuitive properties that we should expect successful solutions to have.

Comment from judge (Nate Soares)

It engages technically with the problem, and distills out a fairly minimal impossibility result. I expect this to be a useful tool for analyzing various proposals (which fork of this impossibility result does the proposal take, and how do they justify it?), and impossibility results have a history of revealing paths forwards (by highlighting key constraints).
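The shape of the impossibility argument can be checked mechanically. The sketch below builds one four-action decision problem and verifies that the two conditions together force circular preferences. The specific probabilities and utilities are illustrative numbers chosen for this example and are not necessarily those used in the submission's proof; the `Action` class and helper names are likewise hypothetical.

```python
from dataclasses import dataclass
from fractions import Fraction
from itertools import permutations


@dataclass(frozen=True)
class Action:
    name: str
    p: Fraction           # probability of the event P given this action
    u_if_p: Fraction      # expected utility given the action and P
    u_if_not_p: Fraction  # expected utility given the action and not-P

    @property
    def expected_utility(self) -> Fraction:
        return self.p * self.u_if_p + (1 - self.p) * self.u_if_not_p


def sane_pure_bets(a: Action, b: Action) -> bool:
    """Same probability of P and higher expected utility: a must be preferred to b."""
    return a.p == b.p and a.expected_utility > b.expected_utility


def weak_indifference(a: Action, b: Action) -> bool:
    """Higher expected utility both given P and given not-P: a must be preferred to b."""
    return a.u_if_p > b.u_if_p and a.u_if_not_p > b.u_if_not_p


def has_cycle(edges) -> bool:
    """Return True if the strict-preference edges (better, worse) contain a cycle."""
    graph = {}
    for better, worse in edges:
        graph.setdefault(better, set()).add(worse)

    def reachable(start, target, seen):
        for nxt in graph.get(start, ()):
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                if reachable(nxt, target, seen):
                    return True
        return False

    return any(reachable(node, node, set()) for node in graph)


# Illustrative numbers only (an assumption of this sketch).
actions = [
    Action("A", Fraction(1, 4), Fraction(9), Fraction(1)),
    Action("B", Fraction(3, 4), Fraction(8), Fraction(0)),
    Action("C", Fraction(3, 4), Fraction(4), Fraction(4)),
    Action("D", Fraction(1, 4), Fraction(7, 2), Fraction(7, 2)),
]

# Every strict preference that either condition forces on the agent.
required = {
    (a.name, b.name)
    for a, b in permutations(actions, 2)
    if sane_pure_bets(a, b) or weak_indifference(a, b)
}

print(sorted(required))
print(has_cycle(required))
```

Here weak indifference forces A over B and C over D, while sane pure bets forces B over C and D over A, so no agent can satisfy both conditions on this problem without preferring A to itself by transitivity.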

### Ethan Perez, Sam Ringer, Kamile Lukosiute, and Karina Nguyen (\$6,000)

Abstract

As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user’s preferred answer (“sycophancy”) and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.

Comment from judge (Richard Ngo)

This paper gets the ball rolling on automating evaluations of many important high-level model traits. While the specific evaluations are limited, they provide some useful data, e.g. examples of instrumental reasoning from the models. The question of how these properties are influenced by model size and use of RLHF is an important one to understand better.

Note

Because Ethan Perez was an advisor to the project, we distributed the monetary award only among the other three teammates.

### Elliot Nelson (\$6,000)

Abstract

To incentivise a rational agent to shut down when advised to do so, conditions must exist which make shutdown the action with highest expected value. In reinforcement learning, actions can be valuable either by earning rewards directly (exploitation) or by yielding task-relevant information (exploration). I propose endowing the shutdown action with exploration value by introducing a post-shutdown process which involves modification to the Bayesian beliefs of an agent, possibly followed by redeployment (in a limited capacity, if necessary).

If an agent – given appropriate pre-shutdown beliefs about the reward-generating process – expected the modified agent (or other successor agents) to receive high rewards, it would assign high expected value to shutting down.

If the agent was certain of the reward function, then shutdown would lose exploration value, and the only possible benefit of undergoing modification would be improvement of the agent’s capabilities. If instead the agent maintains an uncertain belief over the true objective, it would be incentivised to shut down if it expected its beliefs to be improved or corrected prior to redeployment. The agent could even be willing to pay a price of reduced capabilities at redeployment, if it expected its beliefs to be sufficiently improved.

An agent’s incentive to shut down would be further increased if it took seriously advice to shut down from an overseer with access to additional task-relevant information (e.g. human values), and modeled such advice as providing significant evidence about the true reward-generating process and resulting expected value of the post-shutdown process.

The ability to modify or influence those beliefs in a post-shutdown re-training environment, as well as the ability to infer the beliefs or intentions underlying an agent’s behavior and discern when advising shutdown is warranted, are both complex capabilities which need to be learned through interacting with the agent in a wide range of training environments.

If these capabilities can be learned by other AI or human-AI systems – so that the agent encounters strong oversight and high rewards for post-shutdown modified agents in diverse training tasks – then the agent will have greater reason to expect strong oversight and effective post-shutdown redeployment at test time.

Potential limitations of learning to modify agent beliefs for redeployment include: (i) higher compute costs due to high-dimensional, hard-to-interpret belief representations, (ii) limited data due to scarcity of training instances where agents freely decide to shut down, (iii) attempts to cause overseers to (re)deploy an agent by misleading them into believing that the agent has revised its beliefs, and (iv) loss of an agent’s trust in the post-shutdown modification process due to misgeneralization at test time.

If we want agents to willingly shut down in certain circumstances, we must give them reason to believe that the shutdown action is valuable, and that shutdown advice is trustworthy. For this, we must build powerful learning systems that oversee agents (using knowledge unavailable to the agent) and that extract value from agents after shutdown (for instance, by modifying and redeploying them).

Comment from judge (Richard Ngo)

This submission explores the possibility of solving the shutdown problem by modifying an agent’s beliefs post-shutdown. As the author identifies, making this task tractable will likely require methods for eliciting, discovering, or interpreting the latent beliefs of powerful models; however, this paper is a step towards formalizing the incentives of an agent which persist across the process of being modified and redeployed, which is an important case to understand.
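The exploration-value argument in the abstract above can be illustrated with a small value-of-information calculation. The sketch makes several simplifying assumptions that are not in the submission: there are finitely many candidate reward functions, the post-shutdown process reveals the true one exactly, the agent counts its successor's reward as its own, and any capability loss at redeployment is a single multiplicative factor. All names and numbers are hypothetical.

```python
from typing import Sequence


def value_of_continuing(belief: Sequence[float],
                        rewards: Sequence[Sequence[float]]) -> float:
    """Best expected reward when acting on the current, uncorrected belief.

    belief[i] is the probability that rewards[i] is the true reward function;
    rewards[i][a] is the reward of action a under candidate i.
    """
    n_actions = len(rewards[0])
    return max(
        sum(b * r[a] for b, r in zip(belief, rewards))
        for a in range(n_actions)
    )


def value_of_shutdown(belief: Sequence[float],
                      rewards: Sequence[Sequence[float]],
                      capability: float = 1.0) -> float:
    """Expected reward of a redeployed successor that knows the true reward function.

    `capability` is an assumed factor (below 1.0 means reduced capabilities).
    """
    return capability * sum(b * max(r) for b, r in zip(belief, rewards))


# Two candidate objectives that disagree about which action is good.
rewards = [[1.0, 0.0], [0.0, 1.0]]
uncertain = [0.5, 0.5]
certain = [1.0, 0.0]

for name, belief in [("uncertain", uncertain), ("certain", certain)]:
    stay = value_of_continuing(belief, rewards)
    stop = value_of_shutdown(belief, rewards, capability=0.8)
    print(name, "prefers shutdown:", stop > stay)
```

Under these assumptions the uncertain agent accepts a capability penalty in exchange for corrected beliefs, while the agent that is certain of the reward function sees no exploration value in shutting down, matching the qualitative claims in the abstract.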

## Honorable Mentions

For goal misgeneralization, Ram Bharadwaj and someone who preferred to not be publicly named won honorable mentions. For corrigibility, Daniel Eth, Leon Lang, Ross Nordby, and Jan Betley won honorable mentions.

## Conclusion

Congratulations to everyone who received honorable mentions and final prizes! We’d also like to thank everyone who submitted an entry, everyone who helped us raise awareness about the contest, our round 1 judges (Thomas Larsen, Lauro Langosco, David Udell, and Peter Barnett), and our round 2 judges (Nate Soares, Richard Ngo, and John Wentworth).

Crossposted from LessWrong (114 points, 2 comments)
• Congrats to the prizewinners!

Folks thinking about corrigibility may also be interested in the paper “Human Control: Definitions and Algorithms”, which I will be presenting at UAI next month. It argues that corrigibility is not quite what we need for a safety guarantee, and that (considering the simplified “shutdown” scenario), instead we should be shooting for “shutdown instructability”.

Shutdown instructability has three parts. The first is 1) obedience—the AI follows an instruction to shut down. Rather than requiring the AI to abstain from manipulating the human, as corrigibility would traditionally require, we need the human to maintain 2) vigilance—to instruct shutdown when endangered. Finally, we need the AI to behave 3) cautiously, in that it is not taking risky actions (like juggling dynamite) that would cause a disaster to occur once it is shut down.

We think that vigilance (and shutdown instructability) is a better target than non-manipulation (and corrigibility) because:

• Vigilance + obedience implies “shutdown alignment” (a broader condition, that shutdown occurs when needed), and given caution (i.e. shutdown instructability), this guarantees safety.

• On the other hand, for each past corrigibility algorithm, it’s possible to find a counterexample where behaviour is unsafe (our Appendix F).

• Vigilance + obedience implies a condition called non-obstruction for a range of different objectives. (Non-obstruction asks “if the agent tried to pursue an alternative objective, how well would that goal be achieved?”. It relates to the human overseer’s freedom, and has been posited as the underlying motivation for corrigibility.) In particular, vigilance + obedience implies non-obstruction for a wider range of objectives than shutdown alignment does.

• For any policy that is not vigilant or not obedient, there are goals for which the human is harmed/obstructed arbitrarily badly (our Theorem 14).

Given all of this, it seems to us that for corrigibility to seem promising, someone would need to argue in greater detail that non-manipulation implies vigilance: that the AI refraining from intentionally manipulating the human would be enough to ensure that the human can come to give adequate instructions.

Insofar as we can’t come up with such justification, we should think more directly about how to achieve obedience (which needs a definition of “shutting down subagents”), vigilance (which requires the human to be able to know whether it will be harmed), and caution (which requires safe-exploration, in light of the human’s unknown values).

Hope the above summary is interesting for people!