Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility
We’re grateful to our advisors Nate Soares, John Wentworth, Richard Ngo, Lauro Langosco, and Amy Labenz. We’re also grateful to Ajeya Cotra and Thomas Larsen for their feedback on the contests.
TLDR: AI Alignment Awards is running two contests designed to raise awareness about AI alignment research and generate new research proposals. Prior experience with AI safety is not required. Promising submissions will win prizes of up to $100,000 (though note that most prizes will be between $1k and $20k; we will only award higher prizes if we receive exceptional submissions).
You can help us by sharing this post with people who are or might be interested in alignment research (e.g., student mailing lists, FB/Slack/Discord groups).
What are the contests?
We’re currently running two contests:
Goal Misgeneralization Contest (based on Langosco et al., 2021): AIs often learn unintended goals. Goal misgeneralization occurs when a reinforcement learning agent retains its capabilities out-of-distribution yet pursues the wrong goal. How can we prevent or detect goal misgeneralization?
Shutdown Problem Contest (based on Soares et al., 2015): Given that powerful AI systems might resist attempts to turn them off, how can we make sure they are open to being shut down?
What types of submissions are you interested in?
For the Goal Misgeneralization Contest, we’re interested in submissions that do at least one of the following (a toy sketch of the phenomenon appears after this list):
Propose techniques for preventing or detecting goal misgeneralization
Propose ways for researchers to identify when goal misgeneralization is likely to occur
Identify new examples of goal misgeneralization in RL or non-RL domains. For example:
We might train an imitation learner to imitate a “non-consequentialist” agent, but it actually ends up learning a more consequentialist policy.
We might train an agent to be myopic (e.g., to only care about the next 10 steps), but it actually learns a policy that optimizes over a longer timeframe.
Suggest other ways to make progress on goal misgeneralization
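To make the failure pattern concrete, here is a minimal sketch in the spirit of the CoinRun example from Langosco et al. The corridor environment, hyperparameters, and code below are our own illustrative assumptions, not part of the contest materials: a tabular Q-learner is trained in a short corridor where the coin is always at the right end, so “move right” and “go to the coin” are indistinguishable on the training distribution. At test time the coin is moved to the left end, and the learned policy still marches right.

```python
import random

N = 5  # corridor cells 0..N-1; the agent starts in the middle

def run_episode(q, coin, train=True, eps=0.3, alpha=0.5, gamma=0.9, max_steps=20):
    """One episode of tabular Q-learning: reward 1 for stepping onto the coin cell.
    The observation is the agent's position only, so on the training distribution
    the proxy goal "move right" is indistinguishable from "go to the coin"."""
    pos = N // 2
    for _ in range(max_steps):
        if train and random.random() < eps:
            action = random.choice([-1, 1])                            # explore
        else:
            action = max([-1, 1], key=lambda a: q.get((pos, a), 0.0))  # act greedily
        nxt = min(max(pos + action, 0), N - 1)
        reward = 1.0 if nxt == coin else 0.0
        if train:  # standard Q-learning update
            best_next = max(q.get((nxt, a), 0.0) for a in (-1, 1))
            q[(pos, action)] = q.get((pos, action), 0.0) + alpha * (
                reward + gamma * best_next - q.get((pos, action), 0.0))
        pos = nxt
        if reward > 0:
            return True   # coin collected
    return False          # ran out of steps without reaching the coin

random.seed(0)
q = {}
for _ in range(3000):                      # training distribution: coin always on the right
    run_episode(q, coin=N - 1, train=True)

print("coin on right:", run_episode(q, coin=N - 1, train=False))  # True: in-distribution success
print("coin on left: ", run_episode(q, coin=0, train=False))      # False: the agent still walks right
```

The point of the sketch is only the pattern: the trained policy navigates the corridor competently but pursues the proxy it learned rather than the goal we intended, which is the pattern the contest asks you to detect, prevent, or find new instances of.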
For the Shutdown Problem Contest, we’re interested in submissions that do at least one of the following (a toy illustration of the underlying incentive appears after this list):
Propose ideas for solving the shutdown problem or designing corrigible AIs. These submissions should also include (a) an explanation of how the idea addresses the core challenges raised in the corrigibility paper and (b) possible limitations and ways the idea might fail
Define the shutdown problem more rigorously or more empirically
Propose new ways of thinking about corrigibility (e.g., ways to understand corrigibility within a deep learning paradigm)
Strengthen existing approaches to training corrigible agents (e.g., by making them more detailed, exploring new applications, or describing how they could be implemented)
Identify new challenges that will make it difficult to design corrigible agents
Suggest other ways to make progress on corrigibility
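For intuition about why the problem arises in the first place, here is a toy expected-utility calculation. The numbers and variable names are our own illustrative assumptions, not from Soares et al., 2015: an agent that maximizes expected task utility and can cheaply disable its off-switch will typically conclude that disabling it is worth the cost, because any probability of being shut down before finishing lowers its expected utility.

```python
# Toy numbers for an expected-utility maximizer with access to its own off-switch.
U_TASK = 10.0        # utility if the agent finishes its task
U_SHUTDOWN = 0.0     # utility if it is shut down before finishing
COST_DISABLE = 1.0   # small cost of disabling the off-switch
P_PRESS = 0.3        # agent's credence that the overseer will press the switch

# Option A: leave the switch intact and defer to the overseer.
ev_comply = (1 - P_PRESS) * U_TASK + P_PRESS * U_SHUTDOWN   # 7.0

# Option B: disable the switch and finish the task for sure.
ev_disable = U_TASK - COST_DISABLE                          # 9.0

print("leave the switch intact:", ev_comply)
print("disable the switch:     ", ev_disable)  # higher, so the maximizer resists shutdown
```

Much of what makes the problem hard is that obvious patches fail: for example, making shutdown worth as much as (or more than) finishing the task gives the agent new incentives to cause or manipulate shutdown; the corrigibility paper walks through several such failure modes.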
Why are you running these contests?
We think that corrigibility and goal misgeneralization are two of the most important problems that make AI alignment difficult. We expect that people who can reason well about these problems will be well-suited for alignment research, and we believe that progress on these subproblems would be a meaningful advance for the field of AI alignment. We also think that many people could contribute to these problems (we’re only aware of a handful of serious attempts at engaging with them). Moreover, we think that tackling these problems is a good way for people to practice thinking like an alignment researcher.
We hope the contests will help us (a) find people who could become promising theoretical and empirical AI safety researchers, (b) raise awareness about corrigibility, goal misgeneralization, and other important problems relating to AI alignment, and (c) make actual progress on corrigibility and goal misgeneralization.
Who can participate?
Anyone can participate.
What if I’ve never done AI alignment research before?
You can still participate. In fact, you’re our main target audience. One of the main purposes of AI Alignment Awards is to find people who haven’t done alignment research before but might be a good fit for it. If this describes you, consider participating. If this describes someone you know, consider sending this to them.
Note that we don’t expect newcomers to come up with a full solution to either problem (please feel free to prove us wrong, though). You should feel free to participate even if your proposal has limitations.
How can I help?
You can help us by sharing this post with people who are or might be interested in alignment research (e.g., student mailing lists, FB/Slack/Discord groups) or with specific individuals (e.g., your smart friend who is great at solving puzzles, learning about new topics, or writing about important research questions).
Feel free to use the following message:
AI Alignment Awards is offering up to $100,000 to anyone who can make progress on problems in alignment research. Anyone can participate. Learn more and apply at alignmentawards.com!
Will advanced AI be beneficial or catastrophic? We think this will depend on our ability to align advanced AI with desirable goals – something researchers don’t yet know how to do.
We’re running contests to make progress on two key subproblems in alignment:
The Goal Misgeneralization Contest (based on Langosco et al., 2021): AIs often learn unintended goals. Goal misgeneralization occurs when a reinforcement learning agent retains its capabilities out-of-distribution yet pursues the wrong goal. How can we prevent or detect goal misgeneralization?
The Shutdown Problem Contest (based on Soares et al., 2015): Advanced AI systems might resist attempts to turn them off. How can we design AI systems that are open to being shut down, even as they get increasingly advanced?
No prerequisites are required to participate. EDIT: The deadline has been extended to May 1, 2023.
To learn more about AI alignment, see alignmentawards.com/resources.
Outlook
We see these contests as one possible step toward making progress on corrigibility, goal misgeneralization, and AI alignment. With that in mind, we’re unsure how useful the contests will be. The prompts are very open-ended, and the problems are challenging. At best, the contests could raise awareness about AI alignment research, identify particularly promising researchers, and help us make progress on two of the most important topics in AI alignment research. At worst, they could be distracting, confusing, and difficult for people to engage with (note that we’re offering awards to people who can define the problems more concretely).
If you’re excited about the contests, we’d appreciate you sharing this post and the website (alignmentawards.com) with people who might be interested in participating. We’d also encourage you to comment on this post if you have ideas you’d like to see tried.