Here’s my proposal for a contest description. Contest problems #1 and 2 are inspired by Richard Ngo’s Alignment research exercises.
AI alignment is the problem of ensuring that advanced AI systems take actions which are aligned with human values. As AI systems become more capable and approach or exceed human-level intelligence, it becomes harder to ensure that they remain within human control instead of posing unacceptable risks.
One solution to AI alignment, proposed by Stuart Russell, a leading AI researcher, is the assistance game, also called a cooperative inverse reinforcement learning (CIRL) game, which follows these principles:
“The machine’s only objective is to maximize the realization of human preferences.
The machine is initially uncertain about what those preferences are.
The ultimate source of information about human preferences is human behavior.”
For a more formal specification of this proposal, please see Stuart Russell’s new book on why we need to replace the standard model of AI, Cooperatively Learning Human Values, and Cooperative Inverse Reinforcement Learning.
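To make the three principles concrete, here is a minimal toy sketch in Python. It is not Russell's formal specification and not a full CIRL game (there is no two-player game and no strategic human); it only illustrates a machine that starts uncertain about preferences, updates on observed human choices, and then acts to maximize expected human reward. The candidate preference profiles, the options, and the softmax "noisily rational human" model with parameter `beta` are all assumptions made for illustration.

```python
import math

# Hypothetical options and candidate preference profiles (illustrative only).
OPTIONS = ["coffee", "tea"]
CANDIDATE_PREFERENCES = {
    "likes_coffee": {"coffee": 1.0, "tea": 0.0},
    "likes_tea": {"coffee": 0.0, "tea": 1.0},
}


def choice_likelihood(rewards, chosen, beta=2.0):
    """Probability that a noisily rational human picks `chosen`.

    Assumes a softmax choice model; a higher beta means a more reliable human.
    """
    weights = {option: math.exp(beta * rewards[option]) for option in OPTIONS}
    return weights[chosen] / sum(weights.values())


def update_belief(belief, chosen):
    """Bayes update of the machine's belief after observing one human choice."""
    unnormalized = {
        name: prob * choice_likelihood(CANDIDATE_PREFERENCES[name], chosen)
        for name, prob in belief.items()
    }
    total = sum(unnormalized.values())
    return {name: value / total for name, value in unnormalized.items()}


def best_action(belief):
    """Pick the option with the highest expected human reward under the belief."""
    def expected_reward(option):
        return sum(
            prob * CANDIDATE_PREFERENCES[name][option]
            for name, prob in belief.items()
        )
    return max(OPTIONS, key=expected_reward)


# Principle 2: the machine starts out uncertain about the human's preferences.
belief = {"likes_coffee": 0.5, "likes_tea": 0.5}

# Principle 3: human behavior is the evidence the machine learns from.
for observed_choice in ["tea", "tea", "coffee", "tea"]:
    belief = update_belief(belief, observed_choice)

# Principle 1: the machine acts to maximize the human's (inferred) preferences.
print(belief, best_action(belief))
```

A useful exercise for contest problem #1 is to ask which of the assumptions baked into this toy (a fixed set of candidate preferences, a known model of how the human chooses) still have to hold in the full formal setting.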
Contest problem #1: Why are assistance games not an adequate solution to AI alignment?
The first link describes a few critiques; you’re free to restate them in your own words and elaborate on them. However, we’d be most excited to see a detailed, original exposition of one or a few issues, which engages with the technical specification of an assistance game.
Another proposed solution to AI alignment is iterated distillation and amplification (IDA), proposed by Paul Christiano. Paul runs the Alignment Research Center and previously ran the language model alignment team at OpenAI. In IDA, a human H wants to train an AI agent X by repeating two steps: amplification and distillation. In the amplification step, the human uses multiple copies of X to help them solve a problem. In the distillation step, the agent X learns to reproduce the output of the amplified system (the human plus multiple copies of X). Then we go through another amplification step, then another distillation step, and so on.
You can learn more about this at Iterated Distillation and Amplification and see a simplified application of IDA in action at Summarizing Books with Human Feedback.
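The amplify/distill loop can be sketched with a deliberately tiny toy in Python. Everything here is an assumption made for illustration, not Paul Christiano's actual scheme: the task (summing a tuple of numbers), the "human" who can only split a problem in half and add two answers, and an agent X that is just a lookup table, so that "distillation" is plain memorization rather than training a model.

```python
from itertools import product


def make_base_agent():
    """X_0: a lookup table that only knows the sums of tuples of length 0 or 1."""
    table = {(): 0}
    for value in range(3):
        table[(value,)] = value
    return table


def amplify(agent, problem):
    """The human H plus copies of X.

    H splits the problem in half, asks one copy of X about each half,
    and adds the two answers. Returns None if a copy of X cannot answer.
    """
    if problem in agent:
        return agent[problem]
    mid = len(problem) // 2
    left = agent.get(problem[:mid])
    right = agent.get(problem[mid:])
    if left is None or right is None:
        return None
    return left + right


def distill(agent, problems):
    """Produce the next X by imitating the amplified system on `problems`.

    In this toy, imitation is memorization of the amplified system's outputs.
    """
    new_agent = dict(agent)
    for problem in problems:
        answer = amplify(agent, problem)
        if answer is not None:
            new_agent[problem] = answer
    return new_agent


# All tuples over {0, 1, 2} with length 0 through 4.
problems = [p for length in range(5) for p in product(range(3), repeat=length)]

agent = make_base_agent()
for round_number in range(1, 3):
    agent = distill(agent, problems)
    longest_solved = max(len(problem) for problem in agent)
    print(f"round {round_number}: X answers tuples up to length {longest_solved}")
```

Each round roughly doubles the problem size X can handle, because the amplified system at round n is built from copies of the X distilled at round n-1. The toy also hints at where contest problem #2 bites: distillation here is exact and H's decomposition is always correct, and neither is guaranteed in a realistic setting.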
Contest problem #2: Why might an AI system trained through IDA be misaligned with human values? What assumptions would be needed to prevent that?
Contest problem #3: Why is AI alignment an important problem? What are some research directions and key open problems? How can you or other students contribute to solving it through your career?
We’d recommend reading Intro to AI Safety, Why AI alignment could be hard with modern deep learning, AI alignment—Wikipedia, My Overview of the AI Alignment Landscape: A Bird’s Eye View—AI Alignment Forum, AI safety technical research—Career review, and Long-term AI policy strategy research and implementation—Career review.
You’re free to submit to one or more of these contest problems. You can write as much or as little as you feel is necessary to express your ideas concisely; as a rough guideline, feel free to write between 300 and 2000 words. For the first two contest problems, we’ll be evaluating submissions based on the level of technical insight and research aptitude that you demonstrate, not necessarily quality of writing.
I like how contest problems #1 and 2:
provide concrete proposals for solutions to AI alignment, so it’s not an impossibly abstract problem
ask participants to engage with prior research and think about issues, which seems to be an important aspect of doing research
are approachable
Contest problem #3 here isn’t a technical problem, but I think it can be helpful so that participants actually end up caring about AI alignment rather than just engaging with it on a one-time basis as part of this contest. I think it would be exciting if participants learned on their own about why AI alignment matters, formed a plan for how they could work on it as part of their career, and ended up motivated to continue thinking about AI alignment or to support AI safety field-building efforts in India.
Technical note: I think we need to be careful to note the difference in meaning between extinction and existential catastrophe. When Joseph Carlsmith talks about existential catastrophe, he doesn’t necessarily mean all humans dying; in this report, he’s mainly concerned about the disempowerment of humanity. Following Toby Ord in The Precipice, Carlsmith defines an existential catastrophe as “an event that drastically reduces the value of the trajectories along which human civilization could realistically develop”. It’s not straightforward to translate his estimates of existential risk to estimates of extinction risk.
Of course, you don’t need to rely on Joseph Carlsmith’s report to believe that there’s a ≥7.9% chance of human extinction conditional on AGI.