DeepMind: Model evaluation for extreme risks

Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify dangerous capabilities (through “dangerous capability evaluations”) and the propensity of models to apply their capabilities for harm (through “alignment evaluations”). These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.

This is the first great public writeup on model evals for averting existential catastrophe. I think it’s likely that, if AI doesn’t kill everyone, developing great model evals and getting everyone to use them will be a big part of why. So I’m excited about this paper both for helping AI safety people learn about and think more clearly about model evals, and for getting us closer to common knowledge that responsible labs should use model evals and responsible authorities should require them (by communicating model evals more widely, in a serious/legible manner).

Non-DeepMind authors include Jade Leung (OpenAI governance lead), Daniel Kokotajlo (OpenAI governance), Jack Clark (Anthropic cofounder), Paul Christiano, and Yoshua Bengio.

See also DeepMind’s related blogpost.

For more on model evals for AI governance, see ARC Evals, including Beth’s EAG talk Safety evaluations and standards for AI, and the blogpost Update on ARC’s recent eval efforts (LW).

Crossposted from LessWrong.