A Pragmatic Vision for Interpretability
Executive Summary
The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability:
Trying to directly solve problems on the critical path to AGI going well[1]
Carefully choosing problems according to our comparative advantage
Measuring progress with empirical feedback on proxy tasks
We believe that, on the margin, more researchers who share our goals should take a pragmatic approach to interpretability, both in industry and academia, and we call on people to join us
Our proposed scope is broad and includes much non-mech interp work, but we see this as the natural approach for mech interp researchers to have impact
Specifically, we’ve found that the skills, tools and tastes of mech interp researchers transfer well to important and neglected problems outside “classic” mech interp
See our companion piece for more on which research areas and theories of change we think are promising
Why pivot now? We think that times have changed.
Models are far more capable, bringing new questions within empirical reach
We have been disappointed by the amount of progress made by ambitious mech interp work, from both us and others[2]
Most existing interpretability techniques struggle on today’s important behaviours, which typically involve large models, complex environments, agentic behaviour and long chains of thought
Problem: It is easy to do research that doesn’t make real progress.
Our approach: ground your work with a North Star—a meaningful stepping-stone goal towards AGI going well—and a proxy task—empirical feedback that stops you fooling yourself and that tracks progress toward the North Star.
“Proxy tasks” doesn’t mean boring benchmarks. Examples include: interpret the hidden goal of a model organism; stop emergent misalignment without changing training data; predict what prompt changes will stop an undesired behaviour.
We see two main approaches to research projects: focused projects (proxy task driven), and exploratory projects (curiosity-driven, proxy task validated)
Curiosity-driven work can be very effective, but can also get caught in rabbit holes. We recommend starting in a robustly useful setting, time-boxing your exploration[3], and finding a proxy task as a validation step[4]
We advocate method minimalism: start solving your proxy task with the simplest methods (e.g. prompting, steering, probing, reading chain-of-thought). Introduce complexity or design new methods only once baselines have failed (see the sketch at the end of this summary).
Read the full post here, and the companion piece on promising AGI-safety-relevant research directions here
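To make method minimalism concrete, here is a minimal sketch (ours, not from the post) of the kind of baseline to try first on a behaviour-detection proxy task: a plain logistic-regression probe on cached activations. The activations below are synthetic stand-ins with a planted signal; in practice you would cache residual-stream activations from the model under study.

```python
# Minimal "method minimalism" baseline: check whether a linear probe on
# cached activations already solves the proxy task (detecting an
# undesired behaviour) before reaching for heavier interpretability
# machinery. Data, dimensions, and the planted signal are synthetic
# illustrations, not anything from the post.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for residual-stream activations (n_examples, d_model) with
# binary labels for whether the behaviour occurred.
d_model = 512
benign = rng.normal(size=(500, d_model))
misbehaving = rng.normal(size=(500, d_model))
misbehaving[:, :8] += 1.5  # planted signal so the probe can succeed
X = np.concatenate([benign, misbehaving])
y = np.concatenate([np.zeros(500), np.ones(500)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Simplest baseline first: a logistic-regression probe.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"linear probe accuracy: {probe.score(X_test, y_test):.2f}")
```

If a probe like this succeeds on the real task, the proxy task may not need anything fancier; if it fails, that failure is the evidence that escalating to more complex methods is worth the cost.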