I’m interviewing Jan Leike for The 80,000 Hours Podcast (personal website, Twitter).
He’s been Head of Alignment at OpenAI and is now leading their Superalignment team which will aim to figure out “how to steer & control AI systems much smarter than us” — and do it in under 4 years!
They’ve been given 20% of the compute OpenAI has secured so far in order to work on it. Read the official announcement about it or Jan Leike’s Twitter thread.
What should I ask him?
(P.S. Here’s Jan’s first appearance on the show back in 2018.)
How is the super-alignment team going to interface with the rest of the AI alignment community, and specifically what kind of work from others would be helpful to them (e.g., evaluations they would want to exist in 2 years, specific problems in interpretability that seem important to solve early, curricula for AIs to learn about the alignment problem while avoiding content we may not want them reading)?
To provide more context on my thinking that leads to this question: I’m pretty worried that OpenAI is making themselves a single point of failure in existential security. Their plan seems to be a less-disingenuous version of “we are going to build superintelligence in the next 10 years, and we’re optimistic that our alignment team will solve catastrophic safety problems, but if they can’t then humanity is screwed anyway, because as mentioned, we’re going to build the god machine. We might try to pause if we can’t solve alignment, but we don’t expect that to help much.” Insofar as a unilateralist is taking existentially risky actions like this and they can’t be stopped, other folks might want to support their work to increase the chance of the super-alignment team’s success. Insofar as I want to support their work, I currently don’t know what they need.
Another framing behind this question is just: “Many people in the AI alignment community are also interested in solving this problem. How can they indirectly collaborate with you?” (Some people will want to collaborate directly, but that runs into corporate-closedness limitations.)
John Wentworth has a post on Godzilla strategies where he claims that putting an AGI to solve the alignment problem is like asking Godzilla to make a larger Godzilla behave. How will you ensure you don’t overshoot the intelligence of the agent you’re using to solve alignment and fall into the “Godzilla trap”?
(Leike responds to this here if anyone is interested)
What is Leike’s role in any development/deployment decision of OpenAI? Does he (or the alignment department) have veto power?
Relatedly: what role does he think he/the alignment department should have in these decisions?
There’s a maybe naive way of seeing their plan that leads to this objection:
“Once we have AIs that are human-level AI alignment researchers, it’s already too late. That’s already very powerful and goal-directed general AI, and we’ll be screwed soon after we develop it, either because it’s dangerous in itself or because it zips past that capability level fast since it’s an AI researcher, after all.”
What do you make of it?
Thanks for asking!
How much risk of human extinction are you willing to take in a large training run (e.g. one to train GPT-5)?
I love this question, because surely there has to be a real maximum probability here.
OpenAI as a whole, and individuals affiliated with or speaking for the org, appear to be largely behaving as if they are caught in an overdetermined race toward AGI.
What proportion of people at OpenAI believe this, and to what extent? What kind of observations, or actions or statements by others (and who?) would change their minds?
The Superalignment team’s goal is “to build a roughly human-level automated alignment researcher”.
Human-level AI systems sound capable enough to cause a global catastrophe if misaligned. So is the plan to make sure that these systems are definitely aligned (if so, how?), or to make sure that they are deployed in such a way that they would not be able to take catastrophic actions even if they wanted to (if so, what would that look like?)?
(I’ve just realised this is close to just a rephrasing of some of the other suggestions. Could be a helpful rephrasing though.)
What do you think are the biggest wins in technical safety so far? What do you see as the most promising strategies going forward?
OpenAI caught some flak for there being no women on the Superalignment team. Does Jan think that the demographics of the team will affect how the team conceptualizes “human intent” in ways that are relevant, at this early stage, to the work the team is doing?
Doesn’t 4 years seem like a ridiculously short time to achieve such a difficult goal? What kind of reference class does he think this problem belongs in, in terms of time and scale required? (E.g., the atomic bomb? HIV antivirals? A hard mathematical problem?)
Relatedly, does he not wish OpenAI had taken things more slowly and given him more time?
We have three questions about the plan of using AI systems to align more capable AI systems.
What form should your research output take if things go right? Specifically, what type of output would you want your automated alignment researcher to produce in the search for a solution to the alignment problem? Is the plan to generate formal proofs, sets of heuristics, algorithms with explanations in natural language, or something else?
How would you verify that your automated alignment researcher is sufficiently aligned? What counts as evidence and what doesn’t? Related to the question above, how can one evaluate the output of this automated alignment researcher? This could range from a proof with formal guarantees to a natural language description of a technique together with a convincing explanation. As an example, the Underhanded C Contest is a setup in which malicious outputs can be produced and go undetected, or, if they are detected, can plausibly be passed off as an honest mistake.
Are you planning to find a way of aligning arbitrarily powerful superintelligences, or are you planning to align AI systems that are slightly more powerful than the automated alignment researcher? In the second case, what degree of alignment do you think is sufficient? Would you expect alignment that is not very close to 100% to become a problem when iterating this approach, similar to instability in numerical analysis?
Does he think all the properties of superintelligent systems that will be relevant for the success of alignment strategies already exist in current systems? That they will exist in systems within the next 4 years? (If not, aren’t there extremely important limitations to our ability to empirically test the strategies and figure out if they are likely to work?)
What role does he expect the increasing intelligence and agency of AIs to play in how difficult it will be for the team to reach its goal, i.e. achieving alignment of ASI (in less than 4 years)?
What chance does he estimate for them achieving this? What is the most likely reason for the approach to fail?
How could one assess the odds that their approach works?
Doesn’t one at a minimum need to know both the extent to which the alignment problem gets more difficult for increasingly intelligent & agentic AI and the extent to which automated alignment research scales current safety efforts?