We have three questions about the plan to use AI systems to align more capable AI systems.
What form should your research output take if things go right? Specifically, what type of output would you want your automated alignment researcher to produce in its search for a solution to the alignment problem? Is the plan to generate formal proofs, sets of heuristics, algorithms with explanations in natural language, or something else?
How would you verify that your automated alignment researcher is sufficiently aligned? What counts as evidence and what doesn't? Relatedly, how can one evaluate the output of this automated alignment researcher? That output could range from a proof with formal guarantees to a natural-language description of a technique together with a convincing explanation. As an example of the difficulty, the Underhanded C Contest is a setup in which malicious outputs can go undetected, or, if they are detected, can plausibly be passed off as honest mistakes.
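To make the Underhanded C Contest point concrete, here is a minimal sketch, not taken from the contest itself; the function names and the `=`-for-`==` slip are our own illustrative choices, and a reviewer who catches the bug could easily accept it as a typo rather than a planted backdoor.

```c
#include <stdio.h>

/* Hypothetical privilege check: only user 0 is an administrator. */
static int is_admin(int user_id) {
    return user_id == 0;
}

static void handle_request(int user_id) {
    int admin = is_admin(user_id);
    /* Reads like a comparison, but the single `=` assigns 1 to `admin`,
     * so the branch is taken for every caller. If spotted, it looks
     * like an honest typo, not a deliberate backdoor. */
    if (admin = 1) {
        printf("access granted to user %d\n", user_id);
    } else {
        printf("access denied to user %d\n", user_id);
    }
}

int main(void) {
    handle_request(42); /* an ordinary user, yet access is granted */
    return 0;
}
```

The worry in our question is the scaled-up version of this reviewability problem: an output whose flaw is subtle enough that human evaluation cannot distinguish sabotage from an honest mistake.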
Are you planning to find a way of aligning arbitrarily powerful superintelligences, or are you planning to align AI systems that are only slightly more powerful than the automated alignment researcher? In the second case, what degree of alignment do you think is sufficient? Would you expect alignment that is not very close to 100% to become a problem when iterating this approach, similar to instability in numerical analysis?
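To illustrate the numerical-analysis analogy, here is a toy sketch under an assumption we are inventing purely for illustration: that each hand-off in the bootstrapping chain preserves a fixed fraction of some scalar "alignment" quantity, so any shortfall compounds geometrically over iterations.

```c
#include <stdio.h>

int main(void) {
    /* Toy model: `retention` is a made-up per-step factor, not a real
     * metric. With retention < 1, the shortfall compounds across
     * hand-offs, like an amplified error in an unstable numerical scheme. */
    double alignment = 1.0;   /* idealized starting point */
    double retention = 0.99;  /* hypothetical 99% preserved per hand-off */
    for (int step = 1; step <= 20; step++) {
        alignment *= retention;
        printf("step %2d: alignment %.4f\n", step, alignment);
    }
    return 0;
}
```

In this toy model, twenty hand-offs at 99% retention already decay the quantity to roughly 0.82, which is the sense in which "not very close to 100%" might compound under iteration.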