The Superalignment team’s goal is “to build a roughly human-level automated alignment researcher”.
Human-level AI systems sound capable enough to cause a global catastrophe if misaligned. So is the plan to make sure that these systems are definitely aligned (if so, how?), or to make sure that they are deployed in such a way that they could not take catastrophic actions even if they wanted to (if so, what would that look like?)?
(I’ve just realised this is close to being a rephrasing of some of the other suggestions. It could still be a helpful rephrasing, though.)