Thanks for writing this! Seems useful.
Questions about the overview of threat models:
Why is it useful to emphasize a relatively sharp distinction between “treacherous turn” and “you get what you measure”? Since (a) outer alignment failures could cause treacherous turns, and (b) there’s arguably* been no prominent case made for “you get what you measure” scenarios not involving treacherous turns, my best guess is that “you get what you measure” threat models are a subset of “treacherous turn” threat models.
Why is it useful to think of AI-influenced coordination failures as a major threat model in the alignment landscape? My intuition would be to think of it as falling under capabilities (since the worry, if I understand it, is that—even if AI systems are aligned with their users—bad things will still happen because coordination is hard?).
*As noted here, WFLLP1 includes, as a key mechanism for bad things getting locked in: “Eventually large-scale attempts to fix the problem are themselves opposed by the collective optimization of millions of optimizers pursuing simple goals.” Christiano’s more recent writings on outer alignment failures also seem to emphasize deceptive/adversarial dynamics like hacking sensors, which seems pretty treacherous to me. This emphasis seems right; it seems like, by default, any kind of misalignment of sufficiently competent agents (which people are incentivized to eventually design) creates incentives for deception (followed by a treacherous turn), making “you get what you measure” / outer misalignment a subcategory of “treacherous turn.”
This may be a disagreement about semantics. As I see it, my goal as an alignment researcher is to do whatever I can to reduce x-risk from powerful AI, and given my skillset, I mostly focus on how I can do this with technical research. If there are ways to shape the technical development of AI that lead to better cooperation, and this reduces x-risk, I count that as part of the alignment landscape.
Another take is Critch’s description of extending alignment to groups of systems and agents, which gives the multi-multi alignment problem: ensuring alignment between groups of humans and groups of AIs who all need to coordinate. I discuss this a bit more in the next post.
You’re right, this seems like mostly semantics. I’d guess it’s most clear/useful to use “alignment” a little more narrowly—reserving it for concepts that actually involve aligning things (i.e. using it roughly consistently with non-AI-specific uses of the word “alignment”). But the Critch(/Dafoe?) take you bring up seems like a good argument for why AI-influenced coordination failures fall under that.