I just listened to Andrew Critch's interview about "AI Research Considerations for Human Existential Safety" (ARCHES). I took some notes on the podcast episode, which I'll share here. I won't attempt to summarize the entire episode; instead, please see this summary of the ARCHES paper in the Alignment Newsletter.
We need to explicitly distinguish between "AI existential safety" and "AI safety" writ large. Saying "AI safety" without qualification is confusing for both people who focus on near-term AI safety problems and those who focus on AI existential safety problems; it creates a bait-and-switch for both groups.
Although existential risk can refer to any event that permanently and drastically reduces humanity's potential for future development (paraphrasing Bostrom 2013), ARCHES only deals with the risk of human extinction, because it's easier to reason about and because it's not clear which other outcomes, short of extinction, would count as existential events.
ARCHES frames AI alignment in terms of delegation from m ≥ 1 human stakeholders (such as individuals or organizations) to n ≥ 1 AI systems; a quick sketch of the resulting taxonomy appears after the list below. Most alignment literature to date focuses on the single-single setting (one principal, one agent), but such settings in the real world are likely to evolve into multi-principal, multi-agent settings. Computer scientists interested in AI existential safety should pay more attention to the multi-multi setting relative to the single-single one for the following reasons:
There are commercial incentives to develop AI systems that are aligned with respect to the single-single setting, but not to make sure they won't break down in the multi-multi setting. A group of AI systems that are "aligned" with respect to single-single may still precipitate human extinction if the systems are not designed to interact well.
Single-single delegation solutions feed into AI capabilities, so focusing only on single-single delegation may increase existential risk.
What alignment means in the multi-multi setting is more ambiguous because the presence of multiple stakeholders engenders heterogeneous preferences. However, predicting whether humanity goes extinct in the multi-multi setting is easier than predicting whether a group of AI systems will "optimally" satisfy a group's preferences.
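As a reading aid, here is a minimal sketch of the delegation regimes implied by the m/n framing above, paraphrasing how ARCHES names them (the notation is just shorthand for these notes, not the paper's exact formalism):

```latex
% m = number of human stakeholders (principals), n = number of AI systems (agents)
\begin{align*}
\text{single-single:} &\quad m = 1,\ n = 1 \\
\text{single-multi:}  &\quad m = 1,\ n > 1 \\
\text{multi-single:}  &\quad m > 1,\ n = 1 \\
\text{multi-multi:}   &\quad m > 1,\ n > 1
\end{align*}
```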
Critch and Krueger coin the term "prepotent AI" to refer to an AI system that is powerful enough to transform Earth's environment at least as much as humans have, and whose effects humans cannot effectively stop or reverse. Importantly, a prepotent AI need not be an artificial general intelligence.