I just listened to Andrew Critch’s interview about “AI Research Considerations for Human Existential Safety” (ARCHES). I took some notes on the podcast episode, which I’ll share here. I won’t attempt to summarize the entire episode; instead, please see this summary of the ARCHES paper in the Alignment Newsletter.
We need to explicitly distinguish between “AI existential safety” and “AI safety” writ large. Saying “AI safety” without qualification is confusing both for people who focus on near-term AI safety problems and for those who focus on AI existential safety problems; it creates a bait-and-switch for both groups.
Although existential risk can refer to any event that permanently and drastically reduces humanity’s potential for future development (paraphrasing Bostrom 2013), ARCHES deals only with the risk of human extinction, both because it’s easier to reason about and because it’s not clear which non-extinction outcomes would count as existential events.
ARCHES frames AI alignment in terms of delegation from m ≥ 1 human stakeholders (such as individuals or organizations) to n ≥ 1 AI systems. Most alignment literature to date focuses on the single-single setting (one principal, one agent), but such settings in the real world are likely to evolve into multi-principal, multi-agent (“multi-multi”) settings. Computer scientists interested in AI existential safety should pay more attention to the multi-multi setting than to the single-single one for the following reasons:
There are commercial incentives to develop AI systems that are aligned with respect to the single-single setting, but not to make sure they won’t break down in the multi-multi setting. A group of AI systems that are “aligned” with respect to single-single may still precipitate human extinction if the systems are not designed to interact well.
Single-single delegation solutions feed into AI capabilities, so focusing only on single-single delegation may increase existential risk.
What alignment means in the multi-multi setting is more ambiguous because the presence of multiple stakeholders engenders heterogeneous preferences. However, predicting whether humanity goes extinct in the multi-multi setting is easier than predicting whether a group of AI systems will “optimally” satisfy a group’s preferences.
Critch and Krueger coin the term “prepotent AI” to refer to an AI system that is powerful enough to transform Earth’s environment at least as much as humans have, in a way that humans cannot effectively stop or reverse. Importantly, a prepotent AI need not be an artificial general intelligence.