I’ve heard this argument several times: that once we figure out how to align AI with the values of any one sentient being, aligning it with all the other billions or trillions of different sentient beings will be trivially easy.
One perhaps obvious point: if you make some rationality assumptions, there is a single unique solution to how individual preferences should be aggregated. So if you are able to align an AI with a single individual, you can repeat that alignment for every individual and then use Harsanyi’s theorem to aggregate their preferences.
This (assuming rationality) is the uniquely best method to aggregate preferences.
There are criticisms to be made of this solution, but it at least seems reasonable, and I don’t think there’s an analogous simple “reasonably good” solution to aligning AI with an individual.
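For reference, a minimal statement of what Harsanyi’s aggregation theorem delivers, under its standard assumptions (every individual and the social ordering satisfy the von Neumann–Morgenstern axioms, and the social ordering respects Pareto indifference): the social utility function must be an affine combination of the individual utilities,

$$U_{\text{social}}(x) \;=\; \sum_{i=1}^{n} w_i \, U_i(x) + c, \qquad w_i \ge 0,$$

so once each $U_i$ is pinned down by aligning with that individual, the only remaining freedom is the choice of weights $w_i$.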
Ben—thanks for the reminder about Harsanyi.
Trouble is: (1) the rationality assumption is demonstrably false, and (2) there’s no reason for human groups to agree to aggregate their preferences in this way, any more than they’d be willing to dissolve their nation-states and hand unlimited power over to a United Nations that promises to apply Harsanyi’s theorem fairly and incorruptibly.
Yes, we could try to align AI with some kind of lowest-common-denominator aggregated human (or mammal, or vertebrate) preferences. But if most humans would not be happy with that strategy, it’s a non-starter for solving alignment.