Ben—thanks for the reminder about Harsanyi.
Trouble is, (1) the rationality assumption is demonstrably false for real humans, and (2) there's no reason for human groups to agree to aggregate their preferences in this way, any more than they'd be willing to dissolve their nation-states and hand unlimited power to a United Nations that promises to apply Harsanyi's theorem fairly and incorruptibly.
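(For context, Harsanyi's 1955 aggregation theorem says, roughly, that if every individual and the social observer rank lotteries according to the von Neumann–Morgenstern axioms, and the social ranking respects Pareto, then social utility must be a non-negatively weighted sum of individual utilities:)

$$ W(x) \;=\; \sum_{i=1}^{n} w_i \, U_i(x), \qquad w_i \ge 0 $$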
Yes, we could try to align AI with some kind of lowest-common-denominator aggregation of human (or mammal, or vertebrate) preferences. But if most humans would not be happy with that strategy, it's a non-starter for solving alignment.