Great post. A few months ago I wrote a private comment that makes a very similar point but frames it somewhat differently; I share it below in case it is of any interest.
Victoria Krakovna usefully defines the outer and inner alignment problems in terms of different “levels of specification”: the outer alignment problem is the problem of aligning the ideal specification (the goals of the designer) with the design specification (the goal implemented in the system), while the inner alignment problem is the problem of aligning this design specification with the revealed specification (the goal the system actually pursues). I think this model could be extended to define a third subcomponent of the alignment problem, alongside the inner and outer alignment problems. This would be the problem of moving from what we may call the normative specification (the goals that ought to be pursued) to the ideal specification (though it would be clearer to call the latter the “human specification”).
This “third alignment problem” is rarely formulated explicitly, in part because “AI alignment” is ambiguously defined to mean either “getting AI systems to do what we want them to do” or “getting AI systems to do what they ought to do”. But it seems important to distinguish between normative and human specifications, not only because (arguably) “humanity” may fail to pursue the goals it should, but also because the team of humans that succeeds in building the first AGI may not represent the goals of “humanity”. So this distinction should be relevant both to people (like classical and negative utilitarians) whose values deviate from humanity’s in ways that could matter a lot, and to “commonsense moralists” who think we should promote human values but are concerned that AI designers may not pursue these values (whether because the designers are not representative members of the population, because of self-interest, or for other reasons).
It’s unclear to me how important this third alignment problem is relative to the inner or outer alignment problems. But it seems important to be aware that it is a separate problem so that one can think about it explicitly and estimate its relative importance.
I replied to your comment in a new post here.
Thank you for the ping; I’ll take a look shortly.