[Question] Could someone help me understand why it’s so difficult to solve the alignment problem?

AGI will be able to model human language and psychology very accurately. Given that, isn’t alignment easy if you train the AGI to interpret linguistic prompts in the way that the “average” human would? (I know language doesn’t encode an exact meaning, but for any chunk of text, there does exist a distribution of ways that humans interpret it.)
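To make “average” a bit more concrete, here’s a toy sketch of the kind of aggregation I have in mind: treat each prompt as having a distribution of human readings and go with the most common one. The prompt and the sampled readings below are invented purely for illustration, and a real system would obviously need far richer representations of “interpretations.”

```python
# Toy sketch of "interpret the prompt the way the average human would":
# model each prompt as having a distribution of human readings and pick
# the most common one. The survey data below is made up.
from collections import Counter

def modal_interpretation(sampled_readings: list[str]) -> str:
    """Return the most common reading among the sampled humans."""
    return Counter(sampled_readings).most_common(1)[0][0]

# Hypothetical survey of how people read the instruction "keep the room cool":
readings = [
    "keep the temperature around 20 C",
    "keep the temperature around 20 C",
    "make the room fashionable",
]
print(modal_interpretation(readings))  # "keep the temperature around 20 C"
```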

Thus, on its face, inner alignment seems fairly doable. But apparently, according to Rob Bensinger, “We don’t know how to get an AI system’s goals to robustly ‘point at’ objects like ‘the American people’ … [or even] simpler physical systems.” Why is this so difficult? Is there an argument that it’s impossible?

Outer alignment doesn’t seem very difficult to me, either. Here’s a prompt I thought of: “Do not take an action if anyone on a specified list of philosophers, intellectuals, members of the public, etc. would prefer that you not take it, given all relevant knowledge of the action and its effects beforehand (consistent with the human legal standard of informed consent).” Wouldn’t this prompt (in its ideal form, not exactly as I wrote it) guard against many bad actions, including power-seeking behavior?
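To make that rule concrete, here’s a toy sketch of the decision procedure I’m imagining. It’s purely illustrative: the panel members, their “informed preference” oracles, and the string descriptions of actions are placeholders I made up, and the real difficulty is presumably in implementing that oracle at all.

```python
# Toy sketch of the veto rule: an action is permitted only if no one on a
# specified panel, suitably informed, would prefer the AGI not take it.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PanelMember:
    name: str
    # Hypothetical oracle: True means this person, given all relevant knowledge
    # of the action and its effects beforehand, would prefer the action not be taken.
    would_veto: Callable[[str], bool]

def action_is_permitted(action_description: str, panel: List[PanelMember]) -> bool:
    """The action is allowed only if no panel member would veto it."""
    return not any(member.would_veto(action_description) for member in panel)

# Stand-in judgments, just to show the shape of the rule:
panel = [
    PanelMember("philosopher A", lambda a: "seize control" in a),
    PanelMember("member of the public B", lambda a: "deceive" in a),
]
print(action_is_permitted("publish a weather forecast", panel))       # True
print(action_is_permitted("seize control of the power grid", panel))  # False
```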

Thank you for the help!