I agree with Jaime’s answer about how alignment should avoid deception. (Catastrophic misgeneralization seems like it could fall under your alignment as capabilities argument.)
I sometimes think of alignment as something like “aligned with universal human values” rather than “aligned with the specific goal of the human who programmed this model”. One might argue there aren’t a ton of universal human values, which is correct! I’m thinking of very basic stuff like, “I value there being enough breathable oxygen to support human life”.