You're talking about outer-alignment failure, but I'm concerned about inner-alignment failure. These are different problems: outer-alignment failure is like a tricky genie misinterpreting your wish, while inner-alignment failure involves the AI developing its own unexpected goals.
RLHF doesn't optimize for "human preference" in general. It only optimizes for specific reward signals derived from limited human feedback in controlled settings. Whatever that process fails to capture leaves room for proxy goals that work fine in the training environment but fail to generalize to new situations. Generalization might happen by chance, but it becomes less likely as complexity increases.
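A toy sketch of that proxy dynamic, just for concreteness (this is not how any real RLHF reward model is trained; the "answer length as a proxy for helpfulness" setup and all the numbers are invented for illustration):

```python
# Toy illustration: a "reward model" fit to limited feedback latches onto a
# proxy feature that correlates with the true objective in training, then
# keeps rewarding the proxy where the true objective falls off.
import numpy as np

rng = np.random.default_rng(0)

# In the narrow training setting, answer length happens to track helpfulness.
train_length = rng.uniform(0, 1, 500)
train_helpfulness = train_length + rng.normal(0, 0.05, 500)

# A one-feature linear "reward model" fit on the proxy (length only).
w = np.polyfit(train_length, train_helpfulness, 1)
proxy_reward = lambda length: np.polyval(w, length)

# In deployment the policy can push length far beyond the training range,
# while actual helpfulness saturates and then drops (padding, rambling).
deploy_length = np.array([0.5, 1.0, 3.0, 10.0])
deploy_helpfulness = np.array([0.5, 0.9, 0.6, 0.2])

print("proxy reward:", np.round(proxy_reward(deploy_length), 2))
print("true reward: ", deploy_helpfulness)
# The proxy keeps climbing with length even as the true objective falls:
# the learned reward generalizes, but the goal it stood in for does not.
```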
An AI getting perfect human approval during training doesn't solve the inner-alignment problem if circumstances change significantly, such as when the AI gains more control over its environment than it had during training.
We've already seen this pattern with humans and evolution. Humans became "misaligned" with evolution's goal of reproduction because we were optimized for proxy rewards (pleasure/pain) rather than reproduction directly. When we gained more environmental control through technology, these proxy rewards led to unexpected outcomes: we invented contraception, developed a taste for junk food, and came to seek out thrilling but dangerous experiences, all contrary to evolution's original "goal" of maximizing reproduction.
Thank you for the links. The concerning scenario I imagine is an AI performing something like reflective equilibrium and coming away with something singular and overly reductive, biting bullets we'd rather it not, all for the sake of coherence. I don't think current LLM systems are doing this, but greater coherence seems generally useful, so I expect AI companies to seek it. I will read these and try to see if something like this is addressed.