Linyphia—totally agree (unsurprisingly!).
You raise good additional points about the dynamism and unpredictability of human values and preferences. Some of that unpredictability may reflect adaptive unpredictability (what biologists call ‘protean behavior’) that makes it harder for evolutionary enemies and rivals to predict what one’s going to do next. I discuss this issue extensively in this 1997 chapter and this 1996 simulation study. Insofar as human values are somewhat adaptively unpredictable by design, for good functional reasons, it will be very hard for reinforcement learning systems to get a good ‘fix’ on our preferences.
The other issues you raise, namely adaptive self-deception about our values (e.g. virtue signaling, as discussed in my 2019 book on the topic) and AI power corrupting humans, also deserve much more attention in AI alignment work, IMHO.