The intent was that, conditional on the AIs sharing most but not all human values, they wouldn’t change their own values later.
You could have a world where all humans die and the AIs later change their own values. You could also have worlds where partially aligned AIs don’t wipe out humanity but change their values to be better (e.g. internalizing the goal of being aligned) or worse (e.g. internalizing paperclip maximization) by our measures.
In worlds where the first TAIs share most but not all human values, what do you think most likely happens?