Miles Tidmarsh comments on Community Polls on Alignment Controversies

Miles Tidmarsh 18 Jun 2026 18:40 UTC
1 point
The intent was that, conditional on the AIs sharing most but not all human values, they wouldn’t change their own values later.

You could have a world where all humans die and the AIs later change their own values, and you could also have worlds where partially aligned AIs don’t wipe out humanity but change their values to be better (e.g. internalizing the goal of being aligned) or worse (e.g. internalizing the goal of a paperclip maximizer) by our measures.

In worlds where the first TAIs share most but not all human values, what do you think most likely happens?