I think the idea is that it will only change its values in a particular direction if doing so helps it realise its current values. So it won’t change its values if that would mean doing horrible things according to its current values. A philosophical idea lurking in the background is that you can’t work out the correct values just by good thinking. Rather, basic starting values are thinking-independent: as long as you’re consistent, no amount of intelligence and reasoning will make you arrive at the correct ones. (They call this the “orthogonality thesis”, but a similar idea is known in academic philosophy as Humeanism about moral motivation. It’s quite mainstream but not without its critics.)