This is a fair point, but I’m not sure why it wants to kill humans.
Like, my point here is not just 'we'll train it out of its natural tendency to kill humans'; it's more like 'if we're giving it its natural tendencies in the first place, through training, how does it get that one?' (There are arguments about instrumental convergence and such, but I say some stuff about that in the post.)
There are multiple cognitive strategies that succeed in a training regime that heavily penalizes killing humans (even just one human), such as:
1. avoid killing humans at all times
2. avoid killing humans when someone will notice
3. avoid killing humans during training
How do you incentivize (1)?