This is a fair point, but I’m not sure why it wants to kill humans.
Like, my point here is not just 'we'll train it out of its natural tendency to kill humans'; it's more like 'if we're giving it its natural tendencies in the first place, through training, how does it get that one?' (There are arguments about instrumental convergence and such, but I say some stuff about that in the post.)
There are multiple cognitive strategies that succeed in a training regime that heavily penalizes killing humans (even just one human), such as:
1. avoid killing humans at all times
2. avoid killing humans when someone will notice
3. avoid killing humans during training
How do you incentivize (1)?