DavidW comments on Deceptive Alignment is <1% Likely by Default

DavidW Feb 22, 2023, 3:30 PM
6 points
1 ∶ 0
My goal was just to clarify that I’m referring to the specific deceptive alignment story and not models being manipulative and dishonest in general. However, it sounds like what I thought of as ‘deceptive alignment’ is actually ‘playing the training game’, and what I described as a specific type of deceptive alignment is the only thing referred to as deceptive alignment. Is that right?
Thanks for clarifying this!
- Paul_Christiano Feb 22, 2023, 5:21 PM
  5 points
  2 ∶ 0
  Yes, I think that’s how people have used the terms historically. I think it’s also generally good usage—the specific thing you talk about in the post is important and needs its own name.
  Unfortunately I think it is extremely often misinterpreted and there is some chance we should switch to a term like “instrumental alignment” instead to avoid the general confusion with deception more broadly.
  - DavidW Feb 23, 2023, 9:44 PM
    6 points
    1 ∶ 0
    Thanks! I’ve updated both posts to reflect this.