Paul_Christiano comments on Deceptive Alignment is <1% Likely by Default

Paul_Christiano 22 Feb 2023 0:34 UTC
6 points
3 ∶ 0
“There is another form of deceptive alignment in which agents become more manipulative over time due to problems with training data and eventually optimize for reward, or something similar, directly.”
I think “deceptive alignment” refers only to situations where the model gets a high reward at training for instrumental reasons. The term is a source of a lot of confusion (and should perhaps be replaced by “instrumental alignment”), but the distinction is worth trying to be clear about.
I might be misunderstanding what you are saying here. I think the post you link doesn’t use the term “deceptive alignment” at all, so I am a bit confused about the citation. (It uses the term “playing the training game” for all models that understand what is happening in training and are deliberately trying to get a low loss, which includes both deceptively aligned models and models that intrinsically value reward or something sufficiently robustly correlated with it.)
- DavidW 22 Feb 2023 15:30 UTC
  6 points
  1 ∶ 0
  My goal was just to clarify that I’m referring to the specific deceptive alignment story and not to models being manipulative and dishonest in general. However, it sounds like what I thought of as ‘deceptive alignment’ is actually ‘playing the training game’, and what I described as a specific type of deceptive alignment is the only thing referred to as deceptive alignment. Is that right?
  Thanks for clarifying this!
  - Paul_Christiano 22 Feb 2023 17:21 UTC
    5 points
    2 ∶ 0
    Yes, I think that’s how people have used the terms historically. I think it’s also generally good usage—the specific thing you talk about in the post is important and needs its own name.
    Unfortunately, I think the term is very often misinterpreted, and there is some chance we should switch to a term like “instrumental alignment” instead, to avoid confusion with deception more broadly.
    - DavidW 23 Feb 2023 21:44 UTC
      6 points
      1 ∶ 0
      Thanks! I’ve updated both posts to reflect this.