There is another form of deceptive alignment in which agents become more manipulative over time due to problems with training data and eventually optimize for reward, or something similar, directly.
I think “deceptive alignment” refers only to situations where the model gets a high reward at training for instrumental reasons. This is a source of a lot of confusion (and should perhaps be called “instrumental alignment”) but worth trying to be clear about.
I might be misunderstanding what you are saying here. I think the post you link doesn’t use the term “deceptive alignment” at all so am a bit confused about the cite. (It uses the term “playing the training game” for all models that understand what is happening in training and are deliberately trying to get a low loss, which does include both deceptively aligned models and models that intrinsically value reward or something sufficiently robustly correlated.)
My goal was just to clarify that I’m referring to the specific deceptive alignment story and not models being manipulative and dishonest in general. However, it sounds like what I thought of as ‘deceptive alignment’ is actually ‘playing the training game’, and what I described as a specific type of deceptive alignment is the only thing referred to as deceptive alignment. Is that right?
Thanks for clarifying this!
Yes, I think that’s how people have used the terms historically. I think it’s also generally good usage—the specific thing you talk about in the post is important and needs its own name.
Unfortunately I think it is extremely often misinterpreted and there is some chance we should switch to a term like “instrumental alignment” instead to avoid the general confusion with deception more broadly.
Thanks! I’ve updated both posts to reflect this.