My goal was just to clarify that I’m referring to the specific deceptive alignment story and not models being manipulative and dishonest in general. However, it sounds like what I thought of as ‘deceptive alignment’ is actually ‘playing the training game’, and what I described as a specific type of deceptive alignment is the only thing referred to as deceptive alignment. Is that right?
Thanks for clarifying this!
Yes, I think that’s how people have used the terms historically. I think it’s also generally good usage—the specific thing you talk about in the post is important and needs its own name.
Unfortunately, I think the term is very often misinterpreted, and there is some chance we should switch to a term like “instrumental alignment” instead, to avoid the general confusion with deception more broadly.
Thanks! I’ve updated both posts to reflect this.