DavidW comments on Deceptive Alignment is <1% Likely by Default

DavidW Feb 22, 2023, 3:50 PM
6 points
Yeah, this is just partial feedback for now.
Excellent, I look forward to hearing what you think of the rest of it!
I think I don’t accept your initial premise. Maybe a model acquires situational awareness by first learning how similar models are trained for object-level reasons (maybe it’s an AI development assistant), and then understanding how these lessons apply to its own training via a fairly straightforward generalisation (along the lines of “other models work like this, I am a model of a similar type, maybe I work like this too”). Neither of these steps requires an improvement in loss via reasoning about its own gradient updates.
Are you talking about the model gaining situational awareness from the prompt rather than gradients? I discussed this in the second two paragraphs of the section we’re discussing. What do you think of my arguments there? My point is that a model will understand the base goal before being situationally aware, not that it can’t become situationally aware at all.
If it can be deceptive, then making the goal longer term could help because it reasons from the goal back to performing well in training, and this might be replacing a goal that didn’t quite do the right thing, but because it was short term it also didn’t care about doing well in training.
My central argument is about processes through which a model gains capabilities necessary for deception. If you assume it can be deceptive, then I agree that it can be deceptive, but that’s a trivial result. Also, if the goal isn’t long-term, then the model can’t be deceptively aligned.
I think our disagreement here boils down to what I said above: I’m imagining a model that might already be able to draw some correct conclusions about how it gets changed by training.
The argument from the original post that we’re discussing is about how situational awareness could emerge. Again, if you assume that the model has situational awareness, then I agree it has situational awareness. I’m talking about how a pre-situationally aware model could become situationally aware.
Also, if the model is situationally aware, do you agree that its expectations about the effect of the gradient updates are what matters, rather than the gradient updates themselves? It might be able to make predictions that are significantly better than random, but very specific predictions about the effects of updates, including the size of the effect, would be hard, for many of the same reasons that interpretability is hard.
Right, that was wrong of me. I still think the broader conclusion is right: if goal shifting boosts performance, then the model must already in some sense understand how to perform well, and the goal shifting just helps it apply this knowledge. But I’m not sure if understanding how to perform well in this sense is enough to avoid deceptive alignment, which is why I wanted to read your first post (which I still haven’t done).
Are you arguing that an aligned model could become deceptively aligned to boost training performance? Or are you saying making the goal longer-term boosts performance?
I’d be interested to hear what you think of the first post when you get a chance. Thanks for engaging with my ideas!