So, I think what you’re describing (a model with a pseudo-aligned objective pretending to have the correct objective) is a good description of deceptive alignment specifically. The inner alignment problem is a more general term that covers any way in which a model might be running an optimization process for a different objective than the one it was trained on.
In terms of empirical examples, there definitely aren’t good empirical examples of deceptive alignment right now, for the reason you mentioned. Whether there are good empirical examples of inner alignment problems in general is more questionable. There are certainly lots of empirical examples of robustness/distributional shift problems, but because we don’t really know whether our models are internally implementing optimization processes, it’s hard to say whether we’re actually seeing inner alignment failures. This post describes the sort of experiment that I think would need to be done to definitively demonstrate an inner alignment failure (Rohin Shah at CHAI also has a similar proposal here).
Glad you enjoyed it!