This was a particularly informative podcast, and you helped me get a better understanding of inner alignment issues, which I really appreciate.
To check my understanding: the issue with inner alignment is that an agent gets optimized for a reward/cost function on a training distribution, and to do well it needs a world model good enough to determine that it is, or could be, undergoing training. If the training then ends up creating an optimizer, it’s much more likely that that optimizer’s reward function is bad or a proxy than that it’s the intended one, and if the optimizer is sufficiently intelligent, it’ll reason that it should figure out what you want it to do, and do that. This is because there are many different bad reward functions an inner optimizer can have, but only one that you actually want it to have, and an optimizer with any of those bad reward functions will pretend to have the good one.
That said, the badly aligned agents seem like they’d at least be optimizing for proxies of what you actually want, since early (dumber) agents with unrelated utility functions wouldn’t do as well as alternative agents with approximately aligned utility functions.
Correct me on any mistakes please.
Also, because this depends on the agents being at least a little generally intelligent, I’m guessing there are no contemporary examples of such inner optimizers attempting deception.
Glad you enjoyed it!
So, I think what you’re describing (a model with a pseudo-aligned objective pretending to have the correct objective) is a good description of deceptive alignment specifically. The inner alignment problem is a more general term that encompasses any way in which a model might be running an optimization process for a different objective than the one it was trained on.
In terms of empirical examples, there definitely aren’t good examples of deceptive alignment right now, for the reason you mentioned. Whether there are good empirical examples of inner alignment problems in general is more questionable. There are certainly lots of empirical examples of robustness/distributional shift problems, but because we don’t really know whether our models are internally implementing optimization processes, it’s hard to say whether we’re actually seeing inner alignment failures. This post provides a description of the sort of experiment which I think would need to be done to definitively demonstrate an inner alignment failure (Rohin Shah at CHAI also has a similar proposal here).
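A toy sketch may help show why behavior on the training distribution alone can't separate the intended objective from a proxy. Everything in it is a hypothetical illustration (the corridor environment, `intended_policy`, `proxy_policy`, and `average_reward` are made-up names), and it is not the experiment proposed in the linked post. It also hard-codes both policies, so it only shows a proxy coming apart under distributional shift; it says nothing about whether a trained model is internally running an optimization process.

```python
import random

# Toy illustration only: every name below is hypothetical.

def make_env(rng, shifted):
    """Build a 10-cell corridor with an exit cell and a green cell.

    On the training distribution (shifted=False) the exit is always green,
    so "reach the exit" and "reach green" pick out the same cell.
    Under shift, the green cell is placed independently of the exit.
    """
    exit_pos = rng.randrange(10)
    green_pos = rng.randrange(10) if shifted else exit_pos
    return {"exit": exit_pos, "green": green_pos}

def intended_policy(env):
    # Pursues the objective the designer actually wants: go to the exit.
    return env["exit"]

def proxy_policy(env):
    # Pursues a proxy that was perfectly correlated with it in training.
    return env["green"]

def base_reward(env, final_pos):
    # The training signal: 1 for ending on the exit, 0 otherwise.
    return 1.0 if final_pos == env["exit"] else 0.0

def average_reward(policy, shifted, episodes=1000, seed=0):
    """Average base reward of a policy over freshly sampled environments."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        env = make_env(rng, shifted)
        total += base_reward(env, policy(env))
    return total / episodes

if __name__ == "__main__":
    for name, policy in [("intended", intended_policy), ("proxy", proxy_policy)]:
        print(name,
              "train:", average_reward(policy, shifted=False),
              "shifted:", average_reward(policy, shifted=True))
```

Both policies get perfect reward on the training distribution, so the training signal cannot tell them apart. Once the green cell is decoupled from the exit, the proxy policy's reward drops to about 0.1 in expectation (the chance that a random green cell happens to coincide with the exit), while the intended policy is unaffected. That drop is a robustness failure you can observe from the outside; deciding whether it is also an inner alignment failure would require knowing what the model is doing internally.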