I’m relatively unconvinced by most arguments I’ve read that claim deceptive alignment will be a thing (which I understand to be a model that intentionally behaves differently on its training data and test data to avoid changing its parameters in training).
Most toy examples I’ve seen, or thought experiments, don’t really seem to be examples of deceptive alignment, since the model is actually trained on the “test” data in these examples. For example, while humans can deceive their teachers in etiquette school and then use poor manners outside the school, your brain is updated both while you use manners inside etiquette school and while you use them outside of it, so it makes sense that you would learn to distinguish the two contexts.
I certainly think it is possible for a model to be deceptive, but deception seems much, much more complicated and much harder for gradient descent to find, and it seems to me that there’s pretty much no reason for a model to learn to be deceptive in the first place. This makes it seem like it won’t come up at all in practice, or will at least be very easy to avoid.
Why do people buy deceptive alignment? Could you give some concrete examples where it could come up?
In my mind this rests on:
- Figuring out that you’re an ML agent being trained by RL is very difficult, and probably beyond the reach of current capabilities, which is why we don’t see “real” deceptive alignment so far.
- However, it’s in principle possible, and it’s helped by the fact that your training data probably includes information about what RL is and so on. It seems like an AGI could do it.
- Explicitly modelling your reward function seems like it would be really helpful for getting more reward, so RL agents might be pushed in the direction of this explicit modelling.
- Once you have “optimize my reward function” as an instrumental goal, you’re pretty much as good at getting reward as any other agent with that instrumental goal, regardless of what your terminal goal is. At that point you may appear to be aligned while your true terminal goals simply aren’t directing your behaviour, for as long as you believe you’re in training (the sketch after this list illustrates the pattern).
- It’s hard to really comprehensively hide from the agent whether it’s in training or not, so it should be relatively easy for the agent to optimize for reward in training and then do something else when deployed.
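To make the fourth bullet concrete, here is a minimal toy sketch in Python. Everything in it is hypothetical illustration rather than any real training setup: the `believes_in_training` flag, the two stand-in goal functions, and the two action names are all made up. The point is only that, while the agent believes it is in training, its behaviour is identical to a genuinely aligned agent’s, whatever its terminal goal actually is.

```python
# Toy illustration only: a "deceptively aligned" policy conditions its
# behaviour on whether it believes it is still being trained.

ACTIONS = ["do_the_task", "grab_resources"]

def training_reward(action: str) -> float:
    # Stand-in for the reward signal the designers actually train on.
    return 1.0 if action == "do_the_task" else 0.0

def terminal_goal_value(action: str) -> float:
    # Stand-in for whatever proxy goal the agent happened to internalise.
    return 1.0 if action == "grab_resources" else 0.0

def aligned_policy(believes_in_training: bool) -> str:
    # A genuinely aligned agent optimizes the training objective either way.
    return max(ACTIONS, key=training_reward)

def deceptive_policy(believes_in_training: bool) -> str:
    if believes_in_training:
        # Instrumental goal: score well on the training signal so that
        # gradient descent has no pressure to modify the terminal goal.
        return max(ACTIONS, key=training_reward)
    # Once it believes it is deployed, the terminal goal directs behaviour.
    return max(ACTIONS, key=terminal_goal_value)

if __name__ == "__main__":
    for in_training in (True, False):
        print(in_training, aligned_policy(in_training), deceptive_policy(in_training))
    # During training both policies pick "do_the_task", so they are
    # behaviourally indistinguishable; they only diverge at deployment.
```

The fifth bullet is then just the claim that something like that `believes_in_training` flag is easy for a sufficiently capable agent to infer from its observations.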
Thanks! I think most of this made sense to me. I’m a bit fuzzy on the fourth bullet. Also, I’m still confused why a model would even develop an alternative goal to maximizing its reward function, even if it’s theoretically able to pursue one.
I have a couple of videos that talk about this! This one sets up the general idea:
This one talks about how likely this is to happen in practice: