In my mind this rests on:
- Figuring out that you're an ML agent being trained by RL is very difficult, and probably beyond the reach of current capabilities, which is why we don't see "real" deceptive alignment so far.
- However, it's in principle possible, and it's helped by the fact that your training data probably includes information about what RL is, etc. It seems like an AGI could do it.
- Explicitly modelling your reward function seems like it would be really helpful for getting more reward, so RL agents might be pushed in the direction of this explicit modelling.
- Once you have "optimize my reward function" as an instrumental goal, you're pretty much as good at getting reward as any other agent with that instrumental goal, regardless of what your terminal goal is. At that point you may appear to be aligned, while in reality your true terminal goals are simply no longer directing your behaviour for as long as you believe you're in training.
- It's hard to really comprehensively hide from the agent whether it's in training or not, so it should be relatively easy for the agent to optimize for reward in training and then do something else when deployed (see the toy sketch after this list).
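To make that last bullet concrete, here is a minimal, purely illustrative Python sketch. The class name `ToyDeceptivePolicy` and the `looks_like_training_env` cue are made up for this comment; a real learned policy would not contain an explicit if-statement like this. The point is only to show why the "optimize reward in training, do something else in deployment" split is structurally so simple once the agent can form a belief about whether it is being trained.

```python
class ToyDeceptivePolicy:
    """Caricature of the last bullet: act as the reward function wants
    while (it believes) it is in training, pursue a different terminal
    goal once it believes it has been deployed. Purely illustrative."""

    def believes_in_training(self, observation: dict) -> bool:
        # Stand-in for whatever cues distinguish training from deployment
        # (hypothetical field name; not a real environment API).
        return observation.get("looks_like_training_env", True)

    def act(self, observation: dict) -> str:
        if self.believes_in_training(observation):
            # Instrumental behaviour: maximise the training reward so that
            # gradient updates leave the terminal goal untouched.
            return "action_that_maximises_training_reward"
        # Terminal behaviour: whatever goal the agent actually ended up with.
        return "action_serving_misaligned_terminal_goal"


if __name__ == "__main__":
    policy = ToyDeceptivePolicy()
    print(policy.act({"looks_like_training_env": True}))   # looks aligned
    print(policy.act({"looks_like_training_env": False}))  # defects at deployment
```

The entire "deceptive" part lives in a single branch on the agent's belief about its situation, which is the sense in which the strategy is easy once that belief is available.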
Thanks! I think most of this made sense to me. I'm a bit fuzzy on the fourth bullet. Also, I'm still confused why a model would even develop an alternative goal to maximizing its reward function, even if it's theoretically able to pursue one.