All of these comments are focused on my third core argument. What do you think of the other two? All three need to be wrong for deceptive alignment to be a likely outcome.
pursuing long term goals is, by hypothesis, beneficial for immediate-term reward
Recall that in this scenario, the model is not situationally aware yet, so it can’t be deceptive. Why would making the goal long-term increase immediate-term reward? If the model is trying to maximize immediate reward, making the goal longer-term would create a competing priority.
pursuing long term goals also entails considering the effects of future gradient updates
This isn’t necessarily true. Humans frequently plan for their future without considering how their own values will change or how those changes will affect their long-term goals. Why wouldn’t a model do the same thing? It seems very plausible that a model could do crude long-term planning without yet modeling gradient descent updates.
It’s not obvious to me that your “continuity” assumption generally holds (“gradient updates have only a small impact on model behaviour”). In particular, I have an intuition that small changes in “goals” could lead to large changes in behaviour.
The relevant factor here is actually how much the model expects its future behavior to change from a gradient update, because the model doesn’t yet know the effect of the upcoming update. Models won’t necessarily be good at anticipating their own gradients or their own internal calculations. The sizes of gradient updates are hard to predict, so I would expect the model’s expectation to be much more continuous than the actual updates. Do you agree? The difficulty of predicting gradients should also make it harder for the model to factor in the effects of gradient updates.
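To make that intuition concrete, here is a toy numerical sketch (my own illustration, not anything from the post; the distributions and numbers are hypothetical): if the model only knows a rough distribution over upcoming gradient updates, its expected behavior change is an average over that uncertainty, which varies far less from step to step than the realized updates do.

```python
# Toy illustration: averaging over uncertainty about gradient updates
# produces a much smoother "expected" trajectory than the actual updates.
import numpy as np

rng = np.random.default_rng(0)
n_steps = 20

# Hypothetical "actual" gradient updates: noisy, occasionally large.
actual_updates = rng.normal(loc=0.1, scale=1.0, size=n_steps)

# The model's expectation at each step: the mean over many sampled
# possibilities drawn from its (rough) belief about the update distribution.
expected_updates = np.array([
    rng.normal(loc=0.1, scale=1.0, size=1000).mean() for _ in range(n_steps)
])

print("std of actual updates:  ", actual_updates.std())    # large step-to-step variation
print("std of expected updates:", expected_updates.std())  # far smaller: smoother expectation
```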
I think the speed at which it learns to consider how current actions affect gradient updates should depend on how much extra reward (accounting for regularisation) is available from changing in other ways.
Agreed, but I still expect that to have a limited impact if you’re looking over a relatively short time horizon. It’s not guaranteed, but it’s a reasonable expectation.
One line of argument is that if changing goals is the most impactful way to improve performance, then the model must already have a highly developed understanding of the world.
It seems to me that the speed of goal changes depends more on the number of differential adversarial examples and how much the reward differs on them. Gradient descent can update in every direction at once. If updating its proxies helps performance, I see no reason why gradient descent wouldn’t update the proxies.
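As a toy sketch of that last point (my illustration; the loss function, parameters, and names are hypothetical and not specific to any real training setup): a single gradient step moves every parameter simultaneously, so a “proxy” parameter gets adjusted in the same updates as everything else whenever doing so reduces loss.

```python
# Toy sketch: gradient descent updates all parameters in parallel,
# so proxy-related parameters are not updated "after" other improvements.
import numpy as np

def loss(params):
    # Hypothetical loss depending on a "capability" parameter and a
    # "proxy goal" parameter; both contribute to training performance.
    capability, proxy = params
    return (capability - 2.0) ** 2 + (proxy - 1.0) ** 2

def numerical_grad(f, params, eps=1e-5):
    grad = np.zeros_like(params)
    for i in range(len(params)):
        bumped = params.copy()
        bumped[i] += eps
        grad[i] = (f(bumped) - f(params)) / eps
    return grad

params = np.array([0.0, 0.0])
lr = 0.1
for _ in range(100):
    # Both coordinates move in every step, including the proxy parameter.
    params = params - lr * numerical_grad(loss, params)

print(params)  # approaches [2.0, 1.0]: the proxy was updated alongside everything else
```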
But if it has a highly developed model of the world, then it probably already has a good “understanding of the base objective” (I use quotes here because I’m not exactly sure what this means).
If it did, that would be great! Understanding the base objective (the researchers’ training goal) early on is an important factor in preventing deceptive alignment. I agree that this is likely to happen early on, as detailed in this sequence.
When I click on the link to your first post, I am notified that I don’t have access to the draft.
Thanks for pointing that out! It should be fixed now.
Hubinger et al.’s definition of unidentifiability, which I’m referring to in this post:
I’m referring to unidentifiability of a model’s goals in a (pre-trained) reinforcement learning context. I think the internet contains enough information to adequately pinpoint the goal of following directions. Do you disagree, or are you using the term some other way?
Pre-trained models having weird output probabilities for carefully designed gibberish inputs doesn’t seem relevant to me. Wouldn’t that be more of a capability failure than goal misalignment? It doesn’t seem to indicate that the model is optimizing for something other than next token prediction. I’m arguing that models are unlikely to be deceptively aligned, not that they are immune to all adversarial inputs. I haven’t read the post you linked to in full, so let me know if I’m missing something.
My unidentifiability argument is that if a model:
Has been pre-trained on ~the whole internet
Is sophisticated/complex enough to have TAI potential if (more) RL training occurs
Is told to follow directions subject to ethical considerations, then given directions
Then it would be really weird if it didn’t understand that it’s designed to follow directions subject to ethical considerations. If there’s a way for this to happen, I haven’t seen it described anywhere.
It might still occasionally misinterpret your directions, but it should generally understand that the training goal is to follow directions subject to non-consequentialist ethical considerations before RL training turns it into a proxy goal optimizer. Deception gets even less likely when you factor in that, to be deceptive, the model would need a very long-term goal and situational awareness before, or around the same time as, it comes to understand that it needs to follow directions subject to ethical considerations. What’s the story for how this happens?