All of these comments are focused on my third core argument. What do you think of the other two? They all need to be wrong for deceptive alignment to be a likely outcome.
pursuing long term goals is, by hypothesis, beneficial for immediate-term reward
Recall that in this scenario, the model is not situationally aware yet, so it can’t be deceptive. Why would making the goal long-term increase immediate-term reward? If the model is trying to maximize immediate reward, making the goal longer-term would create a competing priority.
pursuing long term goals also entails considering the effects of future gradient updates
This isn’t necessarily true. Humans frequently plan for their future without thinking about how their own values will be affected and how that will affect their long-term goals. Why wouldn’t a model do the same thing? It seems very plausible that a model could have crude long-term planning without yet modeling gradient descent updates.
It’s not obvious to me that your “continuity” assumption generally holds (“gradient updates have only a small impact on model behaviour”). In particular, I have an intuition that small changes in “goals” could lead to large changes in behaviour.
The relevant factor here is actually how much the model expects its future behavior to change from a gradient update, because the model doesn’t yet know the effect of the upcoming gradient update. Models won’t necessarily be good at anticipating their own gradients or their own internal calculations. The effect sizes of gradient updates are hard to predict, so I would expect the model’s expectation to be much more continuous than the actual gradients. Do you agree? The difficulty of gradient prediction should also make it harder for the model to factor in the effects of gradient updates.
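To make that intuition concrete, here’s a toy sketch (purely illustrative; the distributions and magnitudes are made up, not a claim about real training dynamics). If the model can only average over many plausible outcomes of an upcoming update, that expectation fluctuates far less from step to step than the realised updates do:

```python
# Toy illustration with made-up numbers: the model's *expected* behaviour
# change from an update (an average over its uncertainty) varies much less
# than the actual per-step changes.
import numpy as np

rng = np.random.default_rng(0)
n_steps = 200

# Pretend "actual" per-step behaviour changes: mostly small, occasionally large.
actual = rng.normal(0.0, 1.0, n_steps)
actual += (rng.random(n_steps) < 0.05) * rng.normal(0.0, 10.0, n_steps)

# The model can't see the realised change in advance, so its best guess is an
# average over many outcomes consistent with what it knows.
expected = np.array([rng.normal(0.0, 1.0, 500).mean() for _ in range(n_steps)])

print("std of actual per-step changes:    ", round(float(actual.std()), 3))
print("std of the model's expected change:", round(float(expected.std()), 3))
```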
I think the speed at which it learns to consider how current actions affect gradient updates should depend on how much extra reward (accounting for regularisation) is available from changing in other ways.
Agreed, but I still expect that to have a limited impact over a relatively short time horizon. It’s not guaranteed, but it’s a reasonable expectation.
One line of argument is that if changing goals is the most impactful way to improve performance, then the model must already have a highly developed understanding of the world.
It seems to me like speed of changing goals depends more on the number of differential adversarial examples and how different the reward is for them. Gradient descent can update in every direction at once. If updating its proxies helps performance, I see no reason why gradient descent wouldn’t update the proxies.
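As a purely illustrative sketch of that last point (the parameter names and setup are mine, not anything from the post): in the toy loop below, a parameter standing in for the model’s proxy goal and an unrelated “capability” parameter both receive gradient in the same step, because both affect the loss, so both get updated together.

```python
# Toy sketch with hypothetical parameters: one gradient step updates every
# parameter that affects the loss, so a "proxy goal" parameter is adjusted
# in the same step as everything else.
import torch

proxy_goal = torch.tensor(0.2, requires_grad=True)  # stands in for the model's proxy
capability = torch.tensor(0.1, requires_grad=True)  # stands in for an unrelated skill
target = torch.tensor(1.0)                          # stands in for the base objective
lr = 0.1

for _ in range(100):
    # Both parameters contribute to the loss, so both receive non-zero gradients.
    loss = (proxy_goal - target) ** 2 + (capability - target) ** 2
    loss.backward()
    with torch.no_grad():
        proxy_goal -= lr * proxy_goal.grad
        capability -= lr * capability.grad
    proxy_goal.grad.zero_()
    capability.grad.zero_()

# Both parameters end up near the target: gradient descent moved them together.
print(float(proxy_goal), float(capability))
```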
But if it has a highly developed model of the world, then it probably already has a good “understanding of the base objective” (I use quotes here because I’m not exactly sure what this means).
If it did, that would be great! Understanding the base objective (the researchers’ training goal) early on is an important factor to prevent deceptive alignment. I agree that this is likely to happen early on, as detailed in this sequence.
When I click on the link to your first post, I am notified that I don’t have access to the draft.
Thanks for pointing that out! It should be fixed now.
All of these comments are focused on my third core argument. What do you think of the other two? They all need to be wrong for deceptive alignment to be a likely outcome.
Yeah, this is just partial feedback for now.
Recall that in this scenario, the model is not situationally aware yet, so it can’t be deceptive. Why would making the goal long-term increase immediate-term reward? If the model is trying to maximize immediate reward, making the goal longer-term would create a competing priority.
I think I don’t accept your initial premise. Maybe a model acquires situational awareness by first learning about how similar models are trained for object-level reasons (maybe it’s an AI development assistant), and then understanding how those lessons apply to its own training via a fairly straightforward generalisation (along the lines of “other models work like this, I am a model of a similar type, maybe I work like this too”). Neither of these steps requires an improvement in loss via reasoning about its own gradient updates.
If it can be deceptive, then making the goal longer-term could help, because the model reasons from that goal back to performing well in training. The longer-term goal might replace a goal that didn’t quite do the right thing but, because it was short-term, also didn’t care about doing well in training.
This isn’t necessarily true. Humans frequently plan for their future without thinking about how their own values will be affected and how that will affect their long-term goals. Why wouldn’t a model do the same thing? It seems very plausible that a model could have crude long-term planning without yet modeling gradient descent updates.
I agree it could go either way.
The relevant factor here is actually how much the model expects its future behavior to change from a gradient update, because the model doesn’t yet know the effect of the upcoming gradient update.
I think our disagreement here boils down to what I said above: I’m imagining a model that might already be able to draw some correct conclusions about how it gets changed by training.
Gradient descent can update in every direction at once. If updating its proxies helps performance, I see no reason why gradient descent wouldn’t update the proxies.
Right, that was wrong of me. I still think the broader conclusion is right: if goal shifting boosts performance, then the model must already, in some sense, understand how to perform well, and the goal shifting just helps it apply that knowledge. But I’m not sure whether understanding how to perform well in this sense is enough to avoid deceptive alignment; that’s why I wanted to read your first post (which I still haven’t done).
Excellent, I look forward to hearing what you think of the rest of it!
I think I don’t accept your initial premise. Maybe a model acquires situational awareness by first learning about how similar models are trained for object-level reasons (maybe it’s an AI development assistant), and then understanding how those lessons apply to its own training via a fairly straightforward generalisation (along the lines of “other models work like this, I am a model of a similar type, maybe I work like this too”). Neither of these steps requires an improvement in loss via reasoning about its own gradient updates.
Are you talking about the model gaining situational awareness from the prompt rather than gradients? I discussed this in the second two paragraphs of the section we’re discussing. What do you think of my arguments there? My point is that a model will understand the base goal before becoming situationally aware, not that it can’t become situationally aware at all.
If it can be deceptive, then making the goal longer-term could help, because the model reasons from that goal back to performing well in training. The longer-term goal might replace a goal that didn’t quite do the right thing but, because it was short-term, also didn’t care about doing well in training.
My central argument is about processes through which a model gains capabilities necessary for deception. If you assume it can be deceptive, then I agree that it can be deceptive, but that’s a trivial result. Also, if the goal isn’t long-term, then the model can’t be deceptively aligned.
I think our disagreement here boils down to what I said above: I’m imagining a model that might already be able to draw some correct conclusions about how it gets changed by training.
The argument from the original post that we’re discussing is about how situational awareness could emerge. Again, if you assume that it has situational awareness, then I agree it has situational awareness. I’m talking about how a pre-situationally aware model could become situationally aware.
Also, if the model is situationally aware, do you agree that its expectations about the effect of the gradient updates are what matters, rather than the gradient updates themselves? It might be able to make predictions that are significantly better than random, but very specific predictions about the effects of updates, including the size of the effect, would be hard, for many of the same reasons that interpretability is hard.
Right, that was wrong of me. I still think the broader conclusion is right: if goal shifting boosts performance, then the model must already, in some sense, understand how to perform well, and the goal shifting just helps it apply that knowledge. But I’m not sure whether understanding how to perform well in this sense is enough to avoid deceptive alignment; that’s why I wanted to read your first post (which I still haven’t done).
Are you arguing that an aligned model could become deceptively aligned to boost training performance? Or are you saying making the goal longer-term boosts performance?
I’d be interested to hear what you think of the first post when you get a chance. Thanks for engaging with my ideas!