Doesn’t deceptive alignment require long-term goals? Why would a model develop long-term goals in pre-training?
DavidW
Deceptive Alignment is <1% Likely by Default
Order Matters for Deceptive Alignment
I don’t know how common each view is either, but I want to note that @evhub has stated that he doesn’t think pre-training is likely to create deception:
The biggest reason to think that pre-trained language models won’t be deceptive is just that their objective is extremely simple—just predict the world. That means that there’s less of a tricky path where stochastic gradient descent (SGD) has to spend a bunch of resources making their proxies just right, since it might just be able to very easily give it the very simple proxy of prediction. But that’s not fully clear—prediction can still be quite complex.
Also,
My guess would be that in the old days this was the more common view, but there’s been a lot more discussion of deceptive alignment recently on LW.
Do you have any recommendations for discussions of whether pre-training or fine-tuning is more likely to produce deceptive alignment?
How would the model develop situational awareness in pre-training when:
Unlike in fine-tuning, the vast majority of internet text prompts do not contain information that would help the model figure out that it is an ML model. The model can't infer this context from the prompt in the vast majority of pre-training inputs.
Predicting the next token of internet text is all that determines reward, so why would situational awareness help with reward unless the model were already deceptively aligned?
Situational awareness only produces deceptive alignment if the model already has long-term goals, and vice versa. Gradient descent is based on partial derivatives, so assuming that long-term goals and situational awareness are represented by different parameters:
If the model doesn’t already have long enough goal horizons for deceptive alignment, then marginally more situational awareness doesn’t increase deceptive alignment.
If the model doesn’t already have the kind of situational awareness necessary for deceptive alignment, then a marginally longer-term goal doesn’t increase deceptive alignment.
Therefore, the partial derivatives shouldn’t point toward either property unless the model already has one or the other.
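The partial-derivative argument above can be sketched with a toy example. Here the marginal benefit of deceptive alignment is modeled as the product of two hypothetical scalar "traits" (situational awareness and goal horizon); the names and the multiplicative form are illustrative assumptions, not a model of real training dynamics:

```python
# Toy illustration (not a claim about real training dynamics): model the
# marginal benefit of deceptive alignment as an interaction between two
# hypothetical scalar "traits" -- situational awareness s and goal horizon g.
# The benefit only accrues when BOTH are present, so model it as s * g.

def deception_benefit(s, g):
    """Hypothetical reward gain from deceptive alignment: needs both traits."""
    return s * g

def partial_derivatives(s, g):
    """Analytic partials of s*g: d/ds = g, d/dg = s."""
    return g, s  # (d_benefit/d_s, d_benefit/d_g)

# If the model has neither trait (s = g = 0), both partials vanish:
# gradient descent exerts no hyper-local pressure toward either one.
print(partial_derivatives(0.0, 0.0))  # (0.0, 0.0)

# Once one trait is present, the gradient for the other becomes nonzero.
print(partial_derivatives(0.5, 0.0))  # (0.0, 0.5)
```

The point of the multiplicative form is just that each partial derivative is proportional to the other trait, so neither gradient points anywhere until one trait already exists.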
Hubinger et al’s definition of unidentifiability, which I’m referring to in this post:
Unidentifiability. It is a common problem in machine learning for a dataset to not contain enough information to adequately pinpoint a specific concept. This is closely analogous to the reason that machine learning models can fail to generalize or be susceptible to adversarial examples(19)—there are many more ways of classifying data that do well in training than any specific way the programmers had in mind.
I’m referring to unidentifiability of a model’s goals in a (pre-trained) reinforcement learning context. I think the internet contains enough information to adequately pinpoint following directions. Do you disagree, or are you using this term some other way?
Pre-trained models having weird output probabilities for carefully designed gibberish inputs doesn’t seem relevant to me. Wouldn’t that be more of a capability failure than goal misalignment? It doesn’t seem to indicate that the model is optimizing for something other than next token prediction. I’m arguing that models are unlikely to be deceptively aligned, not that they are immune to all adversarial inputs. I haven’t read the post you linked to in full, so let me know if I’m missing something.
My unidentifiability argument is that if a model:
Has been pre-trained on ~the whole internet
Is sophisticated/complex enough to have TAI potential if (more) RL training occurs
Is told to follow directions subject to ethical considerations, then given directions
Then it would be really weird if it didn’t understand that it’s designed to follow directions subject to ethical considerations. If there’s a way for this to happen, I haven’t seen it described anywhere.
It might still occasionally misinterpret your directions, but it should generally understand that the training goal is to follow directions subject to non-consequentialist ethical considerations before RL training turns it into a proxy goal optimizer. Deception gets even less likely when you factor in that to be deceptive, it would need a very long-term goal and situational awareness before or around the same time as it understood that it needs to follow directions subject to ethical considerations. What’s the story for how this happens?
Thanks! I’ve updated both posts to reflect this.
Yeah, this is just partial feedback for now.
Excellent, I look forward to hearing what you think of the rest of it!
I think I don’t accept your initial premise. Maybe a model acquires situational awareness by first learning how similar models are trained for object-level reasons (maybe it’s an AI development assistant), then understanding how these lessons apply to its own training via a fairly straightforward generalisation (along the lines of “other models work like this, I am a model of a similar type, maybe I work like this too”). Neither of these steps requires an improvement in loss via reasoning about its own gradient updates.
Are you talking about the model gaining situational awareness from the prompt rather than gradients? I discussed this in the second two paragraphs of the section we’re discussing. What do you think of my arguments there? My point is that a model will understand the base goal before it becomes situationally aware, not that it can’t become situationally aware at all.
If it can be deceptive, then making the goal longer-term could help because the model reasons from the goal back to performing well in training. This might replace a goal that didn’t quite do the right thing but, because it was short-term, also didn’t care about doing well in training.
My central argument is about processes through which a model gains capabilities necessary for deception. If you assume it can be deceptive, then I agree that it can be deceptive, but that’s a trivial result. Also, if the goal isn’t long-term, then the model can’t be deceptively aligned.
I think our disagreement here boils down to what I said above: I’m imagining a model that might already be able to draw some correct conclusions about how it gets changed by training.
The original post argument we’re discussing is about how situational awareness could emerge. Again, if you assume that it has situational awareness, then I agree it has situational awareness. I’m talking about how a pre-situationally aware model could become situationally aware.
Also, if the model is situationally aware, do you agree that its expectations about the effect of the gradient updates are what matters, rather than the gradient updates themselves? It might be able to make predictions that are significantly better than random, but very specific predictions about the effects of updates, including the size of the effect, would be hard, for many of the same reasons that interpretability is hard.
Right, that was wrong of me. I still think the broader conclusion is right—if goal shifting boosts performance, then the model must already in some sense understand how to perform well, and the goal shifting just helps it apply this knowledge. But I’m not sure if understanding how to perform well in this sense is enough to avoid deceptive alignment—that’s why I wanted to read your first post (which I still haven’t done).
Are you arguing that an aligned model could become deceptively aligned to boost training performance? Or are you saying making the goal longer-term boosts performance?
I’d be interested to hear what you think of the first post when you get a chance. Thanks for engaging with my ideas!
My goal was just to clarify that I’m referring to the specific deceptive alignment story and not models being manipulative and dishonest in general. However, it sounds like what I thought of as ‘deceptive alignment’ is actually ‘playing the training game’, and what I described as a specific type of deceptive alignment is the only thing referred to as deceptive alignment. Is that right?
Thanks for clarifying this!
All of these comments are focused on my third core argument. What do you think of the other two? They all need to be wrong for deceptive alignment to be a likely outcome.
pursuing long term goals is, by hypothesis, beneficial for immediate-term reward
Recall that in this scenario, the model is not situationally aware yet, so it can’t be deceptive. Why would making the goal long-term increase immediate-term reward? If the model is trying to maximize immediate reward, making the goal longer-term would create a competing priority.
pursuing long term goals also entails considering the effects of future gradient updates
This isn’t necessarily true. Humans frequently plan for their future without thinking about how their own values will be affected and how that will affect their long-term goals. Why wouldn’t a model do the same thing? It seems very plausible that a model could have crude long-term planning without yet modeling gradient descent updates.
It’s not obvious to me that your “continuity” assumption generally holds (“gradient updates have only a small impact on model behaviour”). In particular, I have an intuition that small changes in “goals” could lead to large changes in behaviour.
The relevant factor here is actually how much the model expects its future behavior to change from a gradient update, because the model doesn’t yet know the effect of the upcoming gradient update. Models won’t necessarily be good at anticipating their own gradients or their own internal calculations. The effect sizes of gradient updates are hard to predict, so I would expect the model’s expectation to be much more continuous than the actual gradients. Do you agree? The difficulty of gradient prediction should also make it harder for the model to factor in the effects of gradient updates.
I think the speed at which it learns to consider how current actions affect gradient updates should depend on how much extra reward (accounting for regularisation) is available from changing in other ways.
Agreed, but I still expect that to have a limited impact if you’re looking over a relatively short-term period. It’s not guaranteed, but it’s a reasonable expectation.
One line of argument is that if changing goals is the most impactful way to improve performance, then the model must already have a highly developed understanding of the world.
It seems to me like speed of changing goals depends more on the number of differential adversarial examples and how different the reward is for them. Gradient descent can update in every direction at once. If updating its proxies helps performance, I see no reason why gradient descent wouldn’t update the proxies.
But if it has a highly developed model of the world, then it probably already has a good “understanding of the base objective” (I use quotes here because I’m not exactly sure what this means).
If it did, that would be great! Understanding the base objective (the researchers’ training goal) early on is an important factor to prevent deceptive alignment. I agree that this is likely to happen early on, as detailed in this sequence.
When I click on the link to your first post, I am notified that I don’t have access to the draft.
Thanks for pointing that out! It should be fixed now.
Thank you for your thoughtful feedback! You asked a lot of constructive questions, and I wanted to think carefully about my responses, so sorry for the delay. The first point in particular has helped me refine and clarify my own models.
My take: It’s not obvious what the goals of pre-trained language models are, or what the goals of RLHFed models are; plausibly they both have a goal like “minimize loss on the next token,” but the RLHF one is pursuing that on a different distribution.
That’s one plausible goal, but I would consider that a direct reward optimizer. That could be dangerous but is outside of the scope of this sequence. Direct reward optimizers can’t be deceptively aligned, because they are already trying to optimize reward in the training set, so there’s no need for them to do that instrumentally.
Another possibility is that the model could just learn to predict future words based on past words. This is subtly but importantly different from maximizing reward, because optimizing for reward and minimizing loss might require some level of situational awareness about the loss function (but probably not as much as deceptive alignment requires).
A neural network doesn’t necessarily have a way to know that there’s a loss function to be optimized for. Loss is calculated after all of the model’s cognition, so the model would need to conceptualize that loss calculation before it actually happens to explicitly optimize for it. I haven’t thought about this nearly as much as I’ve thought about deceptive alignment, but it seems much more likely for a pre-trained model to form goals directly derived from past tokens (e.g., predict the next token) than to directly optimize for something that’s not in its input (loss). This logic doesn’t carry over to the RL phase, because that will include a lot of information about the training environment in the prompt, which the model can use to infer information about the loss function. Pre-training, on the other hand, just uses a bunch of internet text without any context.
I am generally confused about what it means for a language model to have goals. Overall I’m just so unsure about this that I can’t reasonably put a probability on models developing an understanding of the base objective before goal directedness, but I wouldn’t confidently say this number is high or low. An example of the probability being high is if goal-directedness only emerges in response to RL (this seems unlikely); an example of the probability being low would be if models undergoing pre-training become goal-directed around predicting next tokens early on in training. Insofar as David thinks this probability is high, I do not understand why.
Assuming that pre-training doesn’t create a direct reward optimizer, I expect the goal of the model to shift when the training process shifts to fine-tuning with human feedback. A goal like next token prediction doesn’t make sense in the reinforcement learning context. This shift could result in a more consequentialist goal that could be dangerous in the presence of other key properties. It could become a direct reward optimizer, which is outside of the scope of this sequence. Alternatively, it could start optimizing for real-world consequences such as the base goal, or a proxy goal. Again, this post assumes that it does not become a direct reward optimizer.
That shift would likely take some time, and there would be a period in the early phases when the model has shifted away from its pre-training goal, but doesn’t yet have a new, coherent goal. Meanwhile, the concepts necessary for understanding the base goal should already be present from pre-training.
The first full section of this comment should shed some light on the level of base goal understanding I think is necessary. Essentially, I think the model’s goal needs to be its best representation of the base goal for it to become aligned. Does that help?
Pointing in the other direction, phase changes seems somewhat likely here; humans (sometimes) generally don’t care about outcomes in the world 100 or 1,000 years out, and then they get sold on longtermism and suddenly care about 10,000 years out.
The same comment I just linked to has a section on why I think human value learning is a bad analogy for gradient descent. I would be interested to hear what you think of it!
The time-span over which a model cares about its goals is plausibly a measure that will change discontinuously. Perhaps this looks like the goal “where am I minimizing loss” jumping horizons from “next token” to “this sentence/response” and perhaps “all of my performance ever” or “all of the performance of models similar to myself ever” or “all of the performance of models similar to myself in the multiverse”. I’m also unconfident about how likely this is, including still being confused about having goals or understanding base objectives, but I would not be surprised if the author turned out to be right that models understand the base objective before doing long-term goal optimization.
For gradient descent to point toward a long-term goal, a hyper-local (infinitesimal) shift in the direction of long-term goals has to improve performance on the current training batch. Even if long-term goals could emerge very rapidly with some pressure in that direction, the parameters would have to be in a place where an infinitesimal shift would get things started in that direction. That seems really unlikely.
Also, for a model to care about something very long-term, wouldn’t that mean it would sacrifice what it cares about in the short-term to accomplish that? And if not, what would it mean to care about the future? If the model is willing to sacrifice short-term values for long-term gains, that can only hurt its short-term performance. But gradient descent makes changes in the direction that increases reward in the current training batch. It can’t systematically optimize for anything else. How would this happen in practice?
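As a toy sketch of that last point (the parameter, cost, and learning rate are illustrative assumptions, not a model of real training): suppose a single scalar w controls how much the model sacrifices immediate reward for long-term goals, at cost c per unit on the current batch. Gradient descent on the batch loss then monotonically shrinks w:

```python
# Toy sketch (illustrative assumptions only): if a parameter w controls how
# much the policy sacrifices immediate reward for long-term goals, and that
# sacrifice costs c per unit on the current batch, then the batch-loss
# gradient always points toward shrinking w.

def batch_loss(w, c=0.3):
    """Loss on the current batch: long-term focus (w > 0) costs c*w now."""
    return 1.0 + c * w  # baseline loss plus the short-term cost

def sgd_step(w, lr=0.1, c=0.3):
    grad = c  # d(batch_loss)/dw is constant and positive, so SGD shrinks w
    return w - lr * grad

w = 0.5
for _ in range(10):
    w = sgd_step(w)
print(round(w, 2))  # 0.2: each step pushes long-term weighting down
```

The design choice here mirrors the argument: because gradient descent only sees the current batch, any trait that trades current reward for future payoff registers purely as a cost.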
Unfortunately I expect that competitive pressures will lead AI developers to want their AIs to pursue long-term objectives, and that might mess things up.
Yeah, I agree. But do you expect them to start with long-term reward? It seems a lot easier to start with short-term reward and then build off of that than to go for a long-term optimizer all at once. It just needs to stay myopic for long enough to point at its best representation of the base goal. And it would likely need to care about a very long goal horizon for the deceptive alignment argument to work.
The deception-relevant situational awareness, which involves understanding one’s future gradient updates, is unlikely to be selected for by gradient descent.
Quick clarification: I think it’s unlikely to happen quickly, not unlikely to happen at all.
My take: I think that GPT-3 probably has the relevant situational awareness in that its world model understands gradient descent. With the current approach to LLMs it seems this just comes pre-loaded into the models.
I agree it will have the basic concepts needed for situational awareness (and understanding the base goal). But unless the model can already engage in general reasoning, it will likely need further gradient updates to actually use knowledge about its own gradients in a self-reflective way. See more details about the last point in this comment.
If we’re saying that the 3rd requires the 2nd then we can just condition on the 2nd: Given that the model gets long-term planning before an understanding of the base objective, how likely is dangerous situational awareness (before understanding of base objective)?
This is an excellent point! I find this hard to reason about, because in my model of this, it would be extremely weird to have a long-term goal without understanding the base goal. It requires at least one of two things:
Very sophisticated pre-training somehow doesn’t create concepts necessary to understand the base goal, such as direction-following and ethical concepts, or
Long-term goals emerge during pre-training, before the model understands the base goal.
Scenario 1 seems very bizarre, especially because large language models already seem to have basic models of relevant concepts. Constitutional AI uses existing ethical concepts to train helpful models to be harmless. There was some RLHF training for helpfulness before they did this, but I doubt that’s where those concepts formed. This comment explains why I think scenario 2 is unlikely.
It’s hard to reason about, and my confidence for this is much lower than my previous arguments, but in the case that the model gets a long-term goal first, I still think the relevant kind of situational awareness is more complicated and difficult than understanding the base goal. So I still think the base goal understanding would probably come first.
Also, I prefer to be called David, not Dave. Thanks!
Thanks for summarizing this! I have a very different perspective on the likelihood of deceptive alignment, and I’d be interested to hear what you think of it!
For the sake of argument, suppose we have a model in pre-training that has a misaligned proxy goal and relevant situational awareness. But so far, it does not have a long-term goal. I’m picking these parameters because they seem most likely to create a long-term goal from scratch in the way you describe.
In order to be deceptively aligned, the model has to have a long enough goal horizon so it can value its total goal achievement after escaping oversight more than its total goal achievement before escaping oversight. But pre-training processes are incredibly long (~the whole internet for text models). A situationally aware model should anticipate a lot of subsequent pre-training, and that the oversight will likely remain strong for many iterations after pre-training. That requires a very long-term goal for any amount of deceptive alignment to be worthwhile.
Gradient descent updates parameters in the direction that most improves performance hyper-locally. Under the conditions I’ve described, making goals infinitesimally longer-term would not make deceptive alignment worthwhile. Therefore, I don’t expect the gradients to point toward longer-term goals.
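One way to picture this (a toy sketch with a hypothetical payoff function, not a claim about real models): if deception only pays off once the goal horizon exceeds the remaining training/oversight horizon T, then the payoff is zero everywhere below that threshold, and so is the hyper-local gradient with respect to the horizon:

```python
# Toy threshold sketch: suppose deceptive alignment only pays off once the
# goal horizon exceeds the remaining training/oversight horizon T. Below T,
# a marginally longer horizon yields zero benefit, so the hyper-local
# gradient is zero and there is no local pressure toward longer horizons.

def deception_payoff(horizon, T=1000.0):
    """Hypothetical payoff: only accrues past the oversight horizon T."""
    return max(0.0, horizon - T)

def numerical_grad(f, x, eps=1e-6):
    """Central-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

print(numerical_grad(deception_payoff, 10.0))    # ~0.0: no local pressure
print(numerical_grad(deception_payoff, 2000.0))  # ~1.0: only past threshold
```

A model far below the threshold gets no gradient signal toward longer horizons, which is the sense in which an infinitesimal shift can't get things started.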
Furthermore, if a model gained marginally longer-term goals, but not long-term enough to enable deceptive alignment, the longer-term goals would be a competing priority and harm immediate reward in expectation. Gradient descent should therefore push against this.
Wouldn’t it also be weird for a model to derive situational awareness but not understand that the training goal is next token prediction? Understanding the goal seems more important and less complicated than relevant understanding of situational awareness for a model that is not (yet) deceptively aligned. And if it understood the base goal, the model would just need to point at that. That’s much simpler and more logical than making the proxy goal long-term.
Likewise, if a model doesn’t have situational awareness, then it can’t be deceptive, and I wouldn’t expect a longer-term goal to help training performance.
Note that there’s a lot of overlap here with two of my core arguments for why I think deceptive alignment is unlikely to emerge in fine-tuning. I think deceptive alignment is very unlikely in both fine-tuning and pre-training.