Thank you for your thoughtful feedback! You asked a lot of constructive questions, and I wanted to think carefully about my responses, so sorry for the delay. The first point in particular has helped me refine and clarify my own models.
My take: It’s not obvious what the goals of pre-trained language models are, or what the goals of RLHF’d models are; plausibly they both have a goal like “minimize loss on the next token,” but the RLHF one is doing that on a different distribution.
That’s one plausible goal, but I would consider that a direct reward optimizer. That could be dangerous but is outside of the scope of this sequence. Direct reward optimizers can’t be deceptively aligned, because they are already trying to optimize reward in the training set, so there’s no need for them to do that instrumentally.
Another possibility is that the model could just learn to predict future words based on past words. This is subtly but importantly different from maximizing reward, because optimizing for reward and minimizing loss might require some level of situational awareness about the loss function (but probably not as much as deceptive alignment requires).
A neural network doesn’t necessarily have a way to know that there’s a loss function to be optimized for. Loss is calculated after all of the model’s cognition, so the model would need to conceptualize that loss calculation before it actually happens in order to explicitly optimize for it. I haven’t thought about this nearly as much as I’ve thought about deceptive alignment, but it seems much more likely for a pre-trained model to form goals directly derived from past tokens (e.g., predict the next token) than to optimize directly for something that’s not in its input (loss). This logic doesn’t carry over to the RL phase, because the prompt there will include a lot of information about the training environment, which the model can use to infer information about the loss function. Pre-training, on the other hand, just uses a bunch of internet text without any context.
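To make the distinction concrete, here is a toy sketch (entirely my illustration; `forward`, `training_step`, and the scoring rule are made up, not a real language model): the forward pass conditions only on past tokens, and the loss is computed afterwards from the model’s outputs, so the loss never appears anywhere in the model’s input.

```python
# Toy sketch, not a real LM: the point is only that the forward pass
# sees past tokens, never the loss.

def forward(params, past_tokens):
    """Everything the model can condition on is in past_tokens."""
    score = sum(params.get(t, 0.0) for t in past_tokens)
    return {"a": score, "b": 1.0 - score}  # made-up next-token scores

def training_step(params, context, next_token, lr=0.1):
    scores = forward(params, context)      # all model "cognition" happens here
    # Only now is the loss computed, outside the model, from its outputs.
    loss = 0.0 if max(scores, key=scores.get) == next_token else 1.0
    if loss > 0.0:                         # crude update nudging the prediction
        for t in context:
            params[t] = params.get(t, 0.0) + lr
    return loss

params = {}
first = training_step(params, ["a", "a"], "a")   # wrong at first (loss 1.0)
for _ in range(10):
    training_step(params, ["a", "a"], "a")
later = training_step(params, ["a", "a"], "a")   # now correct (loss 0.0)
```

The model’s “cognition” in `forward` finishes before `loss` exists, which is the sense in which explicitly optimizing for loss would require conceptualizing a computation that hasn’t happened yet.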
I am generally confused about what it means for a language model to have goals. Overall I’m just so unsure about this that I can’t reasonably put a probability on models developing an understanding of the base objective before goal directedness, but I wouldn’t confidently say this number is high or low. An example of the probability being high is if goal-directedness only emerges in response to RL (this seems unlikely); an example of the probability being low would be if models undergoing pre-training become goal-directed around predicting next tokens early on in training. Insofar as David thinks this probability is high, I do not understand why.
Assuming that pre-training doesn’t create a direct reward optimizer, I expect the goal of the model to shift when the training process shifts to fine-tuning with human feedback. A goal like next token prediction doesn’t make sense in the reinforcement learning context. This shift could result in a more consequentialist goal that could be dangerous in the presence of other key properties. It could become a direct reward optimizer, which is outside of the scope of this sequence. Alternatively, it could start optimizing for real-world consequences such as the base goal, or a proxy goal. Again, this post assumes that it does not become a direct reward optimizer.
That shift would likely take some time, and there would be a period in the early phases when the model has shifted away from its pre-training goal, but doesn’t yet have a new, coherent goal. Meanwhile, the concepts necessary for understanding the base goal should already be present from pre-training.
The first full section of this comment should shed some light on the level of base goal understanding I think is necessary. Essentially, I think the model’s goal needs to be its best representation of the base goal for it to become aligned. Does that help?
Pointing in the other direction, phase changes seem somewhat likely here; humans (sometimes) don’t care about outcomes in the world 100 or 1,000 years out, and then they get sold on longtermism and suddenly care about 10,000 years out.
The same comment I just linked to has a section on why I think human value learning is a bad analogy for gradient descent. I would be interested to hear what you think of it!
The time horizon over which a model cares about its goals is plausibly a measure that will change discontinuously. Perhaps this looks like the goal “where am I minimizing loss” jumping horizons from “next token” to “this sentence/response” and perhaps to “all of my performance ever,” “all of the performance of models similar to myself ever,” or “all of the performance of models similar to myself in the multiverse.” I’m also not confident about how likely this is (I’m still confused about what it means to have goals or to understand base objectives), but I would not be surprised if the author turned out to be right that models understand the base objective before doing long-term goal optimization.
For gradient descent to point toward a long-term goal, a hyper-local (infinitesimal) shift in the direction of long-term goals has to improve performance on the current training batch. Even if long-term goals could emerge very rapidly with some pressure in that direction, the parameters would have to be in a place where an infinitesimal shift would get things started in that direction. That seems really unlikely.
Also, for a model to care about something very long-term, wouldn’t that mean it would sacrifice what it cares about in the short-term to accomplish that? And if not, what would it mean to care about the future? If the model is willing to sacrifice short-term values for long-term gains, that can only hurt its short-term performance. But gradient descent makes changes in the direction that increases reward in the current training batch. It can’t systematically optimize for anything else. How would this happen in practice?
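The argument in the last two paragraphs can be sketched numerically (a toy one-parameter example of my own, not anything from the original discussion): a vanilla gradient descent step is computed solely from the current batch’s loss, so no term in the update refers to future batches or long-horizon payoffs.

```python
# Toy illustration: each update uses ONLY the current batch's loss.

def loss_on_batch(w, batch):
    # squared error of a one-parameter "model" y = w * x on this batch
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)

def sgd_step(w, batch, lr=0.1, eps=1e-6):
    # finite-difference gradient of the CURRENT batch loss; nothing
    # here mentions future batches, future updates, or goal horizons
    grad = (loss_on_batch(w + eps, batch) - loss_on_batch(w - eps, batch)) / (2 * eps)
    return w - lr * grad

batch = [(1.0, 2.0), (2.0, 4.0)]   # data consistent with w = 2
w = 0.0
for _ in range(100):
    w = sgd_step(w, batch)
# w is driven toward whatever minimizes loss on this batch (w -> 2)
```

A parameter setting that sacrificed current-batch loss for some long-horizon payoff would, under this update rule, simply be moved away from; that is the sense in which gradient descent can’t systematically optimize for anything beyond the current batch.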
Unfortunately I expect that competitive pressures will lead AI developers to want their AIs to pursue long-term objectives, and that might mess things up.
Yeah, I agree. But do you expect them to start with long-term reward? It seems a lot easier to start with short-term reward and then build off of that than to go for a long-term optimizer all at once. It just needs to stay myopic for long enough to point at its best representation of the base goal. And it would likely need to care about a very long goal horizon for the deceptive alignment argument to work.
The deception-relevant situational awareness, which involves understanding one’s future gradient updates, is unlikely to be selected for by gradient descent.
Quick clarification: I think it’s unlikely to happen quickly, not unlikely to happen at all.
My take: I think that GPT-3 probably has the relevant situational awareness in that its world model understands gradient descent. With the current approach to LLMs it seems this just comes pre-loaded into the models.
I agree it will have the basic concepts needed for situational awareness (and understanding the base goal). But unless the model can already engage in general reasoning, it will likely need further gradient updates to actually use knowledge about its own gradients in a self-reflective way. See more details about the last point in this comment.
If we’re saying that the 3rd requires the 2nd, then we can just condition on the 2nd: Given that the model gets long-term planning before an understanding of the base objective, how likely is dangerous situational awareness (before understanding of the base objective)?
This is an excellent point! I find this hard to reason about, because in my model of this, it would be extremely weird to have a long-term goal without understanding the base goal. It requires at least one of two things:
1. Very sophisticated pre-training somehow doesn’t create the concepts necessary to understand the base goal, such as direction-following and ethical concepts, or
2. Long-term goals emerge during pre-training, before the model understands the base goal.
Scenario 1 seems very bizarre, especially because large language models already seem to have basic models of relevant concepts. Constitutional AI uses existing ethical concepts to train helpful models to be harmless. There was some RLHF training for helpfulness before they did this, but I doubt that’s where those concepts formed. This comment explains why I think Scenario 2 is unlikely.
It’s hard to reason about, and my confidence here is much lower than for my previous arguments, but even in the case where the model gets a long-term goal first, I still think the relevant kind of situational awareness is more complicated and difficult than understanding the base goal. So I still think the base goal understanding would probably come first.
Also, I prefer to be called David, not Dave. Thanks!
Sorry about the name mistake. Thanks for the reply. I’m somewhat pessimistic about us two making progress on our disagreements here because it seems to me like we’re very confused about basic concepts related to what we’re talking about. But I will think about this and maybe give a more thorough answer later.