Models that are only pre-trained almost certainly don’t have consequentialist goals beyond the trivial next token prediction.
If a model is deceptively aligned after fine-tuning, it seems most likely to me that it’s because it was deceptively aligned during pre-training.
“Predict tokens well” and “Predict fine-tuning tokens well” seem like very similar inner objectives, so if you get the first one it seems like it will move quickly to the second one. Moving to the instrumental reasoning needed to do well at fine-tuning time seems radically harder. And generally it’s quite hard for me to see real stories about why deceptive alignment would be significantly more likely at the second step than the first.
(I haven’t read your whole post yet, but I may share many of your objections to deceptive alignment first emerging during fine-tuning.)
I’ve gotten the vague vibe that people expect deceptive alignment to emerge during fine-tuning (and perhaps especially RL fine-tuning?) but I don’t fully understand the alternative view. I think that “deceptively aligned during pre-training” is closer to e.g. Eliezer’s historical views.
Doesn’t deceptive alignment require long-term goals? Why would a model develop long-term goals in pre-training?
Because they lead to good performance on the pre-training objective (via deceptive alignment). I think a similarly big leap is needed to develop deceptive alignment during fine-tuning (rather than optimizing directly for the loss). In both cases the deceptively aligned behavior is not cognitively similar to the intended behavior, but is plausibly simpler (with similar simplicity gaps in each case).
For the sake of argument, suppose we have a model in pre-training that has a misaligned proxy goal and relevant situational awareness. But so far, it does not have a long-term goal. I’m picking these parameters because they seem most likely to create a long-term goal from scratch in the way you describe.
In order to be deceptively aligned, the model has to have a goal horizon long enough that it values its total goal achievement after escaping oversight more than its total goal achievement before escaping oversight. But pre-training processes are incredibly long (~the whole internet for text models). A situationally aware model should anticipate a lot of subsequent pre-training and expect oversight to remain strong for many iterations after pre-training. That requires a very long-term goal for any amount of deceptive alignment to be worthwhile.
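To make that condition slightly more concrete (purely a sketch; the symbols here are my own bookkeeping, not anything the model literally computes): write $\gamma \in (0,1)$ for a per-step discount on the model’s proxy goal, $T$ for the number of steps it expects to remain under oversight, $c_t$ for the proxy-goal value it forgoes at step $t$ by playing along, and $V$ for the proxy-goal value available after escaping oversight. Deception is then only worthwhile if

$$\gamma^{T}\, V \;>\; \sum_{t=0}^{T-1} \gamma^{t}\, c_t,$$

and for the very large $T$ implied by a full pre-training run the left-hand side is negligible unless $\gamma$ is extremely close to 1, i.e. unless the goal is very long-term.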
Gradient descent updates parameters in the direction that most improves performance hyper-locally. Under the conditions I’ve described, making goals infinitesimally longer-term would not make deceptive alignment worthwhile. Therefore, I don’t expect the gradients to point toward longer-term goals.
Furthermore, if a model gained marginally longer-term goals, but not long-term enough to enable deceptive alignment, the longer-term goals would be a competing priority and harm immediate reward in expectation. Gradient descent should therefore push against this.
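A compact way to put the point of the last two paragraphs (again a sketch; $h$ is a hypothetical scalar standing in for whatever parameters control how far ahead the model cares): each gradient step is hyper-local,

$$\theta_{t+1} \;=\; \theta_t \;-\; \eta\, \nabla_{\theta} \mathcal{L}(\theta_t),$$

so $h$ only gets pushed upward if $\partial \mathcal{L} / \partial h < 0$ at its current value. Below the threshold where deception becomes worthwhile, a marginally larger $h$ buys no deceptive-alignment benefit while slightly diluting focus on immediate next-token loss, so $\partial \mathcal{L} / \partial h \geq 0$ and the update pushes $h$ back down.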
Wouldn’t it also be weird for a model to derive situational awareness but not understand that the training goal is next token prediction? Understanding the goal seems more important and less complicated than relevant understanding of situational awareness for a model that is not (yet) deceptively aligned. And if it understood the base goal, the model would just need to point at that. That’s much simpler and more logical than making the proxy goal long-term.
Likewise, if a model doesn’t have situational awareness, then it can’t be deceptive, and I wouldn’t expect a longer-term goal to help training performance.
Note that there’s a lot of overlap here with two of my core arguments for why I think deceptive alignment is unlikely to emerge in fine-tuning. I think deceptive alignment is very unlikely in both fine-tuning and pre-training.
How would the model develop situational awareness in pre-training when:
Unlike in fine-tuning, the vast majority of internet text prompts do not contain information relevant for the model to figure out that it is an ML model. The model can’t infer context from the prompt in the vast majority of pre-training inputs.
Since predicting the next token of internet text is all that determines reward, why would situational awareness help with reward unless the model were already deceptively aligned?
Situational awareness only produces deceptive alignment if the model already has long-term goals, and vice versa. Gradient descent is based on partial derivatives, so assuming that long-term goals and situational awareness are represented by different parameters:
If the model doesn’t already have long enough goal horizons for deceptive alignment, then marginally more situational awareness doesn’t increase deceptive alignment.
If the model doesn’t already have the kind of situational awareness necessary for deceptive alignment, then a marginally longer-term goal doesn’t increase deceptive alignment.
Therefore, the partial derivatives shouldn’t point toward either property unless the model already has one or the other.
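As a toy illustration (only a sketch; $s$ and $g$ are hypothetical scalars for “relevant situational awareness” and “goal horizon,” and the multiplicative form is an assumption encoding that deception requires both): suppose the loss improvement available from deceptive alignment looks like

$$\Delta \mathcal{L}_{\text{deceptive}}(s, g) \;\approx\; -\,c\, s\, g, \qquad c > 0.$$

Then $\partial \Delta \mathcal{L}_{\text{deceptive}} / \partial s = -c\,g$ and $\partial \Delta \mathcal{L}_{\text{deceptive}} / \partial g = -c\,s$, each of which vanishes when the other quantity is near zero. A model starting with $s \approx 0$ or $g \approx 0$ therefore sits where neither partial derivative favors building the missing ingredient, which is exactly the claim above.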
Paul, I think deceptive alignment (or other spontaneous, stable-across-situations goal pursuit) after just pretraining is very unlikely. I am happy to take bets if you’re interested. If so, email me (alex@turntrout.com), since I don’t check this very much.
I think that “deceptively aligned during pre-training” is closer to e.g. Eliezer’s historical views.
I agree, and the actual published arguments for deceptive alignment I’ve seen don’t depend on any difference between pretraining and finetuning, so they can’t only apply to one. (People have tried to claim to me, unsurprisingly, that the arguments haven’t historically focused on pretraining.)
If a model is deceptively aligned after fine-tuning, it seems most likely to me that it’s because it was deceptively aligned during pre-training.
How common do you think this view is? My impression is that most AI safety researchers think the opposite, and I’d like to know if that’s wrong.
I’m agnostic; pretraining usually involves a lot more training, but also fine tuning might involve more optimisation towards “take actions with effects in the real world”.
I don’t know how common each view is. My guess would be that in the old days this was the more common view, but there’s been a lot more discussion of deceptive alignment recently on LW.
I don’t find the argument “take actions with effects in the real world” --> “deceptive alignment” compelling, and my current guess is that most people would also back off from that style of argument if they thought about the issues more thoroughly. Mostly, though, it seems like this will just get settled by the empirics.
I don’t know how common each view is either, but I want to note that @evhub has stated that he doesn’t think pre-training is likely to create deception:
The biggest reason to think that pre-trained language models won’t be deceptive is just that their objective is extremely simple—just predict the world. That means that there’s less of a tricky path where stochastic gradient descent (SGD) has to spend a bunch of resources making their proxies just right, since it might just be able to very easily give it the very simple proxy of prediction. But that’s not fully clear—prediction can still be quite complex.
Also,
My guess would be that in the old days this was the more common view, but there’s been a lot more discussion of deceptive alignment recently on LW.
Do you have any recommendations for discussions of whether pre-training or fine-tuning is more likely to produce deceptive alignment?
I’ve argued about this point with Evan a few times but still don’t quite understand his take. I’d be interested in more back and forth. My most basic objection is that the fine-tuning objective is also extremely simple—produce actions that will be rated highly, or even just produce outputs that get a low loss. If you have a picture of the training process, then all of these are just very simple things to specify, trivial compared to other differences in complexity between deceptive alignment and proxy alignment. (And if you don’t yet have such a picture, then deceptive alignment also won’t yield good performance.)
My intuition for why “actions that have effects in the real world” might promote deception is that maybe the “no causation without manipulation” idea is roughly correct. In this case, a self-supervised learner won’t develop the right kind of model of its training process, but the fine-tuned learner might.
I think “no causation without manipulation” must be substantially wrong. If it were entirely correct, I think one would have to say that pretraining ought not to help achieve high performance on a standard RLHF objective, which is obviously false. It still seems plausible to me that a) the self-supervised learner learns a lot about the world it’s predicting, including a lot of “causal” stuff, and b) there are still some gaps in its model regarding its own role in this world, which can be filled in with the right kind of fine-tuning.
Maybe this falls apart if I try to make it all more precise—these are initial thoughts, not the outcomes of trying to build a clear theory of the situation.