If a model is deceptively aligned after fine-tuning, it seems most likely to me that it’s because it was deceptively aligned during pre-training.
How common do you think this view is? My impression is that most AI safety researchers think the opposite, and I’d like to know if that’s wrong.
I’m agnostic; pretraining usually involves a lot more training, but fine-tuning might involve more optimisation towards “take actions with effects in the real world”.
I don’t know how common each view is. My guess would be that in the old days this was the more common view, but there’s been a lot more discussion of deceptive alignment recently on LW.
I don’t find the argument “take actions with effects in the real world” --> “deceptive alignment” compelling, and my current guess is that most people would also back off from that style of argument if they thought about the issues more thoroughly. Mostly, though, it seems like this will just get settled by the empirics.
I don’t know how common each view is either, but I want to note that @evhub has stated that he doesn’t think pre-training is likely to create deception:
The biggest reason to think that pre-trained language models won’t be deceptive is just that their objective is extremely simple—just predict the world. That means that there’s less of a tricky path where stochastic gradient descent (SGD) has to spend a bunch of resources making their proxies just right, since it might just be able to very easily give it the very simple proxy of prediction. But that’s not fully clear—prediction can still be quite complex.
Also,
My guess would be that in the old days this was the more common view, but there’s been a lot more discussion of deceptive alignment recently on LW.
Do you have any recommendations for discussions of whether pre-training or fine-tuning is more likely to produce deceptive alignment?
I’ve argued about this point with Evan a few times but still don’t quite understand his take. I’d be interested in more back and forth. My most basic objection is that the fine-tuning objective is also extremely simple—produce actions that will be rated highly, or even just produce outputs that get a low loss. If you have a picture of the training process, then all of these are just very simple things to specify, trivial compared to other differences in complexity between deceptive alignment and proxy alignment. (And if you don’t yet have such a picture, then deceptive alignment also won’t yield good performance.)
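To make the comparison concrete, here is a minimal sketch, in PyTorch, of how briefly each objective can be written down. The names `model`, `reward_model`, and `sample_with_logprobs` are hypothetical stand-ins, and the fine-tuning term is a bare REINFORCE-style surrogate rather than anyone’s actual RLHF setup:

```python
import torch
import torch.nn.functional as F

def pretraining_loss(model, tokens):
    """Self-supervised objective: just predict the next token ("predict the world")."""
    logits = model(tokens[:, :-1])            # (batch, seq - 1, vocab)
    targets = tokens[:, 1:]                   # each position's target is the next token
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

def finetuning_loss(model, prompts, reward_model):
    """Fine-tuning objective: produce outputs that get rated highly.

    REINFORCE-style surrogate; real RLHF setups add a KL penalty, PPO clipping, etc.
    `sample_with_logprobs` is a hypothetical helper returning sampled outputs and
    their per-token log-probabilities under the current model.
    """
    outputs, logprobs = model.sample_with_logprobs(prompts)
    rewards = reward_model(prompts, outputs)  # scalar rating per sampled output
    return -(rewards.detach() * logprobs.sum(dim=-1)).mean()
```

Written this way, neither objective looks meaningfully harder to specify than the other, which is the point of the objection above.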
My intuition for why “actions that have effects in the real world” might promote deception is that maybe the “no causation without manipulation” idea is roughly correct. In this case, a self-supervised learner won’t develop the right kind of model of its training process, but the fine-tuned learner might.
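As a toy illustration of the observational/interventional gap this slogan points at (my own construction, assuming only NumPy, not anything from the thread): a purely predictive learner can happily exploit a correlation that vanishes the moment the variable is actually manipulated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Data-generating process: a confounder U drives both X and Y; X has no causal effect on Y.
U = rng.normal(size=n)
X = U + 0.3 * rng.normal(size=n)
Y = U + 0.3 * rng.normal(size=n)

# Observational (prediction) setting: regressing Y on X works well.
slope_obs = np.cov(X, Y)[0, 1] / np.var(X)
print(f"observational slope of Y on X: {slope_obs:.2f}")   # ~0.92, looks like a strong effect

# Interventional (action) setting: we set X ourselves, do(X = x); Y is unaffected.
X_do = rng.normal(size=n)             # actions chosen independently of U
Y_do = U + 0.3 * rng.normal(size=n)   # Y still depends only on U
slope_do = np.cov(X_do, Y_do)[0, 1] / np.var(X_do)
print(f"interventional slope of Y on X: {slope_do:.2f}")   # ~0.00, manipulation shows no effect
```

A learner trained only to predict never has to distinguish these two quantities; a learner rewarded for the effects of its actions does.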
I think “no causation without manipulation” must be substantially wrong. If it were entirely correct, one would have to say that pretraining ought not to help achieve high performance on a standard RLHF objective, which is obviously false. It still seems plausible to me that (a) the self-supervised learner learns a lot about the world it’s predicting, including a lot of “causal” stuff, and (b) there are still some gaps in its model regarding its own role in this world, which can be filled in with the right kind of fine-tuning.
Maybe this falls apart if I try to make it all more precise—these are initial thoughts, not the outcomes of trying to build a clear theory of the situation.