I’ve argued about this point with Evan a few times but still don’t quite understand his take. I’d be interested in more back and forth. My most basic objection is that the fine-tuning objective is also extremely simple—produce actions that will be rated highly, or even just produce outputs that get a low loss. If you have a picture of the training process, then all of these are very simple things to specify, trivial compared to other differences in complexity between deceptive alignment and proxy alignment. (And if you don’t yet have such a picture, then deceptive alignment also won’t yield good performance.)
My intuition for why “actions that have effects in the real world” might promote deception is that maybe the “no causation without manipulation” idea is roughly correct. If so, a self-supervised learner won’t develop the right kind of model of its training process, but the fine-tuned learner might.
I think “no causation without manipulation” must be substantially wrong. If it were entirely correct, one would have to say that pretraining ought not to help achieve high performance on a standard RLHF objective, which is obviously false. It still seems plausible to me that (a) the self-supervised learner learns a lot about the world it’s predicting, including a lot of “causal” stuff, and (b) there are still some gaps in its model regarding its own role in this world, which can be filled in by the right kind of fine-tuning.
Maybe this falls apart if I try to make it all more precise—these are initial thoughts, not the outcomes of trying to build a clear theory of the situation.