Paul, I think deceptive alignment (or other spontaneous, stable-across-situations goal pursuit) after just pretraining is very unlikely. I am happy to take bets if you're interested. If so, email me (alex@turntrout.com), since I don't check this very much.
I think that "deceptively aligned during pre-training" is closer to e.g. Eliezer's historical views.
I agree, and the actual published arguments for deceptive alignment I've seen don't depend on any difference between pretraining and finetuning, so they can't only apply to one. (People have tried to claim to me, unsurprisingly, that the arguments haven't historically focused on pretraining.)