I love this post and also expect it to be something that I point people towards in the future!
I was wondering about what kind of alignment failure—i.e. outer or inner alignment—you had in mind when describing sycophant models (for schemer models, it’s obviously an inner alignment failure).
It seems you could get sycophant models via inner alignment failure, because you could train them on a sensible, well-specified objective function, and yet the model learns to pursue human approval anyway (because “pursuing human approval” turned out to be more easily discovered by SGD).
It also seems you could get sycophant models via outer alignment failure, because e.g. a model trained using naive reward modelling (which would be an obviously terrible objective) seems very likely to yield a model that pursues approval from the humans whose feedback is used to train the reward model.
Does this seem right to you, and if so, which kind of alignment failure did you have in mind?
(Paul has written most explicitly about what a world full of advanced sycophants looks like/how it leads to existential catastrophe, and his stories are about outer alignment, so I’d be especially curious if you disagreed with that.)
I was imagining Sycophants as an outer alignment failure, assuming the model is trained with naive RL from human feedback.