I was imagining Sycophants as an outer alignment failure, assuming the model is trained with naive RL from human feedback.