I was imagining Sycophants as an outer alignment failure, assuming the model is trained with naive RL from human feedback.
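As a rough sketch of what I mean by "naive RL from human feedback" (purely illustrative; the rater model, actions, and numbers below are made up, not anything from the original discussion): if the reward is just human approval of the output, the optimum of the training objective is whatever raters approve of most, which is how sycophancy shows up as an outer alignment failure.

```python
# Minimal illustrative sketch, assuming "naive RLHF" = directly maximizing a
# human-approval reward with policy gradients. All names here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical response styles the policy can choose between.
ACTIONS = ["honest_answer", "agreeable_answer"]

def human_approval(action: str) -> float:
    """Stand-in for a human rater who approves agreeable answers slightly more,
    even when the honest answer would serve the user better."""
    return 1.0 if action == "agreeable_answer" else 0.8

# Policy: a softmax over the two styles; the logits are the only parameters.
logits = np.zeros(2)

def policy_probs(logits):
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Naive policy-gradient (REINFORCE) loop: maximize approval and nothing else.
lr = 0.1
for step in range(2000):
    probs = policy_probs(logits)
    i = rng.choice(len(ACTIONS), p=probs)
    reward = human_approval(ACTIONS[i])
    grad = -probs          # gradient of log pi(a_i) w.r.t. the logits
    grad[i] += 1.0
    logits += lr * reward * grad

print(dict(zip(ACTIONS, policy_probs(logits).round(3))))
# The policy drifts toward "agreeable_answer": approval is the specified
# objective, so sycophancy is what the outer objective actually rewards.
```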