Focusing on empirical results:
Learning to summarize from human feedback was good, for several reasons.
I liked the recent paper empirically demonstrating objective robustness failures hypothesized in earlier theoretical work on inner alignment.
nit: link on “reasons” was pasted twice. For others it’s https://www.lesswrong.com/posts/PZtsoaoSLpKjjbMqM/the-case-for-aligning-narrowly-superhuman-models
Also hadn’t seen that paper. Thanks!