Robust alignment requires alignment-relevant intervention during pretraining
I interpret this as saying that a necessary condition for robust alignment is training data that captures good values and discourages bad values. I think there’s good evidence that this matters a lot for current systems, so I lean toward agreeing. It still seems plausible to me that robust alignment could be achieved with post-training interventions and a relatively neutral pretraining setup.
That was the intervention class we had in mind, though there could be other pretraining interventions that don’t fall cleanly into good/bad values (e.g. promoting risk aversion).