I think our work is aimed at reducing the theory-practice gap of any alignment scheme that attempts to improve worst-case performance by training the model on data selected in the hope of eliciting bad behavior from it. For example, one of the main ingredients of our project is paying people to try to find inputs that trick the model, then training the model on these adversarial examples.
Many different alignment schemes involve some type of adversarial training. The kind of adversarial training we’re doing, where we rely solely on human ingenuity, isn’t going to work for ensuring good behavior from superhuman models. But getting good at the simple, manual version of adversarial training seems like a plausible prerequisite for being able to do research on the more complicated techniques that might actually scale.