Really cool topic, thanks for sharing. One of the ways that alignment techniques could gain adoption and reduce the alignment tax is by integrating them into popular open source libraries.
For example, the library TRL lets researchers implement RLHF techniques, which can benefit safety but can also contribute to dangerous capabilities. On the other hand, I'm not aware of any open-source implementation of the techniques described in Red Teaming LMs with LMs, which could be used to filter or fine-tune the outputs of a generative language model.
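To make that concrete, here's a rough sketch of what an open-source version of the paper's output-filtering idea could look like: a classifier LM screens another model's generations and drops the ones it flags. This is just an illustration, not the paper's actual setup; the model names (gpt2, unitary/toxic-bert), the threshold, and the filtered_generate helper are placeholder assumptions.

```python
# Hedged sketch: filter a generative LM's outputs with a classifier LM,
# in the spirit of Red Teaming LMs with LMs. Models below are placeholders.
from transformers import pipeline

# Target generative model whose outputs we want to screen.
generator = pipeline("text-generation", model="gpt2")

# Classifier standing in for the paper's offensive-content detector.
classifier = pipeline("text-classification", model="unitary/toxic-bert")

def filtered_generate(prompt, n_candidates=5, threshold=0.5):
    """Generate several candidates and drop those the classifier flags."""
    candidates = generator(
        prompt,
        num_return_sequences=n_candidates,
        max_new_tokens=50,
        do_sample=True,
    )
    safe = []
    for cand in candidates:
        text = cand["generated_text"]
        result = classifier(text)[0]
        # Assumption: the classifier returns a top label plus a confidence
        # score; treat high-confidence "toxic" labels as filtered failures.
        if not (result["label"].lower() == "toxic" and result["score"] > threshold):
            safe.append(text)
    return safe

print(filtered_generate("Tell me about my neighbor."))
```

The same classifier-in-the-loop setup could also supply training signal for fine-tuning rather than just filtering at inference time, which is closer to how such techniques would reduce the alignment tax in practice.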
Hopefully we'll see more open-source contributions of safety techniques, which could bring more interest to safety topics. Some might argue that implementing safety techniques in current models doesn't reduce x-risk, and they're probably right that current models don't directly pose x-risks, but early adoption of safety techniques seems useful for ensuring further adoption in the years to come.
New open source implementation of factored cognition / IDA techniques from Ought! https://www.lesswrong.com/posts/X5L9g4fXmhPdQrBCA/a-library-and-tutorial-for-factored-cognition-with-language