I’m rather curious whether training for scheming/deception in this context generalizes to other contexts. In the examples given, it seems that trying to retrain a helpful/honest/harmless model to be only helpful/honest results in the model strategically lying to preserve its harmlessness. In other words, it is sometimes dishonest, not just unhelpful. I’m curious whether such training generalizes to other contexts and yields a model that is more dishonest overall, or only one that is dishonest in specific use cases. If the former is true, it would update me somewhat further toward the belief that alignment training can be directly dual-use for alignment (not just a misuse risk, or indirectly bad for alignment by causing humans to let their guards down).