Thanks for your thorough response, and yeah, I’m broadly on board with all that. I think learning from detailed text behind decisions, not just the single-bit decision itself, is a great idea that can leverage a lot of recent work.
I don’t think that using modern ML to create a model of legal text is directly promising from an alignment standpoint, but by holding out some of your dataset (e.g. a random sample, or all decisions about a specific topic, or all decisions later than 2021), you can test the generalization properties of the model, and more importantly test interventions intended to improve those properties.
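Here is a rough sketch of the three holdout splits mentioned above: a random sample, a whole topic, and everything after 2021. It assumes each decision is a record with a topic label and a year. The `Decision` record, its fields, and the function names are all hypothetical, made up for illustration. Only the random split measures in-distribution performance. The topic and temporal splits are the ones that actually probe generalization to unseen material.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Decision:
    # Hypothetical record: one legal decision plus its written reasoning.
    case_id: str
    topic: str
    year: int
    text: str


Split = Tuple[List[Decision], List[Decision]]  # (train, test)


def split_by(decisions: List[Decision], held_out: Callable[[Decision], bool]) -> Split:
    """Put every decision where held_out(d) is True into the test set."""
    train = [d for d in decisions if not held_out(d)]
    test = [d for d in decisions if held_out(d)]
    return train, test


def random_holdout(decisions: List[Decision], fraction: float = 0.1, seed: int = 0) -> Split:
    # i.i.d. split: only tests generalization within the training distribution.
    rng = random.Random(seed)
    k = round(fraction * len(decisions))
    chosen = set(rng.sample(range(len(decisions)), k))
    train = [d for i, d in enumerate(decisions) if i not in chosen]
    test = [d for i, d in enumerate(decisions) if i in chosen]
    return train, test


def topic_holdout(decisions: List[Decision], topic: str) -> Split:
    # Every decision on one topic is unseen during training.
    return split_by(decisions, lambda d: d.topic == topic)


def temporal_holdout(decisions: List[Decision], after_year: int = 2021) -> Split:
    # All decisions later than the cutoff year are held out.
    return split_by(decisions, lambda d: d.year > after_year)


# Toy usage with made-up records.
decisions = [
    Decision("c1", "contracts", 2015, "reasoning for c1"),
    Decision("c2", "privacy", 2019, "reasoning for c2"),
    Decision("c3", "privacy", 2022, "reasoning for c3"),
    Decision("c4", "contracts", 2023, "reasoning for c4"),
]
train, test = temporal_holdout(decisions)
print([d.case_id for d in test])  # ['c3', 'c4']
```

You would then compare the model's loss on each test set, and measure any intervention by how far it closes the gap between the random split and the topic or temporal split.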
I don’t think we have that great a grasp right now on how to use human feedback to get models to generalize to situations the humans themselves can’t navigate. This is actually a good situation for sandwiching: suppose most text about a specific topic (e.g. use of a specific technology) is held back from the training set, and the model starts out bad at predicting that text. Could we leverage human feedback from non-experts in those cases (potentially even humans who start out basically ignorant about the topic) to help the model generalize better than those humans could alone? This is an intermediate goal that it would be great to advance towards.
Interesting. I will think more about the sandwiching approach between non-legal experts and legal experts.