Thanks for your thorough response, and yeah, I’m broadly on board with all that. I think learning from detailed text behind decisions, not just the single-bit decision itself, is a great idea that can leverage a lot of recent work.
I don’t think that using modern ML to create a model of legal text is directly promising from an alignment standpoint, but by holding out some of your dataset (e.g. a random sample, or all decisions about a specific topic, or all decisions later than 2021), you can test the generalization properties of the model, and more importantly test interventions intended to improve those properties.
I don’t think we have that great a grasp right now on how to use human feedback to get models to generalize to situations the humans themselves can’t navigate. This is actually a good situation for sandwiching: suppose most text about a specific topic (e.g. use of a specific technology) is held back from the training set, and the model starts out bad at predicting that text. Could we leverage human feedback from non-experts in those cases (potentially even humans who start out basically ignorant about the topic) to help the model generalize better than those humans could alone? This is an intermediate goal that it would be great to advance towards.
I think the intersection with recommender algorithms—both in terms of making them, and in terms of efforts to empower people in the face of them—is interesting.
Suppose you have an interface that interacts with a human user by recommending actions (often with a moral component) in reaction to prompting (voice input seems emotionally powerful here), and that builds up a model of the user over time (or even by collecting data about the user much like every other app). How do you build this to empower the user rather than just reinforcing their most predictable tendencies? How to avoid top-down bias pushed onto the user by the company / org making the app?