You’re focusing on the issue that current laws don’t capture everything we care about, which is definitely a problem.
However, the bigger problem is that there isn’t a clear definition of what does and doesn’t break the law that you can write down in a program.
You might say that we could train an AI system to learn what is and isn’t breaking the law; but then you might as well train an AI system to learn what is and isn’t the thing you want it to do. It’s not clear why training to follow laws would be easier than training it to do what you want; the latter would be a much more useful AI system.
Reasons other than directly getting value alignment from law that you might want to program AI to follow the law:
We will presumably want organizations with AI to be bound by law. Making their AI agents bound by law seems very important to that.
Relatedly, we probably want to be able to make ex ante deals that obligate AI/AI-owners to do stuff post-AGI, which seems much harder if AGI can evade enforcement.
We don’t want to rely on the incentives of human principals to ensure their agents advance their goals in purely legal ways, especially given AGI’s ability to e.g. hide its actions or motives.
Agree with all of these, but they don’t require you to program your AI to follow the law (which sounds horrendously difficult); they require that you can enforce the law on AI systems. If you’ve solved alignment to arbitrary tasks/preferences, then I’d expect you can solve the enforcement problem too—if you’re worried about criminals having powerful AI systems, you can give powerful AI systems to the police / judicial system / whatever else you think is important.
My guess is that programming AI to follow law might be easier or preferable to enforcing against human-principals. A weakly aligned AI (not X-risk or risk to principals, but not bound by law or general human morality) deployed by a human principal will probably come across illegal ways to advance its principal’s goals. It will also probably be able to hide its actions, obscure its motives, and/or evade detection better than humans could. If so, the equilibrium strategy is to give minimal oversight to the AI agent and tacitly allow it to break the law while advancing the principal’s goals, since enforcement against the principal is unlikely. This seems bad!
I agree that getting a guarantee of following the law is (probably) better than trying to ensure it through enforcement, all else equal. I also agree that in principle programming the AI to follow the law could give such a guarantee. So in some normative sense, I agree that it would be better if it were programmed to follow the law.
My main argument here is that it is not worth the effort. This factors into two claims:
First, it would be hard to do. I am a programmer / ML researcher and I have no idea how to program an AI to follow the law in some guaranteed way. I also have an intuitive sense that it would be very difficult. I think the vast majority of programmers / ML researchers would agree with me on this.
Second, it doesn’t provide much value, because you can get most of the benefits via enforcement, which has the virtue of being the solution we currently use.
It will also probably be able to hide its actions, obscure its motives, and/or evade detection better than humans could.
But AI-enabled police would be able to probe actions, infer motives, and detect bad behavior better than humans could. In addition, AI systems could have fewer rights than humans, and could be designed to be more transparent than humans, making the police’s job easier.
First, it would be hard to do. I am a programmer / ML researcher and I have no idea how to program an AI to follow the law in some guaranteed way. I also have an intuitive sense that it would be very difficult. I think the vast majority of programmers / ML researchers would agree with me on this.
This is valuable information. However, some ML people I have talked about this with have given positive feedback, so I think you might be overestimating the difficulty.
Second, it doesn’t provide much value, because you can get most of the benefits via enforcement, which has the virtue of being the solution we currently use.
Part of the reason that enforcement works, though, is that human agents have an independent incentive not to break the law (or, e.g., report legal violations) since they are legally accountable for their actions.
But AI-enabled police would be able to probe actions, infer motives, and detect bad behavior better than humans could. In addition, AI systems could have fewer rights than humans, and could be designed to be more transparent than humans, making the police’s job easier.
This seems to require the same type of fundamental ML research that I am proposing: mapping AI actions onto laws.
Part of the reason that enforcement works, though, is that human agents have an independent incentive not to break the law (or, e.g., report legal violations) since they are legally accountable for their actions.
Certainly you still need legal accountability—why wouldn’t we have that? If we solve alignment, then we can just have the AI’s owner be accountable for any law-breaking actions the AI takes.
This seems to require the same type of fundamental ML research that I am proposing: mapping AI actions onto laws.
Imagine trying to make teenagers law-abiding. You could have two strategies:
1. Rewire the neurons or learning algorithm in their brain such that you can say “the computation done to produce the output of neuron X reliably tracks whether a law has been violated, and because of its connection via neuron Y to neuron Z, if an action is predicted to violate a law, the teenager won’t take it”.
2. Explain to them what the laws are (relying on their existing ability to understand English, albeit fuzzily), and give them incentives to follow them.
I feel much better about 2 than 1.
When you say “programming AI to follow law” I imagine case 1 above (but for AI systems instead of humans). Certainly the OP seemed to be arguing for this case. This is the thing I think is extremely difficult.
I am much happier about AI systems learning about the law via case 2 above, which would enable the AI police applications I mentioned above.
However, some ML people I have talked about this with have given positive feedback, so I think you might be overestimating the difficulty.
I suspect they are thinking about case 2 above? Or they might be thinking of self-driving car type applications where you have an in-code representation of the world? Idk, I feel confident enough of this that I’d predict that there is a miscommunication somewhere, rather than an actual strong difference of opinion between me and them.
Certainly you still need legal accountability—why wouldn’t we have that? If we solve alignment, then we can just have the AI’s owner be accountable for any law-breaking actions the AI takes.
I agree that that is a very good and desirable step to take. However, as I said, it also incentivizes the AI agent to obfuscate its actions and intentions to save its principal. In the human context, human agents do this, but they are independently disincentivized from breaking the law because they face legal liability for their actions. I want (and I suspect you also want) AI systems to have such incentivization.
If I understand correctly, you identify two ways to do this in the teenager analogy:
1. Rewiring
2. Explaining laws and their consequences and letting the agent’s existing incentives do the rest.
I could be wrong about this, but ultimately, for AI systems, it seems like both are actually similarly difficult. As you’ve said, for 2. to be most effective, you probably need “AI police.” Those police will need a way of interpreting the legality of an AI agent’s {“mental” state; actions} and mapping them onto existing laws.
But if you need to do that for effective enforcement, I don’t see why (from a societal perspective) we shouldn’t just do that on the actor’s side and not the “police’s” side. Baking the enforcement into the agents has the benefits of:
Not incentivizing an arms race
Giving the enforcers a clearer picture of the AI’s “mental state”
I want (and I suspect you also want) AI systems to have such incentivization.
Not obviously. My point is just that if the AI is aligned with a human principal, and that human principal can be held accountable for the AI’s actions, then that automatically disincentivizes AI systems from breaking the law.
(I’m not particularly opposed to AI systems being disincentivized directly, e.g. by making it possible to hold AI systems accountable for their actions. It just doesn’t seem necessary in the world where we’ve solved alignment.)
I don’t see why (from a societal perspective) we shouldn’t just do that on the actor’s side and not the “police’s” side.
I agree that doing it on the actor’s side is better if you can ensure it for all actors, but you have to also prevent the human principal from getting a different actor that isn’t bound by law.
E.g. if you have a chauffeur who refuses to exceed the speed limit (in a country where the speed limit that’s actually enforced is 10mph higher), you fire that chauffeur and find a different one.
(Also, I’m assuming you’re teaching the agent to follow the law via something like case 2 above, where you have it read the law and understand it using its existing abilities, and then train it somehow to not break the law. If you were instead thinking something like case 1, I’d make the second argument that it isn’t likely to work.)
Imagine trying to make teenagers law-abiding. You could have two strategies:
1. Rewire the neurons or learning algorithm in their brain such that you can say “the computation done to produce the output of neuron X reliably tracks whether a law has been violated, and because of its connection via neuron Y to neuron Z, if an action is predicted to violate a law, the teenager won’t take it”.
2. Explain to them what the laws are (relying on their existing ability to understand English, albeit fuzzily), and give them incentives to follow them.
I feel much better about 2 than 1.
What if they also have access to nukes or other weapons that could prevent them or their owners from being held accountable if they’re used?
EDIT: Hmm, maybe they need strong incentives to check in with law enforcement periodically? This would be bounded per interval of time, and also (much) greater in absolute value than any other reward they could get per period.
What if they also have access to nukes or other weapons that could prevent them or their owners from being held accountable if they’re used?
I’m going to interpret this as:
Assume that the owners are misaligned w.r.t the rest of humanity (controversial, to me at least).
Assume that enforcement is impossible.
Under these assumptions, I feel better about 1 than 2, in the sense that case 1 feels like a ~5% chance of success while case 2 feels like a ~0% chance of success. (Numbers made up of course.)
But this seems like a pretty low-probability way the world could be (I would bet against both assumptions), and the increase in EV from work on it seems pretty low (since you only get 5% chance of success), so it doesn’t seem like a strong argument to focus on case 1.
Assume that the owners are misaligned w.r.t the rest of humanity (controversial, to me at least).
Couldn’t the AI end up misaligned with the owners by accident, even if they’re aligned with the rest of humanity? The question is whether 1 or 2 is better at aligning the AI in cases where enforcement is impossible or explicitly prevented.
I edited my comment above before I got your reply to include the possibility of the AI being incentivized to ensure it gets monitored by law enforcement. Its reward function could look like
f(x) + \sum_{i=1}^{\infty} I_{M_i}(x)
where f is bounded to have a range of length ≤ 1, and I_{M_i}(x) is 1 if the AI is monitored by law enforcement in period i (and passes some test) and 0 otherwise. You could put an upper bound on the number of periods or use discounting to ensure the second term can’t evaluate to infinity, since that would allow f to be ignored (maybe the AI will predict its expected lifetime to be infinite), but this would eventually allow f to overcome the I_{M_i} terms.
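Written out, the two bounded variants mentioned above might look like this (the horizon N and discount factor γ are hypothetical choices, not anything specified here):

f(x) + \sum_{i=1}^{N} I_{M_i}(x) \qquad \text{or} \qquad f(x) + \sum_{i=1}^{\infty} \gamma^{i} I_{M_i}(x), \quad 0 < \gamma < 1.

In the finite-horizon version each check-in is still worth at least the full range of f (since that range is ≤ 1); in the discounted version γ^i eventually drops below that range, which is the sense in which f would eventually overcome the I_{M_i} terms.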
Couldn’t the AI end up misaligned with the owners by accident, even if they’re aligned with the rest of humanity?
Yes, but as I said earlier, I’m assuming the alignment problem has already been solved when talking about enforcement. I am not proposing enforcement as a solution to alignment.
If you haven’t solved the alignment problem, enforcement doesn’t help much, because you can’t rely on your AI-enabled police to help catch the AI-enabled criminals, because the police AI itself may not be aligned with the police.
The question is whether 1 or 2 is better at aligning the AI in cases where enforcement is impossible or explicitly prevented.
Case 2 is assuming that you already have an intelligent agent with motivations, and then trying to deal with that after the fact. I agree this is not going to work for alignment. If for some reason I could only do 1 or 2 for alignment, I would try 1. (But there are in fact a bunch of other things that you can do.)
But AI-enabled police would be able to probe actions, infer motives, and detect bad behavior better than humans could. In addition, AI systems could have fewer rights than humans, and could be designed to be more transparent than humans, making the police’s job easier.
Isn’t most of this after a crime has already been committed? Is that enough if it’s an existential risk? To handle this, would we want continuous monitoring of autonomous AIs, at which point aren’t we actually just taking their autonomy away?
Also, if we want to automate “detect bad behavior”, wouldn’t that require AI alignment, too? If we don’t fully automate it, then can we be confident that humans can keep up with everything they need to check themselves, given that AIs could work extremely fast? AIs might learn how much work humans can keep up with and then overwhelm them.
Furthermore, AIs may be able to learn new ways of hiding things from the police, so there could be gaps where the police are trying to catch up.
Cullen’s argument was “alignment may not be enough, even if you solve alignment you might still want to program your AI to follow the law because <reasons>.” So in my responses I’ve been assuming that we have solved alignment; I’m arguing that after solving alignment, AI-powered enforcement will probably be enough to handle the problems Cullen is talking about. Some quotes from Cullen’s comment (emphasis mine):
Reasons other than directly getting value alignment from law that you might want to program AI to follow the law
We will presumably want organizations with AI to be bound by law.
We don’t want to rely on the incentives of human principals to ensure their agents advance their goals in purely legal ways
Some responses to your comments:
if we want to automate “detect bad behavior”, wouldn’t that require AI alignment, too?
Yes, I’m assuming we’ve solved alignment here.
Isn’t most of this after a crime has already been committed?
Good enforcement is also a deterrent against crime (someone without any qualms about murder will still usually not murder because of the harsh penalties and chance of being caught).
Furthermore, AIs may be able to learn new ways of hiding things from the police, so there could be gaps where the police are trying to catch up.
Remember that the police are also AI-enabled, and can find new ways of detecting things. Even so, this is possible: but it’s also possible today, without AI: criminals presumably constantly find new ways of hiding things from the police.
You might say that we could train an AI system to learn what is and isn’t breaking the law; but then you might as well train an AI system to learn what is and isn’t the thing you want it to do. It’s not clear why training to follow laws would be easier than training it to do what you want; the latter would be a much more useful AI system.
Some reasons why this might be true:
Law is less indeterminate than you might think, and probably more definite than human values
Law has authoritative corpora readily available
Law has built-in, authoritative adjudication/dispute resolution mechanisms. Cf. AI Safety by Debate.
In general, my guess is that there is a large space of actions that:
1. Are unaligned, and
2. Are illegal, and
3. Due to the formality of parts of law and the legal process, an AI can be made to have higher confidence that an action is (2) than (1).
However, it’s very possible that, as you suggest, solving AI legal compliance requires solving AI Safety generally. This seems somewhat unlikely to me but I have low confidence in this since I’m not an expert. :-)
Law is less indeterminate than you might think, and probably more definite than human values
Agreed that “human values” is harder and more indeterminate, because it’s a tricky philosophical problem that may not even have a solution.
I don’t think “alignment” is harder or more indeterminate, where “alignment” means something like “I have in mind something I want the AI system to do, it does that thing, without trying to manipulate me / deceive me etc.”
Like, idk, imagine there was a law that said “All AI systems must not deceive their users, and must do what they believe their users want”. A real law would probably only be slightly more explicit than that? If so, just creating an AI system that followed only this law would lead to something that meets the criterion I’m imagining. Creating an AI system that follows all laws seems a lot harder.
Due to the formality of parts of law and the legal process, an AI can be made to have higher confidence that an action is (2) than (1).
I think this would probably have been true of expert systems but not so true of deep learning-based systems.
Also, personally I find it easier to tell when my actions are unaligned with <person X whom I know> than when my actions are illegal.
Thanks Rohin!
I don’t think “alignment” is harder or more indeterminate, where “alignment” means something like “I have in mind something I want the AI system to do, it does that thing, without trying to manipulate me / deceive me etc.”
Yeah, I agree with this.
imagine there was a law that said “All AI systems must not deceive their users, and must do what they believe their users want”. A real law would probably only be slightly more explicit than that?
I’m not sure that’s true. (Most) real laws have huge bodies of interpretative text surrounding them and examples of real-world applications of them to real-world facts.
Creating an AI system that follows all laws seems a lot harder.
Lawyers approximate generalists: they can take arbitrary written laws and give advice on how to conform behavior to those laws. So a lawyerlike AI might be able to learn general interpretative principles and research skills and be able to simulate legal adjudications of proposed actions.
I think this would probably have been true of expert systems but not so true of deep learning-based systems.
Interesting; I don’t have good intuitions on this!
(Most) real laws have huge bodies of interpretative text surrounding them and examples of real-world applications of them to real-world facts.
Right, I was trying to factor this part out, because it seemed to me that the hope was “the law is explicit and therefore can be programmed in”. But if you want to include all of the interpretative text and examples of real-world application, it starts looking more like “here is a crap ton of data about this law, please understand what this law means and then act in accordance to it”, as opposed to directly hardcoding in the law.
Under this interpretation (which may not be what you meant) this becomes a claim that laws have a lot more data that pinpoints what exactly they mean, relative to something like “what humans want”, and so an AI system will more easily pinpoint it. I’m somewhat sympathetic to this claim, though I think there is a lot of data about “what humans want” in everyday life that the AI can learn from. But my real reason for not caring too much about this is that in this story we rely on the AI’s “intelligence” to “understand” laws, as opposed to “programming it in”; given that we’re worried about superintelligent AI it should be “intelligent” enough to “understand” what humans want as well (given that humans seem to be able to do that).
Lawyers approximate generalists: they can take arbitrary written laws and give advice on how to conform behavior to those laws. So a lawyerlike AI might be able to learn general interpretative principles and research skills and be able to simulate legal adjudications of proposed actions.
I’m not sure what you’re trying to imply with this—does this make the AI’s task easier? Harder? The generality somehow implies that the AI is safer?
Like, I don’t get why this point has any bearing on whether it is better to train “lawyerlike AI” or “AI that tries to do what humans want”. If anything, I think it pushes in the “do what humans want” direction, since historically it has been very difficult to create generalist AIs, and easier to create specialist AIs.
(Though I’m not sure I think “AI that tries to do what humans want” is less “general” than lawyerlike AI.)
But my real reason for not caring too much about this is that in this story we rely on the AI’s “intelligence” to “understand” laws, as opposed to “programming it in”; given that we’re worried about superintelligent AI it should be “intelligent” enough to “understand” what humans want as well (given that humans seem to be able to do that).
My intuition is that more formal systems will be easier for AI to understand earlier in the “evolution” of SOTA AI intelligence than less-formal systems. Since law is more formal than human values (including both the way it’s written and the formal significance of interpretative texts), then we might get good law-following before good value alignment.
I’m not sure what you’re trying to imply with this—does this make the AI’s task easier? Harder? The generality somehow implies that the AI is safer?
Sorry. I was responding to the “all laws” point. My point was that I think that making a law-following AI that can follow (A) all enumerated laws is not much harder than one that can be made to follow (B) any given law. That is, difficulty of construction scales sub-linearly with the number of laws it needs to follow. The interpretative tools that get you to (B) should be pretty generalizable to (A).
My intuition is that more formal systems will be easier for AI to understand earlier in the “evolution” of SOTA AI intelligence than less-formal systems.
I agree for fully formal systems (e.g. solving SAT problems), but don’t agree for “more formal” systems like law.
Mostly I’m thinking that understanding law would require you to understand language, but once you’ve understood language you also understand “what humans want”. You could imagine a world in which AI systems understand the literal meaning of language but don’t grasp the figurative / pedagogic / Gricean aspects of language, and in that world I think AI systems will understand law earlier than normal English, but that doesn’t seem to be the world we live in:
GPT-2 and other language models don’t seem particularly literal.
We have way more training data about natural language as it is normally used (most of the Internet), relative to natural language meant to be interpreted mostly literally.
Humans find it easier / more “native” to interpret language in the figurative / pedagogic way than to interpret it in the literal way.
Makes sense, that seems true to me.
My point was that I think that making a law-following AI that can follow (A) all enumerated laws is not much harder than one that can be made to follow (B) any given law.
The key difference in my mind is that the AI system does not need to determine the relative authoritativeness of different pronouncements of human value, since the legal authoritativeness of e.g. caselaw is pretty formalized. But I agree that this is less of an issue if the primary route to alignment is just getting an AI to follow the instructions of its principal.
Yeah, I certainly feel better about learning law relative to learning the One True Set of Human Values That Shall Then Be Optimized Forevermore.
I suspect current laws capture enough of what we care about that an AGI following them “properly” wouldn’t, in expectation, lead to worse outcomes than having no AGI at all, but there could be holes to exploit, and “properly” is where the challenge is, as you suggest. Many laws would have to be interpreted more broadly than before, perhaps.
You might say that we could train an AI system to learn what is and isn’t breaking the law; but then you might as well train an AI system to learn what is and isn’t the thing you want it to do.
Isn’t interpreting statements (e.g. laws) and checking if they apply to a given action a narrower, more structured and better-defined problem than getting AI to do what we want it to do? If the AI can find an interpretation of a law according to which an action would break it with high enough probability, then that action would be ruled out. This seems like it could be a problem of reasoning and understanding language, instead of the problem of understanding and acting in line with human values.
To illustrate, “Maximize paperclips without killing anyone” is not an interpretation of “Maximize paperclips”, but “Any particular person dies at least 1 day earlier with probability > p than they would have by inaction” could be an interpretation of “produce death” (although it might be better to rewrite laws in more specific numeric terms in the first place).
Defining a good search space (and search method) for interpretations of a given statement might still be a very difficult problem, though.
To illustrate, “Maximize paperclips without killing anyone” is not an interpretation of “Maximize paperclips”
Huh? If I ask someone to manage my paperclip factory, I certainly do expect them to interpret that request to include “and also don’t kill anyone”.
This seems like it could be a problem of reasoning and understanding language, instead of the problem of understanding and acting in line with human values.
I feel like the word “values” makes this sound more complex than it is, and I’d say we instead want the agent to understand and act in line with what the human wants / intends.
This is then also a problem of reasoning and understanding language: when I say “please help me write good education policy laws”, if it understands language and reason, and acts based on that, that seems pretty aligned to me.
Isn’t interpreting statements (e.g. laws) and checking if they apply to a given action a narrower, more structured and better-defined problem than getting AI to do what we want it to do?
I am not a law expert, but my impression is that there is a lot of common sense + human judgment in the application of laws, just as there is a lot of common sense + human judgment in interpreting requests.
Huh? If I ask someone to manage my paperclip factory, I certainly do expect them to interpret that request to include “and also don’t kill anyone”.
That’s what you want, but the sentence “Maximize paperclips” doesn’t imply it through any literal interpretation, nor does “Maximize paperclips” imply “maximize paperclips while killing at least one person”. What I’m looking for is logical equivalence, and adding qualifiers about whether or not people are killed breaks equivalence.
This is then also a problem of reasoning and understanding language: when I say “please help me write good education policy laws”, if it understands language and reason, and acts based on that, that seems pretty aligned to me.
I think much more is hidden in “good”, which is something people have a problem specifying fully and explicitly. The law is more specific and explicit, although it could be improved significantly.
I am not a law expert, but my impression is that there is a lot of common sense + human judgment in the application of laws, just as there is a lot of common sense + human judgment in interpreting requests.
That’s true. I looked at the US Code’s definition of manslaughter and it could, upon a literal interpretation, imply that helping someone procreate is manslaughter, because bringing someone into existence causes their death. That law would have to be rewritten, perhaps along the lines of “Any particular person dies at least x earlier with probability > p than they would have by inaction”, or something closer to the definition of stochastic dominance for time of death (it could be a disjunction of statements). These are just first attempts, but I think they could be refined enough to capture a prohibition on killing humans to our satisfaction, and the AI wouldn’t need to understand vague and underspecified words like “good”.
We would then do this one by one for each law, but spend a disproportionate amount of time on the more important laws to get them right.
(Note that laws don’t cover nonidentity cases, as far as I know.)
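As a very rough sketch (an illustration, not a claim about how such a law would actually be drafted), the proposed rewrite might be formalized as a constraint on a candidate action A, with T_j the time of death of person j and x, p thresholds to be chosen:

\text{prohibit } A \text{ if } \exists j : \Pr\big[\, T_j(A) \le T_j(\text{inaction}) - x \,\big] > p.

The stochastic-dominance / disjunction-of-statements variant mentioned above would replace this single threshold with a family of such conditions over different values of x and p.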
If you want literal interpretations, specificity, and explicitness, I think you’re in for a bad time:
“Any particular person dies at least x earlier with probability > p than they would have by inaction”
How do you intend to define “person” in terms of the inputs to an AI system (let’s assume a camera image)? How do you compute the “probability” of an event? What is “inaction”?
(There’s also the problem that all actions probably change who does and doesn’t exist, so this law would require the AI system to always take inaction, making it useless.)
How do you intend to define “person” in terms of the inputs to an AI system (let’s assume a camera image)?
Can we just define them as we normally do, e.g. biologically with a functioning brain? Is the concern that AIs won’t be able to tell which inputs represent real things from those that won’t? Or they just won’t be able to apply the definitions correctly generally enough?
How do you compute the “probability” of an event?
The AI would do this. Are AIs that aren’t good at estimating probabilities of events smart enough to worry about? I suppose they could be good at estimating probabilities in specific domains but not generally, or have some very specific failure cases that could be catastrophic.
What is “inaction”?
The AI waits for the next request, turns off, or takes some other inconsequential default action.
(There’s also the problem that all actions probably change who does and doesn’t exist, so this law would require the AI system to always take inaction, making it useless.)
Maybe my wording didn’t capture this well, but my intention was a presentist/necessitarian person-affecting approach (not that I agree with the ethical position). I’ll try again:
“A particular person will have been born with action A and with inaction, and will die at least x earlier with probability > p with A than they would have with inaction.”
Can we just define them as we normally do, e.g. biologically with a functioning brain?
How do you define “biological” and “brain”? Again, your input is a camera image, so you have to build this up starting from sentences of the form “the pixel in the top left corner is this shade of grey”.
(Or you can choose some other input, as long as we actually have existing technology that can create that input.)
The AI would do this. Are AIs that aren’t good at estimating probabilities of events smart enough to worry about?
Powerful AIs will certainly behave in ways that make it look like they are estimating probabilities.
Let’s take AIs trained by deep reinforcement learning as an example. If you want to encode something like “Any particular person dies at least x earlier with probability > p than they would have by inaction” explicitly and literally in code, you will need functions like getAllPeople() and getProbability(event). AIs do not usually come equipped with such functions, so you either have to say how to use the AI system to implement those functions, or you have to implement them yourself. I am claiming that the second option is hard, and any solution you have for the first option will probably also work for something like telling the AI system to “do what the user wants”.
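To make that concrete, here is a purely hypothetical sketch (not runnable, since the primitives it relies on don’t exist) of what encoding that rule “explicitly and literally in code” would demand:

```python
# Hypothetical sketch only: getAllPeople, getProbability, predictOutcome and
# deathTime are exactly the kind of primitives a deep RL agent does not come
# equipped with, which is the point being made above.

X_DAYS = 1.0   # hypothetical threshold: how much earlier counts as "dying earlier"
P_MAX = 1e-6   # hypothetical probability threshold from the rule

def violates_rule(action, world_model, inaction) -> bool:
    """True if some particular person is predicted to die at least X_DAYS earlier
    with probability > P_MAX under `action` than under `inaction`."""
    baseline = world_model.predictOutcome(inaction)   # needs a model of "inaction"
    for person in world_model.getAllPeople():         # needs a definition of "person"
        def dies_earlier(outcome):
            return outcome.deathTime(person) <= baseline.deathTime(person) - X_DAYS
        # needs calibrated probabilities over the outcomes of `action`
        if world_model.getProbability(dies_earlier, given=action) > P_MAX:
            return True
    return False
```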
The AI waits for the next request, turns off, or takes some other inconsequential default action.
If you’re a self-driving car, it’s very unclear what an inconsequential default action is. (Though I agree in general there’s often some default action that is fine.)
Maybe my wording didn’t capture this well, but my intention was a presentist/necessitarian person-affecting approach (not that I agree with the ethical position).
I mean, the existence part was not the main point—my point was that if butterfly effects are real, then the AI system must always do nothing (even if it can’t predict what the butterfly effects would be). If you want to avoid debates about population ethics, you could imagine butterfly effects that affect current people: e.g. you slightly change who talks to whom, which changes whether a person gets hit by a car later in the day or not.
I’m not arguing that these sorts of butterfly effects are real—I’m not sure—but it seems bad for the behavior of our AI system to depend so strongly on whether butterfly effects are real.
Maybe this cuts to the chase: should we expect AIs to be able to know or do anything in particular well “enough”? I.e., is there one thing in particular we can say AIs will be good at and only get wrong extremely rarely? Is solving this as hard as technical AI alignment in general?
How do you define “biological” and “brain”? Again, your input is a camera image, so you have to build this up starting from sentences of the form “the pixel in the top left corner is this shade of grey”.
These are things it would be trained to learn. It would learn to read and could read biology textbooks and papers or things online, and it would also see pictures of people, brains, etc..
AIs do not usually come equipped with such functions, so you either have to say how to use the AI system to implement those functions, or you have to implement them yourself.
This could be an explicit output we train the AI to predict (possibly part of responses in language).
I mean, the existence part was not the main point—my point was that if butterfly effects are real, then the AI system must always do nothing (even if it can’t predict what the butterfly effects would be). If you want to avoid debates about population ethics, you could imagine butterfly effects that affect current people: e.g. you slightly change who talks to whom, which changes whether a person gets hit by a car later in the day or not.
I “named” a particular person in that sentence. The probability that what I do leads to an earlier death for John Doe is extremely small, and that’s the probability that I’m constraining, for each person separately. This will also in practice prevent the AI from conducting murder lotteries up to a certain probability of being killed, but this probability might be too high, so you could have separate constraints on causing an earlier death for a random person, or on the change in average life expectancy in the world, etc.
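Spelled out, the separate constraints described here might look something like this (the thresholds p, q, and δ are hypothetical):

\forall j \text{ alive at decision time}: \Pr\big[\, T_j(A) \le T_j(\text{inaction}) - x \,\big] \le p,

plus, e.g., a bound q on the probability that at least one (or a randomly chosen) person dies at least x earlier, or a bound δ on the drop in average life expectancy relative to inaction. The first condition caps each individual’s risk separately, but on its own it still permits murder lotteries whose per-person probability stays under p, which is why the extra aggregate constraints are suggested.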
These are things it would be trained to learn. It would learn to read and could read biology textbooks and papers or things online, and it would also see pictures of people, brains, etc..
It really sounds like this sort of training is going to require it to be able to interpret English the way we interpret English (e.g. to read biology textbooks); if you’re going to rely on that I don’t see why you don’t want to rely on that ability when we are giving it instructions.
This could be an explicit output we train the AI to predict (possibly part of responses in language).
That… is ambitious, if you want to do this for every term that exists in laws. But I agree that if you did this, you could try to “translate” laws into code in a literal fashion. I’m fairly confident that this would still be pretty far from what you wanted, because laws aren’t meant to be literal, but I’m not going to try to argue that here.
(Also, it probably wouldn’t be computationally efficient—that “don’t kill a person” law, to be implemented literally in code, would require you to loop over all people, and make a prediction for each one: extremely expensive.)
I “named” a particular person in that sentence.
Ah, I see. In that case I take back my objection about butterfly effects.
I am not a law expert, but my impression is that there is a lot of common sense + human judgment in the application of laws, just as there is a lot of common sense + human judgment in interpreting requests.
(I am a lawyer by training.)
Yes, this is certainly true. Many laws explicitly or implicitly rely on standards (i.e., less-definite adjudicatory formulas) rather than hard-and-fast rules. “Reasonableness,” for example, is often a key term in a legal claim or defense. Juries often make such determinations, which also means that the actual legality of an action is often resolved only upon adjudication and not ex ante (although an aligned, capable AI could in principle simulate the probability that a jury would find its actions reasonable—that’s what lawyers do).
I feel like the word “values” makes this sound more complex than it is, and I’d say we instead want the agent to understand and act in line with what the human wants / intends.
Doesn’t “wants / intends” make this sound less complex than it is? To me this phrasing connotes (not to say you actually believe this) that the goal is for AIs to understand short-term human desires, without accounting for ways in which our wants contradict what we would value in the long term, or ways that individuals’ wants can conflict. Once we add caveats like “what we would want / intend after sufficient rational reflection,” my sense is that “values” just captures that more intuitively. I haven’t surveyed people on this, though, so this definitely isn’t a confident claim on my part.
Once we add caveats like “what we would want / intend after sufficient rational reflection,” my sense is that “values” just captures that more intuitively.
I in fact don’t want to add in those caveats here: I’m suggesting that we tell our AI system to do what we short-term want. (Of course, we can then “short-term want” to do more rational reflection, or to be informed of true and useful things that help us make moral progress, etc.)
I agree that “values” more intuitively captures the thing with all the caveats added in.
Thanks Rohin!
Yeah, I agree with this.
I’m not sure that’s true. (Most) real laws have huge bodies of interpretative text surrounding them and examples of real-world applications of them to real-world facts.
Lawyers approximate generalists: they can take arbitrary written laws and give advice on how to conform behavior to those laws. So a lawyerlike AI might be able to learn general interpretative principles and research skills and be able to simulate legal adjudications of proposed actions.
Interesting; I don’t have good intuitions on this!
Right, I was trying to factor this part out, because it seemed to me that the hope was “the law is explicit and therefore can be programmed in”. But if you want to include all of the interpretative text and examples of real-world application, it starts looking more like “here is a crap ton of data about this law, please understand what this law means and then act in accordance to it”, as opposed to directly hardcoding in the law.
Under this interpretation (which may not be what you meant) this becomes a claim that laws have a lot more data that pinpoints what exactly they mean, relative to something like “what humans want”, and so an AI system will more easily pinpoint it. I’m somewhat sympathetic to this claim, though I think there is a lot of data about “what humans want” in everyday life that the AI can learn from. But my real reason for not caring too much about this is that in this story we rely on the AI’s “intelligence” to “understand” laws, as opposed to “programming it in”; given that we’re worried about superintelligent AI it should be “intelligent” enough to “understand” what humans want as well (given that humans seem to be able to do that).
I’m not sure what you’re trying to imply with this—does this make the AIs task easier? Harder? The generality somehow implies that the AI is safer?
Like, I don’t get why this point has any bearing on whether it is better to train “lawyerlike AI” or “AI that tries to do what humans want”. If anything, I think it pushes in the “do what humans want” direction, since historically it has been very difficult to create generalist AIs, and easier to create specialist AIs.
(Though I’m not sure I think “AI that tries to do what humans want” is less “general” than lawyerlike AI.)
My intuition is that more formal systems will be easier for AI to understand earlier in the “evolution” of SOTA AI intelligence than less-formal systems. Since law is more formal than human values (including both the way it’s written and the formal significance of interpretative texts), then we might get good law-following before good value alignment.
Sorry. I was responding to the “all laws” point. My point was that I think that making a law-following AI that can follow (A) all enumerated laws is not much harder than one that can be made to follow (B) any given law. That is, difficulty of construction scales sub-linearly with the number of laws it needs to follow. The interpretative tools that should get to (B) should be pretty generalizable to (A).
I agree for fully formal systems (e.g. solving SAT problems), but don’t agree for “more formal” systems like law.
Mostly I’m thinking that understanding law would require you to understand language, but once you’ve understood language you also understand “what humans want”. You could imagine a world in which AI systems understand the literal meaning of language but don’t grasp the figurative / pedagogic / Gricean aspects of language, and in that world I think AI systems will understand law earlier than normal English, but that doesn’t seem to be the world we live in:
GPT-2 and other language models don’t seem particularly literal.
We have way more training data about natural language as it is normally used (most of the Internet), relative to natural language meant to be interpreted mostly literally.
Humans find it easier / more “native” to interpret language in the figurative / pedagogic way than to interpret it in the literal way.
Makes sense, that seems true to me.
The key difference in my mind is that the AI system does not need to determine the relative authoritativeness of different pronouncements of human value, since the legal authoritativeness of e.g. caselaw is pretty formalized. But I agree that this is less of an issue if the primary route to alignment is just getting an AI to follow the instructions of its principal.
Yeah, I certainly feel better about learning law relative to learning the One True Set of Human Values That Shall Then Be Optimized Forevermore.
I suspect current laws capture enough of what we care about that, if an AGI followed them “properly”, this wouldn’t lead to worse outcomes in expectation than having no AGI at all. But there could be holes to exploit, and “properly” is where the challenge lies, as you suggest. Many laws would perhaps have to be interpreted more broadly than before.
Isn’t interpreting statements (e.g. laws) and checking if they apply to a given action a narrower, more structured and better-defined problem than getting AI to do what we want it to do? If the AI can find an interpretation of a law according to which an action would break it with high enough probability, then that action would be ruled out. This seems like it could be a problem of reasoning and understanding language, instead of the problem of understanding and acting in line with human values.
To illustrate, “Maximize paperclips without killing anyone” is not an interpretation of “Maximize paperclips”, but “Any particular person dies at least 1 day earlier with probability > p than they would have by inaction” could be an interpretation of “produce death” (although it might be better to rewrite laws in more specific numeric terms in the first place).
Defining a good search space (and search method) for interpretations of a given statement might still be a very difficult problem, though.
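As a rough sketch of the kind of interpretation I have in mind (purely illustrative notation of my own): write T_i(A) for person i’s time of death if the AI takes action A, and T_i(∅) for their time of death under inaction. The candidate interpretation above then becomes a constraint that a permissible action must satisfy for every person:

```latex
\forall i \in \text{Persons}: \quad
\Pr\big[\, T_i(A) \le T_i(\varnothing) - 1~\text{day} \,\big] \;\le\; p
```

Any action for which this probability exceeds p for some person would count as “producing death” under this reading and would be ruled out.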
Huh? If I ask someone to manage my paperclip factory, I certainly do expect them to interpret that request to include “and also don’t kill anyone”.
I feel like the word “values” makes this sound more complex than it is, and I’d say we instead want the agent to understand and act in line with what the human wants / intends.
This is then also a problem of reasoning and understanding language: when I say “please help me write good education policy laws”, if the AI understands the language, reasons about the request, and acts based on that, that seems pretty aligned to me.
I am not a law expert, but my impression is that there is a lot of common sense + human judgment in the application of laws, just as there is a lot of common sense + human judgment in interpreting requests.
That’s what you want, but the sentence “Maximize paperclips” doesn’t imply it through any literal interpretation, nor does “Maximize paperclips” imply “maximize paperclips while killing at least one person”. What I’m looking for is logical equivalence, and adding qualifiers about whether or not people are killed breaks equivalence.
I think much more is hidden in “good”, which is something people have a problem specifying fully and explicitly. The law is more specific and explicit, although it could be improved significantly.
That’s true. I looked at the US Code’s definition of manslaughter and it could, upon a literal interpretation, imply that helping someone procreate is manslaughter, because bringing someone into existence causes their death. That law would have to be rewritten, perhaps along the lines of “Any particular person dies at least x earlier with probability > p than they would have by inaction”, or something closer to the definition of stochastic dominance for time of death (it could be a disjunction of statements). These are just first attempts, but I think they could be refined enough to capture a prohibition on killing humans to our satisfaction, and the AI wouldn’t need to understand vague and underspecified words like “good”.
We would then do this one by one for each law, but spend a disproportionate amount of time on the more important laws to get them right.
(Note that laws don’t cover nonidentity cases, as far as I know.)
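To gesture at the stochastic-dominance version mentioned above (again, just a sketch in made-up notation): let F_i^A(t) be the probability that person i is dead by time t if the AI takes action A, and F_i^∅(t) the same under inaction. The requirement would be roughly:

```latex
\forall i \in \text{Persons},\ \forall t: \quad
F_i^{A}(t) \;\le\; F_i^{\varnothing}(t) + \epsilon
```

i.e. the action never makes any particular person appreciably more likely to be dead by any given time than inaction would, with the tolerance ε playing a role analogous to p above.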
If you want literal interpretations, specificity, and explicitness, I think you’re in for a bad time:
How do you intend to define “person” in terms of the inputs to an AI system (let’s assume a camera image)? How do you compute the “probability” of an event? What is “inaction”?
(There’s also the problem that all actions probably change who does and doesn’t exist, so this law would require the AI system to always take inaction, making it useless.)
Can we just define them as we normally do, e.g. biologically, with a functioning brain? Is the concern that AIs won’t be able to tell which inputs represent real things from those that don’t? Or that they just won’t be able to apply the definitions correctly generally enough?
The AI would do this. Are AIs that aren’t good at estimating probabilities of events smart enough to worry about? I suppose they could be good at estimating probabilities in specific domains but not generally, or have some very specific failure cases that could be catastrophic.
The AI waits for the next request, turns off, or takes some other inconsequential default action.
Maybe my wording didn’t capture this well, but my intention was a presentist/necessitarian person-affecting approach (not that I agree with the ethical position). I’ll try again:
“A particular person will have been born with action A and with inaction, and will die at least x earlier with probability > p with A than they would have with inaction.”
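In the same illustrative notation as before, writing B_i(A) for “person i is born given action A”, this is just the earlier constraint with an existence condition added, so that it only binds for people who would exist under both the action and inaction:

```latex
\forall i: \quad
\Pr\big[\, B_i(A) \wedge B_i(\varnothing) \wedge T_i(A) \le T_i(\varnothing) - x \,\big] \;\le\; p
```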
How do you define “biological” and “brain”? Again, your input is a camera image, so you have to build this up starting from sentences of the form “the pixel in the top left corner is this shade of grey”.
(Or you can choose some other input, as long as we actually have existing technology that can create that input.)
Powerful AIs will certainly behave in ways that make it look like they are estimating probabilities.
Let’s take AIs trained by deep reinforcement learning as an example. If you want to encode something like “Any particular person dies at least x earlier with probability > p than they would have by inaction” explicitly and literally in code, you will need functions like getAllPeople() and getProbability(event). AIs do not usually come equipped with such functions, so you either have to say how to use the AI system to implement those functions, or you have to implement them yourself. I am claiming that the second option is hard, and any solution you have for the first option will probably also work for something like telling the AI system to “do what the user wants”.
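To make this concrete, here is roughly the shape of code you would need (a sketch only; the names are my own illustrative stand-ins for functions like getAllPeople() and getProbability(), and they are stubs precisely because nobody knows how to implement them):

```python
# Sketch (not working code) of a literal encoding of "any particular person
# dies at least x earlier with probability > p than they would have by
# inaction". The two helpers below are stubs because specifying them is
# exactly the hard part.

X_DAYS = 1      # dying at least this much earlier counts as a violation
P_MAX = 1e-6    # maximum allowed probability per person (illustrative)

def get_all_people():
    """Enumerate every person in the world from the AI's raw inputs (hypothetical)."""
    raise NotImplementedError("nobody knows how to implement this")

def get_probability(event) -> float:
    """Return the probability of an arbitrary real-world event (hypothetical)."""
    raise NotImplementedError("nobody knows how to implement this either")

def action_is_illegal(action, inaction) -> bool:
    """Literal check of the rewritten law for one candidate action."""
    for person in get_all_people():
        # Event: this person dies at least X_DAYS earlier under `action`
        # than they would have under `inaction`.
        event = ("dies_earlier_by_at_least", person, X_DAYS, action, inaction)
        if get_probability(event) > P_MAX:
            return True
    return False
```

All of the difficulty lives inside those two stubs; the loop around them is trivial.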
If you’re a self-driving car, it’s very unclear what an inconsequential default action is. (Though I agree in general there’s often some default action that is fine.)
I mean, the existence part was not the main point—my point was that if butterfly effects are real, then the AI system must always do nothing (even if it can’t predict what the butterfly effects would be). If you want to avoid debates about population ethics, you could imagine butterfly effects that affect current people: e.g. you slightly change who talks to whom, which changes whether a person gets hit by a car later in the day or not.
I’m not arguing that these sorts of butterfly effects are real—I’m not sure—but it seems bad for the behavior of our AI system to depend so strongly on whether butterfly effects are real.
Maybe this cuts to the chase: should we expect AIs to be able to know or do any particular thing well “enough”? I.e., is there one thing in particular we can say AIs will be good at and only get wrong extremely rarely? Is solving this as hard as technical AI alignment in general?
These are things it would be trained to learn. It would learn to read, and could read biology textbooks, papers, or things online; it would also see pictures of people, brains, etc.
This could be an explicit output we train the AI to predict (possibly as part of its natural-language responses).
I “named” a particular person in that sentence. The probability that what I do leads to an earlier death for John Doe is extremely small, and that’s the probability I’m constraining, for each person separately. In practice this will also prevent the AI from conducting murder lotteries in which each individual’s probability of being killed exceeds the threshold, but that threshold might be too high, so you could add separate constraints on causing an earlier death for a random person, or on the change in average life expectancy in the world, etc.
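To spell out why the per-person bound alone might not be enough (rough notation again): a lottery that kills one of N randomly chosen people gives each named individual only about a 1/N chance of an earlier death, so it satisfies the per-person constraint whenever 1/N ≤ p, even though it kills someone with certainty. Hence the extra aggregate constraints, which might look something like:

```latex
\Pr\big[\, T_J(A) \le T_J(\varnothing) - x \,\big] \;\le\; q
\quad \text{for a uniformly random person } J,
\qquad \text{or} \qquad
\mathbb{E}\big[\, \overline{L}(\varnothing) - \overline{L}(A) \,\big] \;\le\; \delta
```

where L̄ is average life expectancy in the world and q, δ are separate thresholds chosen to bite at the aggregate level.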
It really sounds like this sort of training is going to require it to interpret English the way we interpret English (e.g. to read biology textbooks); if you’re going to rely on that, I don’t see why you wouldn’t also rely on that ability when we’re giving it instructions.
That… is ambitious, if you want to do this for every term that exists in laws. But I agree that if you did this, you could try to “translate” laws into code in a literal fashion. I’m fairly confident that this would still be pretty far from what you wanted, because laws aren’t meant to be literal, but I’m not going to try to argue that here.
(Also, it probably wouldn’t be computationally efficient—that “don’t kill a person” law, to be implemented literally in code, would require you to loop over all people, and make a prediction for each one: extremely expensive.)
Ah, I see. In that case I take back my objection about butterfly effects.
(I am a lawyer by training.)
Yes, this is certainly true. Many laws explicitly or implicitly rely on standards (i.e., less-definite adjudicatory formulas) rather than hard-and-fast rules. “Reasonableness,” for example, is often a key term in a legal claim or defense. Juries often make such determinations, which also means that the actual legality of an action is often resolved only upon adjudication, not ex ante (although an aligned, capable AI could in principle estimate the probability that a jury would find its actions reasonable; that is what lawyers do).
Doesn’t “wants / intends” make this sound less complex than it is? To me this phrasing connotes (not to say you actually believe this) that the goal is for AIs to understand short-term human desires, without accounting for ways in which our wants contradict what we would value in the long term, or ways that individuals’ wants can conflict. Once we add caveats like “what we would want / intend after sufficient rational reflection,” my sense is that “values” just captures that more intuitively. I haven’t surveyed people on this, though, so this definitely isn’t a confident claim on my part.
I in fact don’t want to add in those caveats here: I’m suggesting that we tell our AI system to do what we short-term want. (Of course, we can then “short-term want” to do more rational reflection, or to be informed of true and useful things that help us make moral progress, etc.)
I agree that “values” more intuitively captures the thing with all the caveats added in.