To illustrate, “Maximize paperclips without killing anyone” is not an interpretation of “Maximize paperclips”.
Huh? If I ask someone to manage my paperclip factory, I certainly do expect them to interpret that request to include “and also don’t kill anyone”.
This seems like it could be a problem of reasoning and understanding language, instead of the problem of understanding and acting in line with human values.
I feel like the word “values” makes this sound more complex than it is, and I’d say we instead want the agent to understand and act in line with what the human wants / intends.
This is then also a problem of reasoning and understanding language: when I say “please help me write good education policy laws”, if it understands language and reason, and acts based on that, that seems pretty aligned to me.
Isn’t interpreting statements (e.g. laws) and checking if they apply to a given action a narrower, more structured and better-defined problem than getting AI to do what we want it to do?
I am not a law expert, but my impression is that there is a lot of common sense + human judgment in the application of laws, just as there is a lot of common sense + human judgment in interpreting requests.
Huh? If I ask someone to manage my paperclip factory, I certainly do expect them to interpret that request to include “and also don’t kill anyone”.
That’s what you want, but the sentence “Maximize paperclips” doesn’t imply it through any literal interpretation, nor does “Maximize paperclips” imply “maximize paperclips while killing at least one person”. What I’m looking for is logical equivalence, and adding qualifiers about whether or not people are killed breaks equivalence.
This is then also a problem of reasoning and understanding language: when I say “please help me write good education policy laws”, if it understands language and reason, and acts based on that, that seems pretty aligned to me.
I think much more is hidden in “good”, which is something people have a problem specifying fully and explicitly. The law is more specific and explicit, although it could be improved significantly.
I am not a law expert, but my impression is that there is a lot of common sense + human judgment in the application of laws, just as there is a lot of common sense + human judgment in interpreting requests.
That’s true. I looked at the US Code’s definition of manslaughter and it could, upon a literal interpretation, imply that helping someone procreate is manslaughter, because bringing someone into existence causes their death. That law would have to be rewritten, perhaps along the lines of “Any particular person dies at least x earlier with probability > p than they would have by inaction”, or something closer to the definition of stochastic dominance for time of death (it could be a disjunction of statements). These are just first attempts, but I think they could be refined enough to capture a prohibition on killing humans to our satisfaction, and the AI wouldn’t need to understand vague and underspecified words like “good”.
We would then do this one by one for each law, but spend a disproportionate amount of time on the more important laws to get them right.
(Note that laws don’t cover nonidentity cases, as far as I know.)
If you want literal interpretations, specificity, and explicitness, I think you’re in for a bad time:
“Any particular person dies at least x earlier with probability > p than they would have by inaction”
How do you intend to define “person” in terms of the inputs to an AI system (let’s assume a camera image)? How do you compute the “probability” of an event? What is “inaction”?
(There’s also the problem that all actions probably change who does and doesn’t exist, so this law would require the AI system to always take inaction, making it useless.)
How do you intend to define “person” in terms of the inputs to an AI system (let’s assume a camera image)?
Can we just define them as we normally do, e.g. biologically with a functioning brain? Is the concern that AIs won’t be able to tell which inputs represent real things from those that don’t? Or that they just won’t be able to apply the definitions correctly generally enough?
How do you compute the “probability” of an event?
The AI would do this. Are AIs that aren’t good at estimating probabilities of events smart enough to worry about? I suppose they could be good at estimating probabilities in specific domains but not generally, or have some very specific failure cases that could be catastrophic.
What is “inaction”?
The AI waits for the next request, turns off, or takes some other inconsequential default action.
(There’s also the problem that all actions probably change who does and doesn’t exist, so this law would require the AI system to always take inaction, making it useless.)
Maybe my wording didn’t capture this well, but my intention was a presentist/necessitarian person-affecting approach (not that I agree with the ethical position). I’ll try again:
“A particular person will have been born with action A and with inaction, and will die at least x earlier with probability > p with A than they would have with inaction.”
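To make the revised condition more explicit, it could be written symbolically. This is only an illustrative formalization: Born and T_death are informal predicates/quantities introduced here, not anything defined in the discussion above.

```latex
% A is the action, \varnothing is inaction; Born(s \mid \cdot) and
% T_{\mathrm{death}}(s \mid \cdot) are informal placeholders for
% "s is born given ..." and "s's time of death given ...".
\exists\, s :\;
  \mathrm{Born}(s \mid A) \;\wedge\; \mathrm{Born}(s \mid \varnothing)
  \;\wedge\;
  \Pr\!\bigl[\, T_{\mathrm{death}}(s \mid A)
      \le T_{\mathrm{death}}(s \mid \varnothing) - x \,\bigr] > p
```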
Can we just define them as we normally do, e.g. biologically with a functioning brain?
How do you define “biological” and “brain”? Again, your input is a camera image, so you have to build this up starting from sentences of the form “the pixel in the top left corner is this shade of grey”.
(Or you can choose some other input, as long as we actually have existing technology that can create that input.)
The AI would do this. Are AIs that aren’t good at estimating probabilities of events smart enough to worry about?
Powerful AIs will certainly behave in ways that make it look like they are estimating probabilities.
Let’s take AIs trained by deep reinforcement learning as an example. If you want to encode something like “Any particular person dies at least x earlier with probability > p than they would have by inaction” explicitly and literally in code, you will need functions like getAllPeople() and getProbability(event). AIs do not usually come equipped with such functions, so you either have to say how to use the AI system to implement those functions, or you have to implement them yourself. I am claiming that the second option is hard, and any solution you have for the first option will probably also work for something like telling the AI system to “do what the user wants”.
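To make the point concrete, here is a toy sketch of what a literal encoding of that law would require. getAllPeople and getProbability are the hypothetical functions named above; they are stubbed out with a toy world model here, precisely because real AI systems do not come equipped with them, and implementing them is the hard part under discussion.

```python
# Toy sketch: literally encoding "any particular person dies at least
# x earlier with probability > p than they would have by inaction".
# The world model and both "magic" functions are hypothetical stubs.

X_YEARS = 1.0   # threshold "x": how much earlier the death must occur
P_MAX = 0.01    # threshold "p": maximum allowed probability

class ToyWorldModel:
    """Stand-in world model mapping (person, action) to the probability
    that the person dies at least X_YEARS earlier than by inaction."""
    def __init__(self, people, early_death_probs):
        self.people = people
        self.early_death_probs = early_death_probs  # {(person, action): prob}

    def getAllPeople(self):                  # hypothetical function
        return self.people

    def getProbability(self, person, action):  # hypothetical function
        return self.early_death_probs.get((person, action), 0.0)

def violates_law(world_model, action):
    # Note the loop over all people: a literal implementation must
    # check the constraint for each person separately.
    return any(world_model.getProbability(person, action) > P_MAX
               for person in world_model.getAllPeople())

world = ToyWorldModel(
    people=["alice", "bob"],
    early_death_probs={("bob", "run_factory_unsafely"): 0.3},
)
print(violates_law(world, "run_factory_unsafely"))  # True
print(violates_law(world, "wait_for_request"))      # False
```

Everything interesting is hidden inside the stubs: the code is trivial once getAllPeople and getProbability exist, which is exactly the claim being made.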
The AI waits for the next request, turns off, or takes some other inconsequential default action.
If you’re a self-driving car, it’s very unclear what an inconsequential default action is. (Though I agree in general there’s often some default action that is fine.)
Maybe my wording didn’t capture this well, but my intention was a presentist/necessitarian person-affecting approach (not that I agree with the ethical position).
I mean, the existence part was not the main point—my point was that if butterfly effects are real, then the AI system must always do nothing (even if it can’t predict what the butterfly effects would be). If you want to avoid debates about population ethics, you could imagine butterfly effects that affect current people: e.g. you slightly change who talks to whom, which changes whether a person gets hit by a car later in the day or not.
I’m not arguing that these sorts of butterfly effects are real—I’m not sure—but it seems bad for the behavior of our AI system to depend so strongly on whether butterfly effects are real.
Maybe this cuts to the chase: should we expect AIs to be able to know or do anything in particular well “enough”? I.e., is there one thing in particular we can say AIs will be good at and get wrong only extremely rarely? Is solving this as hard as technical AI alignment in general?
How do you define “biological” and “brain”? Again, your input is a camera image, so you have to build this up starting from sentences of the form “the pixel in the top left corner is this shade of grey”.
These are things it would be trained to learn. It would learn to read, and could read biology textbooks, papers, and other material online; it would also see pictures of people, brains, etc.
AIs do not usually come equipped with such functions, so you either have to say how to use the AI system to implement those functions, or you have to implement them yourself.
This could be an explicit output we train the AI to predict (possibly part of responses in language).
I mean, the existence part was not the main point—my point was that if butterfly effects are real, then the AI system must always do nothing (even if it can’t predict what the butterfly effects would be). If you want to avoid debates about population ethics, you could imagine butterfly effects that affect current people: e.g. you slightly change who talks to whom, which changes whether a person gets hit by a car later in the day or not.
I “named” a particular person in that sentence. The probability that what I do leads to an earlier death for John Doe is extremely small, and that’s the probability that I’m constraining, for each person separately. This will also in practice prevent the AI from conducting murder lotteries up to a certain probability of being killed, but this probability might be too high, so you could add separate constraints on causing an earlier death for a random person, or on the change in average life expectancy in the world, etc.
These are things it would be trained to learn. It would learn to read, and could read biology textbooks, papers, and other material online; it would also see pictures of people, brains, etc.
It really sounds like this sort of training is going to require it to be able to interpret English the way we interpret English (e.g. to read biology textbooks); if you’re going to rely on that I don’t see why you don’t want to rely on that ability when we are giving it instructions.
This could be an explicit output we train the AI to predict (possibly part of responses in language).
That… is ambitious, if you want to do this for every term that exists in laws. But I agree that if you did this, you could try to “translate” laws into code in a literal fashion. I’m fairly confident that this would still be pretty far from what you wanted, because laws aren’t meant to be literal, but I’m not going to try to argue that here.
(Also, it probably wouldn’t be computationally efficient—that “don’t kill a person” law, to be implemented literally in code, would require you to loop over all people, and make a prediction for each one: extremely expensive.)
I “named” a particular person in that sentence.
Ah, I see. In that case I take back my objection about butterfly effects.
I am not a law expert, but my impression is that there is a lot of common sense + human judgment in the application of laws, just as there is a lot of common sense + human judgment in interpreting requests.
(I am a lawyer by training.)
Yes, this is certainly true. Many laws explicitly or implicitly rely on standards (i.e., less-definite adjudicatory formulas) rather than hard-and-fast rules. “Reasonableness,” for example, is often a key term in a legal claim or defense. Juries often make such determinations, which also means that the actual legality of an action is resolved only upon adjudication, not ex ante (although an aligned, capable AI could in principle estimate the probability that a jury would find its actions reasonable, which is what lawyers do).
I feel like the word “values” makes this sound more complex than it is, and I’d say we instead want the agent to understand and act in line with what the human wants / intends.
Doesn’t “wants / intends” make this sound less complex than it is? To me this phrasing connotes (not to say you actually believe this) that the goal is for AIs to understand short-term human desires, without accounting for ways in which our wants contradict what we would value in the long term, or ways that individuals’ wants can conflict. Once we add caveats like “what we would want / intend after sufficient rational reflection,” my sense is that “values” just captures that more intuitively. I haven’t surveyed people on this, though, so this definitely isn’t a confident claim on my part.
Once we add caveats like “what we would want / intend after sufficient rational reflection,” my sense is that “values” just captures that more intuitively.
I in fact don’t want to add in those caveats here: I’m suggesting that we tell our AI system to do what we short-term want. (Of course, we can then “short-term want” to do more rational reflection, or to be informed of true and useful things that help us make moral progress, etc.)
I agree that “values” more intuitively captures the thing with all the caveats added in.