I suspect current laws capture enough of what we care about that if an AGI followed them "properly", this wouldn't lead to worse outcomes in expectation than having no AGI at all, but there could be holes to exploit, and "properly" is where the challenge is, as you suggest. Many laws would have to be interpreted more broadly than before, perhaps.
You might say that we could train an AI system to learn what is and isn't breaking the law; but then you might as well train an AI system to learn what is and isn't the thing you want it to do.
Isn't interpreting statements (e.g. laws) and checking if they apply to a given action a narrower, more structured and better-defined problem than getting AI to do what we want it to do? If the AI can find an interpretation of a law according to which an action would break it with high enough probability, then that action would be ruled out. This seems like it could be a problem of reasoning and understanding language, instead of the problem of understanding and acting in line with human values.
To illustrate, "Maximize paperclips without killing anyone" is not an interpretation of "Maximize paperclips", but "Any particular person dies at least 1 day earlier with probability > p than they would have by inaction" could be an interpretation of "produce death" (although it might be better to rewrite laws in more specific numeric terms in the first place).
Defining a good search space (and search method) for interpretations of a given statement might still be a very difficult problem, though.
To illustrate, "Maximize paperclips without killing anyone" is not an interpretation of "Maximize paperclips"
Huh? If I ask someone to manage my paperclip factory, I certainly do expect them to interpret that request to include "and also don't kill anyone".
This seems like it could be a problem of reasoning and understanding language, instead of the problem of understanding and acting in line with human values.
I feel like the word "values" makes this sound more complex than it is, and I'd say we instead want the agent to understand and act in line with what the human wants / intends.
This is then also a problem of reasoning and understanding language: when I say "please help me write good education policy laws", if it understands language and reason, and acts based on that, that seems pretty aligned to me.
Isn't interpreting statements (e.g. laws) and checking if they apply to a given action a narrower, more structured and better-defined problem than getting AI to do what we want it to do?
I am not a law expert, but my impression is that there is a lot of common sense + human judgment in the application of laws, just as there is a lot of common sense + human judgment in interpreting requests.
Huh? If I ask someone to manage my paperclip factory, I certainly do expect them to interpret that request to include "and also don't kill anyone".
That's what you want, but the sentence "Maximize paperclips" doesn't imply it through any literal interpretation, nor does "Maximize paperclips" imply "maximize paperclips while killing at least one person". What I'm looking for is logical equivalence, and adding qualifiers about whether or not people are killed breaks equivalence.
This is then also a problem of reasoning and understanding language: when I say "please help me write good education policy laws", if it understands language and reason, and acts based on that, that seems pretty aligned to me.
I think much more is hidden in "good", which is something people have a problem specifying fully and explicitly. The law is more specific and explicit, although it could be improved significantly.
I am not a law expert, but my impression is that there is a lot of common sense + human judgment in the application of laws, just as there is a lot of common sense + human judgment in interpreting requests.
That's true. I looked at the US Code's definition of manslaughter and it could, upon a literal interpretation, imply that helping someone procreate is manslaughter, because bringing someone into existence causes their death. That law would have to be rewritten, perhaps along the lines of "Any particular person dies at least x earlier with probability > p than they would have by inaction", or something closer to the definition of stochastic dominance for time of death (it could be a disjunction of statements). These are just first attempts, but I think they could be refined enough to capture a prohibition on killing humans to our satisfaction, and the AI wouldn't need to understand vague and underspecified words like "good".
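To make that candidate rewrite a bit more concrete, here is one way to write it down (the notation is my own, and the thresholds x and p are left unspecified): let $T_i(A)$ be person $i$'s time of death if the AI takes action $A$, and $T_i(\varnothing)$ their time of death under inaction. Then $A$ would be prohibited whenever

$$\exists i : \Pr\big[\,T_i(A) \le T_i(\varnothing) - x\,\big] > p.$$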
We would then do this one by one for each law, but spend a disproportionate amount of time on the more important laws to get them right.
(Note that laws don't cover nonidentity cases, as far as I know.)
If you want literal interpretations, specificity, and explicitness, I think you're in for a bad time:
"Any particular person dies at least x earlier with probability > p than they would have by inaction"
How do you intend to define "person" in terms of the inputs to an AI system (let's assume a camera image)? How do you compute the "probability" of an event? What is "inaction"?
(There's also the problem that all actions probably change who does and doesn't exist, so this law would require the AI system to always take inaction, making it useless.)
How do you intend to define "person" in terms of the inputs to an AI system (let's assume a camera image)?
Can we just define them as we normally do, e.g. biologically with a functioning brain? Is the concern that AIs won't be able to tell which inputs represent real things from those that don't? Or that they just won't be able to apply the definitions correctly generally enough?
How do you compute the "probability" of an event?
The AI would do this. Are AIs that aren't good at estimating probabilities of events smart enough to worry about? I suppose they could be good at estimating probabilities in specific domains but not generally, or have some very specific failure cases that could be catastrophic.
What is "inaction"?
The AI waits for the next request, turns off, or takes some other inconsequential default action.
(There's also the problem that all actions probably change who does and doesn't exist, so this law would require the AI system to always take inaction, making it useless.)
Maybe my wording didn't capture this well, but my intention was a presentist/necessitarian person-affecting approach (not that I agree with the ethical position). I'll try again:
"A particular person will have been born with action A and with inaction, and will die at least x earlier with probability > p with A than they would have with inaction."
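In the notation from my earlier comment, and folding the existence condition into the event whose probability is bounded (one way to read the sentence above), action $A$ would be prohibited whenever

$$\exists i : \Pr\big[\,E_i(A) \wedge E_i(\varnothing) \wedge T_i(A) \le T_i(\varnothing) - x\,\big] > p,$$

where $E_i(A)$ means that person $i$ is born given action $A$.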
Can we just define them as we normally do, e.g. biologically with a functioning brain?
How do you define "biological" and "brain"? Again, your input is a camera image, so you have to build this up starting from sentences of the form "the pixel in the top left corner is this shade of grey".
(Or you can choose some other input, as long as we actually have existing technology that can create that input.)
The AI would do this. Are AIs that aren't good at estimating probabilities of events smart enough to worry about?
Powerful AIs will certainly behave in ways that make it look like they are estimating probabilities.
Let's take AIs trained by deep reinforcement learning as an example. If you want to encode something like "Any particular person dies at least x earlier with probability > p than they would have by inaction" explicitly and literally in code, you will need functions like getAllPeople() and getProbability(event). AIs do not usually come equipped with such functions, so you either have to say how to use the AI system to implement those functions, or you have to implement them yourself. I am claiming that the second option is hard, and any solution you have for the first option will probably also work for something like telling the AI system to "do what the user wants".
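Concretely, a literal encoding would have to look something like the sketch below. Nothing in it is a real API: the function names, the thresholds, and the world_state argument are all made up for illustration, and the stubs are exactly the pieces I'm claiming are hard to implement on top of a deep RL agent whose inputs are camera images.

```python
# A sketch of a literal encoding of "no particular person dies at least x earlier
# with probability > p than they would have by inaction". The stubs below are the
# hypothetical getAllPeople()/getProbability() functions discussed above; the
# thresholds are placeholders.

X_EARLIER_DAYS = 1   # assumed threshold: how much earlier a death must be to count
P_MAX = 1e-6         # assumed threshold: maximum allowed probability

def get_all_people(world_state):
    """Enumerate every person in the world, given the agent's model of the world."""
    raise NotImplementedError("Hard part: defining 'person' from camera pixels.")

def get_probability(event, world_state):
    """Return the probability of `event` under the agent's predictive model."""
    raise NotImplementedError("Hard part: extracting calibrated event probabilities.")

def dies_earlier_event(person, action, x_days):
    """Describe the event: `person` dies at least `x_days` earlier under `action`
    than they would have under inaction."""
    return ("dies_at_least_x_days_earlier", person, action, x_days)

def violates_no_killing_law(action, world_state):
    """True if `action` is predicted, for some particular person, to cause a death
    at least X_EARLIER_DAYS earlier with probability above P_MAX."""
    return any(
        get_probability(dies_earlier_event(person, action, X_EARLIER_DAYS), world_state) > P_MAX
        for person in get_all_people(world_state)
    )
```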
The AI waits for the next request, turns off, or takes some other inconsequential default action.
If you're a self-driving car, it's very unclear what an inconsequential default action is. (Though I agree in general there's often some default action that is fine.)
Maybe my wording didn't capture this well, but my intention was a presentist/necessitarian person-affecting approach (not that I agree with the ethical position).
I mean, the existence part was not the main point; my point was that if butterfly effects are real, then the AI system must always do nothing (even if it can't predict what the butterfly effects would be). If you want to avoid debates about population ethics, you could imagine butterfly effects that affect current people: e.g. you slightly change who talks to whom, which changes whether a person gets hit by a car later in the day or not.
I'm not arguing that these sorts of butterfly effects are real (I'm not sure), but it seems bad for the behavior of our AI system to depend so strongly on whether butterfly effects are real.
Maybe this cuts to the chase: Should we expect AIs to be able to know or do anything in particular well "enough"? I.e., is there one thing in particular we can say AIs will be good at and only get wrong extremely rarely? Is solving this as hard as technical AI alignment in general?
How do you define "biological" and "brain"? Again, your input is a camera image, so you have to build this up starting from sentences of the form "the pixel in the top left corner is this shade of grey".
These are things it would be trained to learn. It would learn to read and could read biology textbooks and papers or things online, and it would also see pictures of people, brains, etc.
AIs do not usually come equipped with such functions, so you either have to say how to use the AI system to implement those functions, or you have to implement them yourself.
This could be an explicit output we train the AI to predict (possibly part of responses in language).
I mean, the existence part was not the main point; my point was that if butterfly effects are real, then the AI system must always do nothing (even if it can't predict what the butterfly effects would be). If you want to avoid debates about population ethics, you could imagine butterfly effects that affect current people: e.g. you slightly change who talks to whom, which changes whether a person gets hit by a car later in the day or not.
I "named" a particular person in that sentence. The probability that what I do leads to an earlier death for John Doe is extremely small, and that's the probability that I'm constraining, for each person separately. This will also in practice prevent the AI from conducting murder lotteries up to a certain probability of being killed, but this probability might be too high, so you could have separate constraints for causing an earlier death for a random person, or on the change in average life expectancy in the world, etc., to prevent this.
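To spell out the kind of family of constraints I have in mind (again in my own notation; $J$ is a uniformly random person, and $p'$ and $\delta$ are further thresholds):

$$\forall i:\ \Pr\big[\,T_i(A) \le T_i(\varnothing) - x\,\big] \le p,$$
$$\Pr\big[\,T_J(A) \le T_J(\varnothing) - x\,\big] \le p',$$
$$\mathbb{E}\big[\,\overline{T}(\varnothing)\,\big] - \mathbb{E}\big[\,\overline{T}(A)\,\big] \le \delta,$$

where $\overline{T}$ denotes average life expectancy in the world.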
These are things it would be trained to learn. It would learn to read and could read biology textbooks and papers or things online, and it would also see pictures of people, brains, etc.
It really sounds like this sort of training is going to require it to be able to interpret English the way we interpret English (e.g. to read biology textbooks); if you're going to rely on that I don't see why you don't want to rely on that ability when we are giving it instructions.
This could be an explicit output we train the AI to predict (possibly part of responses in language).
That… is ambitious, if you want to do this for every term that exists in laws. But I agree that if you did this, you could try to "translate" laws into code in a literal fashion. I'm fairly confident that this would still be pretty far from what you wanted, because laws aren't meant to be literal, but I'm not going to try to argue that here.
(Also, it probably wouldn't be computationally efficient; that "don't kill a person" law, to be implemented literally in code, would require you to loop over all people, and make a prediction for each one: extremely expensive.)
I "named" a particular person in that sentence.
Ah, I see. In that case I take back my objection about butterfly effects.
I feel like the word "values" makes this sound more complex than it is, and I'd say we instead want the agent to understand and act in line with what the human wants / intends.
Doesn't "wants / intends" make this sound less complex than it is? To me this phrasing connotes (not to say you actually believe this) that the goal is for AIs to understand short-term human desires, without accounting for ways in which our wants contradict what we would value in the long term, or ways that individuals' wants can conflict. Once we add caveats like "what we would want / intend after sufficient rational reflection," my sense is that "values" just captures that more intuitively. I haven't surveyed people on this, though, so this definitely isn't a confident claim on my part.
Once we add caveats like "what we would want / intend after sufficient rational reflection," my sense is that "values" just captures that more intuitively.
I in fact don't want to add in those caveats here: I'm suggesting that we tell our AI system to do what we short-term want. (Of course, we can then "short-term want" to do more rational reflection, or to be informed of true and useful things that help us make moral progress, etc.)
I agree that "values" more intuitively captures the thing with all the caveats added in.
I am not a law expert, but my impression is that there is a lot of common sense + human judgment in the application of laws, just as there is a lot of common sense + human judgment in interpreting requests.
(I am a lawyer by training.)
Yes, this is certainly true. Many laws explicitly or implicitly rely on standards (i.e., less-definite adjudicatory formulas) rather than hard-and-fast rules. "Reasonableness," for example, is often a key term in a legal claim or defense. Juries often make such determinations, which also means the actual legality of an action is resolved upon adjudication rather than ex ante (although an aligned, capable AI could in principle simulate the probability that a jury would find its actions reasonable; that's what lawyers do).