My guess is that programming AI to follow law might be easier or preferable to enforcing against human-principals. A weakly aligned AI (not X-risk or risk to principals, but not bound by law or general human morality) deployed by a human principal will probably come across illegal ways to advance its principal's goals. It will also probably be able to hide its actions, obscure its motives, and/or evade detection better than humans could. If so, the equilibrium strategy is to give minimal oversight to the AI agent and tacitly allow it to break the law while advancing the principal's goals, since enforcement against the principal is unlikely. This seems bad!
I agree that getting a guarantee of following the law is (probably) better than trying to ensure it through enforcement, all else equal. I also agree that in principle programming the AI to follow the law could give such a guarantee. So in some normative sense, I agree that it would be better if it were programmed to follow the law.
My main argument here is that it is not worth the effort. This factors into two claims:
First, it would be hard to do. I am a programmer / ML researcher and I have no idea how to program an AI to follow the law in some guaranteed way. I also have an intuitive sense that it would be very difficult. I think the vast majority of programmers / ML researchers would agree with me on this.
Second, it doesn't provide much value, because you can get most of the benefits via enforcement, which has the virtue of being the solution we currently use.
It will also probably be able to hide its actions, obscure its motives, and/or evade detection better than humans could.
But AI-enabled police would be able to probe actions, infer motives, and detect bad behavior better than humans could. In addition, AI systems could have fewer rights than humans, and could be designed to be more transparent than humans, making the police's job easier.
First, it would be hard to do. I am a programmer / ML researcher and I have no idea how to program an AI to follow the law in some guaranteed way. I also have an intuitive sense that it would be very difficult. I think the vast majority of programmers / ML researchers would agree with me on this.
This is valuable information. However, some ML people I have talked about this with have given positive feedback, so I think you might be overestimating the difficulty.
Second, it doesn't provide much value, because you can get most of the benefits via enforcement, which has the virtue of being the solution we currently use.
Part of the reason that enforcement works, though, is that human agents have an independent incentive not to break the law (or, e.g., report legal violations) since they are legally accountable for their actions.
But AI-enabled police would be able to probe actions, infer motives, and detect bad behavior better than humans could. In addition, AI systems could have fewer rights than humans, and could be designed to be more transparent than humans, making the police's job easier.
This seems to require the same type of fundamental ML research that I am proposing: mapping AI actions onto laws.
Part of the reason that enforcement works, though, is that human agents have an independent incentive not to break the law (or, e.g., report legal violations) since they are legally accountable for their actions.
Certainly you still need legal accountability; why wouldn't we have that? If we solve alignment, then we can just have the AI's owner be accountable for any law-breaking actions the AI takes.
This seems to require the same type of fundamental ML research that I am proposing: mapping AI actions onto laws.
Imagine trying to make teenagers law-abiding. You could have two strategies:
1. Rewire the neurons or learning algorithm in their brain such that you can say "the computation done to produce the output of neuron X reliably tracks whether a law has been violated, and because of its connection via neuron Y to neuron Z, if an action is predicted to violate a law, the teenager won't take it".
2. Explain to them what the laws are (relying on their existing ability to understand English, albeit fuzzily), and give them incentives to follow them.
I feel much better about 2 than 1.
When you say "programming AI to follow law" I imagine case 1 above (but for AI systems instead of humans). Certainly the OP seemed to be arguing for this case. This is the thing I think is extremely difficult.
I am much happier about AI systems learning about the law via case 2 above, which would enable the AI police applications I mentioned above.
However, some ML people I have talked about this with have given positive feedback, so I think you might be overestimating the difficulty.
I suspect they are thinking about case 2 above? Or they might be thinking of self-driving car type applications where you have an in-code representation of the world? Idk, I feel confident enough of this that I'd predict that there is a miscommunication somewhere, rather than an actual strong difference of opinion between me and them.
Certainly you still need legal accountability; why wouldn't we have that? If we solve alignment, then we can just have the AI's owner be accountable for any law-breaking actions the AI takes.
I agree that that is a very good and desirable step to take. However, as I said, it also incentivizes the AI agent to obfuscate its actions and intentions to save its principal. In the human context, human agents do this too, but they are independently disincentivized from breaking the law because they face legal liability for their actions. I want (and I suspect you also want) AI systems to have such incentivization.
If I understand correctly, you identify two ways to do this in the teenager analogy:
1. Rewiring
2. Explaining laws and their consequences and letting the agent's existing incentives do the rest.
I could be wrong about this, but ultimately, for AI systems, it seems like both are actually similarly difficult. As you've said, for 2. to be most effective, you probably need "AI police." Those police will need a way of interpreting the legality of an AI agent's {"mental" state; actions} and mapping them onto existing laws.
But if you need to do that for effective enforcement, I don't see why (from a societal perspective) we shouldn't just do that on the actor's side and not the "police's" side. Baking the enforcement into the agents has the benefits of:
Not incentivizing an arms race
Giving the enforcers a clearer picture of the AI's "mental state"
I want (and I suspect you also want) AI systems to have such incentivization.
Not obviously. My point is just that if the AI is aligned with a human principal, and that human principal can be held accountable for the AI's actions, then that automatically disincentivizes AI systems from breaking the law.
(I'm not particularly opposed to AI systems being disincentivized directly, e.g. by making it possible to hold AI systems accountable for their actions. It just doesn't seem necessary in the world where we've solved alignment.)
I don't see why (from a societal perspective) we shouldn't just do that on the actor's side and not the "police's" side.
I agree that doing it on the actor's side is better if you can ensure it for all actors, but you have to also prevent the human principal from getting a different actor that isn't bound by law.
E.g. if you have a chauffeur who refuses to exceed the speed limit (in a country where the speed limit that's actually enforced is 10mph higher), you fire that chauffeur and find a different one.
(Also, I'm assuming you're teaching the agent to follow the law via something like case 2 above, where you have it read the law and understand it using its existing abilities, and then train it somehow to not break the law. If you were instead thinking something like case 1, I'd make the second argument that it isn't likely to work.)
Imagine trying to make teenagers law-abiding. You could have two strategies:
1. Rewire the neurons or learning algorithm in their brain such that you can say "the computation done to produce the output of neuron X reliably tracks whether a law has been violated, and because of its connection via neuron Y to neuron Z, if an action is predicted to violate a law, the teenager won't take it".
2. Explain to them what the laws are (relying on their existing ability to understand English, albeit fuzzily), and give them incentives to follow them.
I feel much better about 2 than 1.
What if they also have access to nukes or other weapons that could prevent them or their owners from being held accountable if they're used?
EDIT: Hmm, maybe they need strong incentives to check in with law enforcement periodically? This would be bounded per interval of time, and also (much) greater in absolute sign than any other reward they could get per period.
What if they also have access to nukes or other weapons that could prevent them or their owners from being held accountable if they're used?
I'm going to interpret this as:
Assume that the owners are misaligned w.r.t the rest of humanity (controversial, to me at least).
Assume that enforcement is impossible.
Under these assumptions, I feel better about 1 than 2, in the sense that case 1 feels like a ~5% chance of success while case 2 feels like a ~0% chance of success. (Numbers made up of course.)
But this seems like a pretty low-probability way the world could be (I would bet against both assumptions), and the increase in EV from work on it seems pretty low (since you only get 5% chance of success), so it doesn't seem like a strong argument to focus on case 1.
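One rough way to write down the expected-value point, using the made-up 5% and 0% figures from above. The symbols $p$ (the probability that both assumptions hold) and $V$ (the value of succeeding in that world) are introduced here purely for illustration and do not appear in the comment:

$$\Delta \mathrm{EV} \approx p \cdot (0.05 - 0) \cdot V = 0.05\, p\, V$$

Under this reading, the gain from focusing on case 1 is scaled down twice: once by the small $p$ and once by the 5% success rate.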
Assume that the owners are misaligned w.r.t the rest of humanity (controversial, to me at least).
Couldn't the AI end up misaligned with the owners by accident, even if they're aligned with the rest of humanity? The question is whether 1 or 2 is better at aligning the AI in cases where enforcement is impossible or explicitly prevented.
I edited my comment above before I got your reply to include the possibility of the AI being incentivized to ensure it gets monitored by law enforcement. Its reward function could look like
$$f(x) + \sum_{i=1}^{\infty} I_{M_i}(x)$$
where $f$ is bounded to have a range of length $\le 1$, and $I_{M_i}(x)$ is 1 if the AI is monitored by law enforcement in period $i$ (and passes some test) and 0 otherwise. You could put an upper bound on the number of periods or use discounting to ensure the right term can't evaluate to infinity since that would allow $f$ to be ignored (maybe the AI will predict its expected lifetime to be infinite), but this would eventually allow $f$ to overcome the $I_{M_i}$.
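A minimal sketch of this reward, to make the trade-off concrete. The function name `total_reward`, the choice of [0, 1] as the range of f, the finite list of periods, and the `discount` parameter are all illustrative assumptions, not part of the comment.

```python
from typing import Sequence


def total_reward(f_value: float, monitored: Sequence[bool], discount: float = 1.0) -> float:
    """Task reward f plus the (optionally discounted) sum of monitoring indicators.

    f_value:   value of the bounded task reward f, assumed here to lie in [0, 1].
    monitored: monitored[i - 1] is True if the AI checked in (and passed) in period i.
    discount:  per-period discount factor; 1.0 means no discounting.
    """
    if not 0.0 <= f_value <= 1.0:
        raise ValueError("f is assumed to be bounded in [0, 1]")
    # Each monitored period i contributes discount**i (exactly 1 when undiscounted).
    monitoring = sum(discount ** i for i, ok in enumerate(monitored, start=1) if ok)
    return f_value + monitoring


# Undiscounted: skipping one check-in costs 1, which f can at best offset, never beat.
compliant = total_reward(0.0, [True, True, True])   # 3.0
defector = total_reward(1.0, [True, False, True])   # 3.0

# Discounted: later check-ins are worth less than 1, so a large f can outweigh them.
compliant_d = total_reward(0.0, [True, True, True], discount=0.5)   # 0.875
defector_d = total_reward(1.0, [True, True, False], discount=0.5)   # 1.75
```

With no discounting, the most f can do is tie with a fully monitored run. Once discounting is applied, a skipped period costs less than 1, which is the sense in which f can eventually overcome the monitoring term.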
Couldn't the AI end up misaligned with the owners by accident, even if they're aligned with the rest of humanity?
Yes, but as I said earlier, I'm assuming the alignment problem has already been solved when talking about enforcement. I am not proposing enforcement as a solution to alignment.
If you haven't solved the alignment problem, enforcement doesn't help much, because you can't rely on your AI-enabled police to help catch the AI-enabled criminals, because the police AI itself may not be aligned with the police.
The question is whether 1 or 2 is better at aligning the AI in cases where enforcement is impossible or explicitly prevented.
Case 2 is assuming that you already have an intelligent agent with motivations, and then trying to deal with that after the fact. I agree this is not going to work for alignment. If for some reason I could only do 1 or 2 for alignment, I would try 1. (But there are in fact a bunch of other things that you can do.)
But AI-enabled police would be able to probe actions, infer motives, and detect bad behavior better than humans could. In addition, AI systems could have fewer rights than humans, and could be designed to be more transparent than humans, making the police's job easier.
Isn't most of this after a crime has already been committed? Is that enough if it's an existential risk? To handle this, would we want continuous monitoring of autonomous AIs, at which point aren't we actually just taking their autonomy away?
Also, if we want to automate "detect bad behavior", wouldn't that require AI alignment, too? If we don't fully automate it, then can we be confident that humans can keep up with everything they need to check themselves, given that AIs could work extremely fast? AIs might learn how much work humans can keep up with and then overwhelm them.
Furthermore, AIs may be able to learn new ways of hiding things from the police, so there could be gaps where the police are trying to catch up.
Cullen's argument was "alignment may not be enough, even if you solve alignment you might still want to program your AI to follow the law because <reasons>." So in my responses I've been assuming that we have solved alignment; I'm arguing that after solving alignment, AI-powered enforcement will probably be enough to handle the problems Cullen is talking about. Some quotes from Cullen's comment (emphasis mine):
Reasons other than directly getting value alignment from law that you might want to program AI to follow the law
We will presumably want organizations with AI to be bound by law.
We don't want to rely on the incentives of human principals to ensure their agents advance their goals in purely legal ways
Some responses to your comments:
if we want to automate "detect bad behavior", wouldn't that require AI alignment, too?
Yes, I'm assuming we've solved alignment here.
Isn't most of this after a crime has already been committed?
Good enforcement is also a deterrent against crime (someone without any qualms about murder will still usually not murder because of the harsh penalties and chance of being caught).
Furthermore, AIs may be able to learn new ways of hiding things from the police, so there could be gaps where the police are trying to catch up.
Remember that the police are also AI-enabled, and can find new ways of detecting things. Even so, this is possible, but it's also possible today, without AI: criminals presumably constantly find new ways of hiding things from the police.