Imagine trying to make teenagers law-abiding. You could have two strategies:
1. Rewire the neurons or learning algorithm in their brain such that you can say “the computation done to produce the output of neuron X reliably tracks whether a law has been violated, and because of its connection via neuron Y to neuron Z, if an action is predicted to violate a law, the teenager won’t take it”.
2. Explain to them what the laws are (relying on their existing ability to understand English, albeit fuzzily), and give them incentives to follow them.
I feel much better about 2 than 1.
What if they also have access to nukes or other weapons that could prevent them or their owners from being held accountable if they’re used?
EDIT: Hmm, maybe they need strong incentives to check in with law enforcement periodically? This reward would be bounded per interval of time, and also (much) greater in absolute value than any other reward they could get per period.
What if they also have access to nukes or other weapons that could prevent them or their owners from being held accountable if they’re used?
I’m going to interpret this as:
Assume that the owners are misaligned w.r.t the rest of humanity (controversial, to me at least).
Assume that enforcement is impossible.
Under these assumptions, I feel better about 1 than 2, in the sense that case 1 feels like a ~5% chance of success while case 2 feels like a ~0% chance of success. (Numbers made up of course.)
But this seems like a pretty low-probability way the world could be (I would bet against both assumptions), and the increase in EV from work on it seems pretty low (since you only get 5% chance of success), so it doesn’t seem like a strong argument to focus on case 1.
Assume that the owners are misaligned w.r.t the rest of humanity (controversial, to me at least).
Couldn’t the AI end up misaligned with the owners by accident, even if they’re aligned with the rest of humanity? The question is whether 1 or 2 is better at aligning the AI in cases where enforcement is impossible or explicitly prevented.
I edited my comment above before I got your reply to include the possibility of the AI being incentivized to ensure it gets monitored by law enforcement. Its reward function could look like
$$f(x) + \sum_{i=1}^{\infty} I_{M_i}(x)$$
where f is bounded to have a range of length ≤ 1, and I_{M_i}(x) is 1 if the AI is monitored by law enforcement in period i (and passes some test) and 0 otherwise. To keep the right-hand sum from evaluating to infinity (which would let f be ignored, e.g. if the AI predicts its expected lifetime to be infinite), you could put an upper bound on the number of periods or use discounting; but either fix means that in later periods f can eventually overcome the I_{M_i} terms.
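As a minimal sketch of this reward structure (all names here are hypothetical illustrations, not a real proposal): a task reward f clipped to a unit range, plus a discounted sum of monitoring indicators. The discount factor gamma keeps the sum finite, but it also means the per-period monitoring bonus gamma^i eventually drops below the range of f, which is exactly the trade-off noted above.

```python
def bounded_f(x):
    # Placeholder task reward, clipped so its range has length <= 1.
    return max(0.0, min(1.0, x))

def monitoring_reward(passed_checks, gamma=0.99):
    """Discounted sum of indicators I_{M_i}: each entry of passed_checks
    is True if the AI was monitored by law enforcement (and passed the
    test) in period i, else False. Discounting bounds the total at
    1 / (1 - gamma), so the sum cannot evaluate to infinity."""
    return sum(gamma ** i for i, passed in enumerate(passed_checks) if passed)

def total_reward(x, passed_checks, gamma=0.99):
    # f(x) + sum_i gamma^i * I_{M_i}(x), the reward described above.
    return bounded_f(x) + monitoring_reward(passed_checks, gamma)
```

Note that once gamma^i < 1 (the length of f's range), a single period's task reward can outweigh that period's monitoring bonus, so the incentive to stay monitored weakens over time.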
Couldn’t the AI end up misaligned with the owners by accident, even if they’re aligned with the rest of humanity?
Yes, but as I said earlier, I’m assuming the alignment problem has already been solved when talking about enforcement. I am not proposing enforcement as a solution to alignment.
If you haven’t solved the alignment problem, enforcement doesn’t help much, because you can’t rely on your AI-enabled police to help catch the AI-enabled criminals, because the police AI itself may not be aligned with the police.
The question is whether 1 or 2 is better at aligning the AI in cases where enforcement is impossible or explicitly prevented.
Case 2 is assuming that you already have an intelligent agent with motivations, and then trying to deal with that after the fact. I agree this is not going to work for alignment. If for some reason I could only do 1 or 2 for alignment, I would try 1. (But there are in fact a bunch of other things that you can do.)