Thanks for this detailed response. I appreciate getting the opportunity to discuss this in depth.
What Eliezer’s saying here is that current ML doesn’t have a way to point the system’s goals at specific physical objects in the world. Sufficiently advanced AI will end up knowing that the physical objects exist (i.e., it will incorporate those things into its beliefs), but this is different from getting a specific programmer-intended concept into the goal.
I’m not sure whether I have misunderstood, but doesn’t this imply that advanced AI cannot (eg) maximise the number of paperclips (or granite spheres) in the world (even though it can know that paperclips exist and what they are)? If I gave an AI the aim of ‘kill all humans’ then don’t the system’s goals point at objects in the world? Since you think that it is almost certain that AGI will kill all humans as an intermediate goal for any ultimate goal we give it, doesn’t that mean it would be straightforward to give AIs the goal of ‘kill all humans’?
I don’t really get how there can be such a firm dividing line between understanding the world and having motivations that are faithful to the intentions of the programmer. If a system can understand the world really well, it can eg understand what pleasure is really well. Why then would it be extremely difficult to get it to optimise the amount of pleasure in the world? Could we test a system out for ages asking it to correctly identify improvements in total welfare, and then once we have tested it for ages put it out in the world? I still don’t really get why this would with ~100% probability kill everyone.
The key point in the argument in 21 seems to be:
In contrast, when it comes to a choice of utility function, there are unbounded degrees of freedom and multiple reflectively coherent fixpoints. Reality doesn’t ‘hit back’ against things that are locally aligned with the loss function on a particular range of test cases, but globally misaligned on a wider range of test cases.
The first sentence seems like a non-sequitur and I’m not sure why it is relevant to the argument. Of course there are unboundedly many utility functions that programmers could give AIs. On the second sentence, it is true that reality doesn’t hit back against things that are locally aligned on test cases but globally misaligned on the broader set of test cases. But I take it what the argument is trying to defend is the proposition “we are extremely likely to make a system that is locally aligned in test cases but globally misaligned”. This argument doesn’t tell us anything about whether this proposition is true, it just tells us that if systems are locally aligned in test cases and globally misaligned, then they’ll get past our current safety testing.
I agree that AGIs with the goal of maximising granite spheres and things like that would kill everyone or do something very bad. The harder cases is where you give an AI a welfarist goal.
An important note in passing. At the start, Eliezer defines alignment as “>0 people survive” but in the remainder of the piece, he often seems to refer to alignment as the more prosaic ‘alignment with the intent of the programmer’. I find this ambiguity pops up a lot in AI safety writing.
“What Eliezer’s saying here is that current ML doesn’t have a way to point the system’s goals at specific physical objects in the world. Sufficiently advanced AI will end up knowing that the physical objects exist (i.e., it will incorporate those things into its beliefs), but this is different from getting a specific programmer-intended concept into the goal.”
I’m not sure whether I have misunderstood, but doesn’t this imply that advanced AI cannot (eg) maximise the number of paperclips (or granite spheres) in the world (even though it can know that paperclips exist and what they are)?
No; this is why I said “current ML doesn’t have a way to point the system’s goals at specific physical objects in the world”, and why I said “getting a specific programmer-intended concept into the goal”.
The central difficulty isn’t ‘getting the AGI to instrumentally care about the world’s state’ or even ‘getting the AGI to terminally care about the world’s state’. (I don’t know how one would do the latter with any confidence, but maybe there’s some easy hack.)
Instead, the central difficulty is ‘getting the AGI to terminally care about a specific thing, as opposed to something relatively random’.
If we could build an AGI that we knew in advance, with confidence, would specifically optimize for the number of paperclips in the universe and nothing else, then that would mean that we’ve probably solved most of the alignment problem. It’s not necessarily a huge leap from this to saving the world.
The problem is that we don’t know how to do that, so AGI will instead (by default) end up with some random unintended goal. When I mentioned ‘paperclips’, ‘granite spheres’, etc. in my previous comments, I was using these as stand-ins for ‘random goals that have little to do with human flourishing’. I wasn’t saying we know how to specifically aim an AGI at paperclips, or at granite spheres, on purpose. If we could, that would be a totally different ball game.
If I gave an AI the aim of ‘kill all humans’ then don’t the system’s goals point at objects in the world? Since you think that it is almost certain that AGI will kill all humans as an intermediate goal for any ultimate goal we give it, doesn’t that mean it would be straightforward to give AIs the goal of ‘kill all humans’?
The instrumental convergence thesis implies that it’s straightforward, if you know how to build AGI at all, to build an AGI that has the instrumental strategy ‘kill all humans’ (if any humans exist in its environment).
This doesn’t transfer over to ‘we know how to robustly build AGI that has humane values’, because (a) humane values aren’t a convergent instrumental strategy, and (b) we only know how to build AGIs that pursue convergent instrumental strategies with high probability, not how to build AGIs that pursue arbitrary goals with high probability.
But yes, if ‘kill all humans’ or ‘acquire resources’ or ‘make an AGI that’s very smart’ or ‘make an AGI that protects itself from being destroyed’ were the only thing we wanted from AGI, then the problem would already be solved.
Could we test a system out for ages asking it to correctly identify improvements in total welfare, and then once we have tested it for ages put it out in the world?
No, because (e.g.) a deceptive agent that is “playing nice” will be just as able to answer those questions well. There isn’t an external behavioral test that reliably distinguishes deceptive agents from genuinely friendly ones; and most agents are unfriendly/deceptive, so the prior is strongly that you’ll get those before you get real friendliness.
This doesn’t mean that it’s impossible to get real friendliness, but it means that you’ll need some method other than just looking at external behaviors in order to achieve friendliness.
This argument doesn’t tell us anything about whether this proposition is true, it just tells us that if systems are locally aligned in test cases and globally misaligned, then they’ll get past our current safety testing.
The paragraph you quoted isn’t talking about safety testing. It’s saying ‘gradient-descent-ish processes that score sufficiently well on almost any highly rich, real-world task will tend to converge on similar core capabilities, because these core capabilities are relatively simple and broadly useful for many tasks’, plus ‘there isn’t an analogous process pushing arbitrary well-performing gradient-descent-ish processes toward being human-friendly’.
An important note in passing. At the start, Eliezer defines alignment as “>0 people survive” but in the remainder of the piece, he often seems to refer to alignment as the more prosaic ‘alignment with the intent of the programmer’. I find this ambiguity pops up a lot in AI safety writing.
He says “So far as I’m concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering task, with a less than fifty percent change of killing more than one billion people, I’ll take it.” The “carries out some pivotal superhuman engineering task” is important too. This part, and the part where the AGI somehow respects the programmer’s “don’t kill people” goal, connects the two phrasings.
Hi Rob,
Thanks for this detailed response. I appreciate getting the opportunity to discuss this in depth.
I’m not sure whether I have misunderstood, but doesn’t this imply that advanced AI cannot (eg) maximise the number of paperclips (or granite spheres) in the world (even though it can know that paperclips exist and what they are)? If I gave an AI the aim of ‘kill all humans’ then don’t the system’s goals point at objects in the world? Since you think that it is almost certain that AGI will kill all humans as an intermediate goal for any ultimate goal we give it, doesn’t that mean it would be straightforward to give AIs the goal of ‘kill all humans’?
I don’t really get how there can be such a firm dividing line between understanding the world and having motivations that are faithful to the intentions of the programmer. If a system can understand the world really well, it can eg understand what pleasure is really well. Why then would it be extremely difficult to get it to optimise the amount of pleasure in the world? Could we test a system out for ages asking it to correctly identify improvements in total welfare, and then once we have tested it for ages put it out in the world? I still don’t really get why this would with ~100% probability kill everyone.
The key point in the argument in 21 seems to be:
The first sentence seems like a non-sequitur and I’m not sure why it is relevant to the argument. Of course there are unboundedly many utility functions that programmers could give AIs. On the second sentence, it is true that reality doesn’t hit back against things that are locally aligned on test cases but globally misaligned on the broader set of test cases. But I take it what the argument is trying to defend is the proposition “we are extremely likely to make a system that is locally aligned in test cases but globally misaligned”. This argument doesn’t tell us anything about whether this proposition is true, it just tells us that if systems are locally aligned in test cases and globally misaligned, then they’ll get past our current safety testing.
I agree that AGIs with the goal of maximising granite spheres and things like that would kill everyone or do something very bad. The harder cases is where you give an AI a welfarist goal.
An important note in passing. At the start, Eliezer defines alignment as “>0 people survive” but in the remainder of the piece, he often seems to refer to alignment as the more prosaic ‘alignment with the intent of the programmer’. I find this ambiguity pops up a lot in AI safety writing.
No; this is why I said “current ML doesn’t have a way to point the system’s goals at specific physical objects in the world”, and why I said “getting a specific programmer-intended concept into the goal”.
The central difficulty isn’t ‘getting the AGI to instrumentally care about the world’s state’ or even ‘getting the AGI to terminally care about the world’s state’. (I don’t know how one would do the latter with any confidence, but maybe there’s some easy hack.)
Instead, the central difficulty is ‘getting the AGI to terminally care about a specific thing, as opposed to something relatively random’.
If we could build an AGI that we knew in advance, with confidence, would specifically optimize for the number of paperclips in the universe and nothing else, then that would mean that we’ve probably solved most of the alignment problem. It’s not necessarily a huge leap from this to saving the world.
The problem is that we don’t know how to do that, so AGI will instead (by default) end up with some random unintended goal. When I mentioned ‘paperclips’, ‘granite spheres’, etc. in my previous comments, I was using these as stand-ins for ‘random goals that have little to do with human flourishing’. I wasn’t saying we know how to specifically aim an AGI at paperclips, or at granite spheres, on purpose. If we could, that would be a totally different ball game.
The instrumental convergence thesis implies that it’s straightforward, if you know how to build AGI at all, to build an AGI that has the instrumental strategy ‘kill all humans’ (if any humans exist in its environment).
This doesn’t transfer over to ‘we know how to robustly build AGI that has humane values’, because (a) humane values aren’t a convergent instrumental strategy, and (b) we only know how to build AGIs that pursue convergent instrumental strategies with high probability, not how to build AGIs that pursue arbitrary goals with high probability.
But yes, if ‘kill all humans’ or ‘acquire resources’ or ‘make an AGI that’s very smart’ or ‘make an AGI that protects itself from being destroyed’ were the only thing we wanted from AGI, then the problem would already be solved.
No, because (e.g.) a deceptive agent that is “playing nice” will be just as able to answer those questions well. There isn’t an external behavioral test that reliably distinguishes deceptive agents from genuinely friendly ones; and most agents are unfriendly/deceptive, so the prior is strongly that you’ll get those before you get real friendliness.
This doesn’t mean that it’s impossible to get real friendliness, but it means that you’ll need some method other than just looking at external behaviors in order to achieve friendliness.
The paragraph you quoted isn’t talking about safety testing. It’s saying ‘gradient-descent-ish processes that score sufficiently well on almost any highly rich, real-world task will tend to converge on similar core capabilities, because these core capabilities are relatively simple and broadly useful for many tasks’, plus ‘there isn’t an analogous process pushing arbitrary well-performing gradient-descent-ish processes toward being human-friendly’.
He says “So far as I’m concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering task, with a less than fifty percent change of killing more than one billion people, I’ll take it.” The “carries out some pivotal superhuman engineering task” is important too. This part, and the part where the AGI somehow respects the programmer’s “don’t kill people” goal, connects the two phrasings.