“What Eliezer’s saying here is that current ML doesn’t have a way to point the system’s goals at specific physical objects in the world. Sufficiently advanced AI will end up knowing that the physical objects exist (i.e., it will incorporate those things into its beliefs), but this is different from getting a specific programmer-intended concept into the goal.”
I’m not sure whether I have misunderstood, but doesn’t this imply that advanced AI cannot (eg) maximise the number of paperclips (or granite spheres) in the world (even though it can know that paperclips exist and what they are)?
No; this is why I said “current ML doesn’t have a way to point the system’s goals at specific physical objects in the world”, and why I said “getting a specific programmer-intended concept into the goal”.
The central difficulty isn’t ‘getting the AGI to instrumentally care about the world’s state’ or even ‘getting the AGI to terminally care about the world’s state’. (I don’t know how one would do the latter with any confidence, but maybe there’s some easy hack.)
Instead, the central difficulty is ‘getting the AGI to terminally care about a specific thing, as opposed to something relatively random’.
If we could build an AGI that we knew in advance, with confidence, would specifically optimize for the number of paperclips in the universe and nothing else, then that would mean that we’ve probably solved most of the alignment problem. It’s not necessarily a huge leap from this to saving the world.
The problem is that we don’t know how to do that, so AGI will instead (by default) end up with some random unintended goal. When I mentioned ‘paperclips’, ‘granite spheres’, etc. in my previous comments, I was using these as stand-ins for ‘random goals that have little to do with human flourishing’. I wasn’t saying we know how to specifically aim an AGI at paperclips, or at granite spheres, on purpose. If we could, that would be a totally different ball game.
If I gave an AI the aim of ‘kill all humans’ then don’t the system’s goals point at objects in the world? Since you think that it is almost certain that AGI will kill all humans as an intermediate goal for any ultimate goal we give it, doesn’t that mean it would be straightforward to give AIs the goal of ‘kill all humans’?
The instrumental convergence thesis implies that it’s straightforward, if you know how to build AGI at all, to build an AGI that has the instrumental strategy ‘kill all humans’ (if any humans exist in its environment).
This doesn’t transfer over to ‘we know how to robustly build AGI that has humane values’, because (a) humane values aren’t a convergent instrumental strategy, and (b) we only know how to build AGIs that pursue convergent instrumental strategies with high probability, not how to build AGIs that pursue arbitrary goals with high probability.
But yes, if ‘kill all humans’ or ‘acquire resources’ or ‘make an AGI that’s very smart’ or ‘make an AGI that protects itself from being destroyed’ were the only thing we wanted from AGI, then the problem would already be solved.
Could we test a system out for ages asking it to correctly identify improvements in total welfare, and then once we have tested it for ages put it out in the world?
No, because (e.g.) a deceptive agent that is “playing nice” will be just as able to answer those questions well. There isn’t an external behavioral test that reliably distinguishes deceptive agents from genuinely friendly ones; and most agents are unfriendly/deceptive, so the prior is strongly that you’ll get those before you get real friendliness.
This doesn’t mean that it’s impossible to get real friendliness, but it means that you’ll need some method other than just looking at external behaviors in order to achieve friendliness.
This argument doesn’t tell us anything about whether this proposition is true, it just tells us that if systems are locally aligned in test cases and globally misaligned, then they’ll get past our current safety testing.
The paragraph you quoted isn’t talking about safety testing. It’s saying ‘gradient-descent-ish processes that score sufficiently well on almost any highly rich, real-world task will tend to converge on similar core capabilities, because these core capabilities are relatively simple and broadly useful for many tasks’, plus ‘there isn’t an analogous process pushing arbitrary well-performing gradient-descent-ish processes toward being human-friendly’.
An important note in passing. At the start, Eliezer defines alignment as “>0 people survive” but in the remainder of the piece, he often seems to refer to alignment as the more prosaic ‘alignment with the intent of the programmer’. I find this ambiguity pops up a lot in AI safety writing.
He says “So far as I’m concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering task, with a less than fifty percent change of killing more than one billion people, I’ll take it.” The “carries out some pivotal superhuman engineering task” is important too. This part, and the part where the AGI somehow respects the programmer’s “don’t kill people” goal, connects the two phrasings.
No; this is why I said “current ML doesn’t have a way to point the system’s goals at specific physical objects in the world”, and why I said “getting a specific programmer-intended concept into the goal”.
The central difficulty isn’t ‘getting the AGI to instrumentally care about the world’s state’ or even ‘getting the AGI to terminally care about the world’s state’. (I don’t know how one would do the latter with any confidence, but maybe there’s some easy hack.)
Instead, the central difficulty is ‘getting the AGI to terminally care about a specific thing, as opposed to something relatively random’.
If we could build an AGI that we knew in advance, with confidence, would specifically optimize for the number of paperclips in the universe and nothing else, then that would mean that we’ve probably solved most of the alignment problem. It’s not necessarily a huge leap from this to saving the world.
The problem is that we don’t know how to do that, so AGI will instead (by default) end up with some random unintended goal. When I mentioned ‘paperclips’, ‘granite spheres’, etc. in my previous comments, I was using these as stand-ins for ‘random goals that have little to do with human flourishing’. I wasn’t saying we know how to specifically aim an AGI at paperclips, or at granite spheres, on purpose. If we could, that would be a totally different ball game.
The instrumental convergence thesis implies that it’s straightforward, if you know how to build AGI at all, to build an AGI that has the instrumental strategy ‘kill all humans’ (if any humans exist in its environment).
This doesn’t transfer over to ‘we know how to robustly build AGI that has humane values’, because (a) humane values aren’t a convergent instrumental strategy, and (b) we only know how to build AGIs that pursue convergent instrumental strategies with high probability, not how to build AGIs that pursue arbitrary goals with high probability.
But yes, if ‘kill all humans’ or ‘acquire resources’ or ‘make an AGI that’s very smart’ or ‘make an AGI that protects itself from being destroyed’ were the only thing we wanted from AGI, then the problem would already be solved.
No, because (e.g.) a deceptive agent that is “playing nice” will be just as able to answer those questions well. There isn’t an external behavioral test that reliably distinguishes deceptive agents from genuinely friendly ones; and most agents are unfriendly/deceptive, so the prior is strongly that you’ll get those before you get real friendliness.
This doesn’t mean that it’s impossible to get real friendliness, but it means that you’ll need some method other than just looking at external behaviors in order to achieve friendliness.
The paragraph you quoted isn’t talking about safety testing. It’s saying ‘gradient-descent-ish processes that score sufficiently well on almost any highly rich, real-world task will tend to converge on similar core capabilities, because these core capabilities are relatively simple and broadly useful for many tasks’, plus ‘there isn’t an analogous process pushing arbitrary well-performing gradient-descent-ish processes toward being human-friendly’.
He says “So far as I’m concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering task, with a less than fifty percent change of killing more than one billion people, I’ll take it.” The “carries out some pivotal superhuman engineering task” is important too. This part, and the part where the AGI somehow respects the programmer’s “don’t kill people” goal, connects the two phrasings.