Thanks for the questions, John! :) I think this is a great place for this kind of discussion.
(I do comms at MIRI, where Eliezer works. I tend to have very Eliezer-ish views of AI risk, though I don’t generally run my comments by Eliezer or other MIRI staff, so it’s always possible I’m saying something Eliezer would disagree with.)
2 You say “More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment—to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward.”
Insofar as I understand this, it seems false. Like, if I designed a driverless car, I think it could be true that it could reliably identify things within the environment, such as dogs, other cars, and pedestrians. Is this what you mean by ‘point out’. It is true that it would learn what these are by sense data and reward but I don’t see why this means that such a system couldn’t reliably identify actual objects in the real world.
What Eliezer’s saying here is that current ML doesn’t have a way to point the system’s goals at specific physical objects in the world. Sufficiently advanced AI will end up knowing that the physical objects exist (i.e., it will incorporate those things into its beliefs), but this is different from getting a specific programmer-intended concept into the goal.
See points 21 (“Capabilities generalize further than alignment once capabilities start to generalize far.”) and 22 (“There is no analogous truth about there being a simple core of alignment”). These are saying that getting AGI to understand things is a lot easier, on paradigms like the current one, than getting it to have a specific intended motivation.
Similarly, if you gave the AI the goal of benefitting only the American people, I don’t understand, from what you have said, why the system would almost definitely kill everyone in the world once it is let out from its training and has to make a distributional shift into the wider world.
The short answer is: we don’t know how to get an AI system’s goals to robustly ‘point at’ objects like ‘the American people’. We don’t even know how to get the goals to robustly point at much simpler physical systems that have crisp, known definitions (e.g., ‘carbon atoms arranged to form diamond’).
Absent such knowledge, we may be able to get an AI system to exhibit ‘American-benefitting-ish behaviors’ in a particular setting. But when you increase the AGI system’s capability, or move to a new setting, this correlation is likely to break, because the vast majority of systems that exhibit superficially ‘American-benefitting-ish behaviors’ in a specific setting will not generalize in the way we implicitly want them to. The space of possible goals is too large and multidimensional for that, and the intuitive human idea of ‘what counts as benefiting Americans’ is too complex and unnatural.
A particularly dangerous example of “the superficially good system won’t generalize in the way we implicitly want it to” is if the AI system is strategically aware and trying to gain influence. In that case, the system may deliberately look more friendly than it is, look more-liable-to-generalize-out-of-distribution than it in fact will, etc.
The simplest way this connects up to ‘death’ is that the system just doesn’t have a goal that remotely resembles what you intended. It has a goal that only correlated with ‘benefit Americans’ under very specific circumstances; or it has a totally unrelated random goal (e.g., ‘maximize the number of granite spheres in the universe’), but had a belief that this goal would be better-served in the near term if its behavior satisfied the programmers.
(This belief is often true, because there are likelier to be more granite spheres in the future if the programmers think you are friendly and useful, because this gives you an avenue for gaining more influence later, and reduces the probability that the programmers will shut you down or change your goals.)
A powerful AGI acting on the world with a goal like ‘maximize the number of granite spheres’ will (with high probability) kill everyone, because (a) humans are potentially threats to its sphere agenda (e.g., we could build a rival superintelligence that has a different goal), and (b) humans are made of raw materials which can be repurposed to build more spheres and infrastructure.
In AI training, we could punish systems strongly for deception to make it strongly disfavoured. Are you saying that deception in training is a 1% chance or a 99% chance? What is the argument for either number?
By default, for sufficiently misaligned smart agents that can think about their operators, more like a 99% chance. The argument for this as a default is 23 (“you can’t bring the coffee if you’re dead”).
If you’re a paperclip maximizer and your operator is a staple maximizer, then you have a strong incentive to find ways to reduce your operator’s influence over the future and increase your own influence, so that there are more paperclips in the future and fewer staples.
“Intervene on the part of the world that is my operator’s beliefs, in ways that increase my influence” is a special case of “intervening on the world in general, in ways that increase my influence”. We shouldn’t generally expect it to be easy to get an AGI to specifically carve out an exception for the former, while freely doing the latter—because “my operator’s brain” is not a simple, crisp, easy-to-formally-specify idea, but also because we don’t know how to robustly point AGI goals at specific ideas even when they are simple, crisp, and easy to formally specify.
See also 8:
“The best and easiest-found-by-optimization algorithms for solving problems we want an AI to solve, readily generalize to problems we’d rather the AI not solve; you can’t build a system that only has the capability to drive red cars and not blue cars, because all red-car-driving algorithms generalize to the capability to drive blue cars.”
You can try to train the system to dislike deception, but this is triply difficult to do because:
it’s hard to train robust goals at all;
it should be even harder to robustly train complex, value-laden goals that we have a fuzzy sense of but don’t know how to crisply define; and most importantly
we’re actively pushing against the default incentives most possible systems have.
The latter point is discussed more in 24.2:
“The second thing looks unworkable (less so than CEV, but still lethally unworkable) because corrigibility runs actively counter to instrumentally convergent behaviors within a core of general intelligence (the capability that generalizes far out of its original distribution). You’re not trying to make it have an opinion on something the core was previously neutral on. You’re trying to take a system implicitly trained on lots of arithmetic problems until its machinery started to reflect the common coherent core of arithmetic, and get it to say that as a special case 222 + 222 = 555. You can maybe train something to do this in a particular training distribution, but it’s incredibly likely to break when you present it with new math problems far outside that training distribution, on a system which successfully generalizes capabilities that far at all.”
‘Don’t deceive your operators, even when you aren’t perfectly aligned with your operator and have different goals from them’ is an example of a corrigible behavior. This goal is like ’222 + 222 = 555′ because it locally violates the ‘you can’t get the coffee if you’re dead’ principle (as a special case of the principle ‘you’re likelier to get the coffee insofar as you have more influence over the future’).
We’re trying to get the system to generally be smart, useful, and strategic about some domain, but trying to get it not to understand, or not to care about, one of the most basic strategic implications of multi-agent scenarios: that when two agents have different goals, each agent will better achieve its goals if it gains control and the other agent loses control. This should be possible in principle, but on the face of it, it looks difficult.
you say “By default, for sufficiently misaligned smart agents that can think about their operators, more like a 99% chance.” I agree that badly misaligned smart agents are likely to try to deceive their operators. But I was discussing the following proposition: “among advanced AI systems that we might plausibly make, there is a 99% chance of deception”. Your claim is about the subset of misaligned agents, not how likely we are to produce misaligned agents (that might deceive us)
I take it that 23 shows that all systems have incentives not to be turned off. I don’t think this shows that there is a 99% chance that AI systems will deceive their programmers.
Thanks for the three point argument, that is clarifying. I agree that if those premises are true, then we should expect AI systems to seek power over human operators who might try to turn them off or change their goals. If the goal is something like ‘increase total human welfare’ and the AI has a different idea about that than its operator, then the AI will try to disempower the operator in one way or another. But I’m not sure I see why this is necessarily a bad outcome. The AI might still be good at advancing human welfare even if human operators are disempowered. If so, that seems like a good outcome, from a utilitarian point of view.
This gets back to some of the ambiguity about alignment that pops up in the AI safety literature. I have been informally asking people working on AI what they mean by alignment over the last year, and nearly every answer has been importantly different from any of the others. To me, getting an AI to improve sentient life seems like a good result, even if human controllers are disempowered.
I don’t think this shows that there is a 99% chance that AI systems will deceive their programmers.
Agreed. I wasn’t trying to argue for a specific probability assignment; that seems hard, and it seems harder to reach extreme probabilities if you’re new to the field and haven’t searched around for counter-arguments, counter-counter-arguments, etc.
The AI might still be good at advancing human welfare even if human operators are disempowered. If so, that seems like a good outcome, from a utilitarian point of view.
In the vast majority of ‘AGI with a random goal trying to deceiving you’ scenarios, I think the random goal produces outcomes like paperclips, rather than ‘sort-of-good’ outcomes.
I think the same in the case of ‘AGI with a goal sort-of related to advancing human welfare in the training set’, though the argument for this is less obvious.
I think Complex Value Systems are Required to Realize Valuable Futures is a good overview: human values are highly multidimensional, and in such a way that there are many different dimensions where a slightly wrong answer can lose you all of the value. Structurally like a combination lock, where getting 9⁄10 of the numbers correct gets you 0% of the value but getting 10⁄10 right gets you 100% of the value.
A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer’s apprentice, or King Midas: you get exactly what you ask for, not what you want.
Goodhart’s Curse in this form says that a powerful agent neutrally optimizing a proxy measure U that we hoped to align with true values V, will implicitly seek out upward divergences of U from V.
In other words: powerfully optimizing for a utility function is strongly liable to blow up anything we’d regard as an error in defining that utility function.
[...] Suppose the humans have true values V. We try to convey these values to a powerful AI, via some value learning methodology that ends up giving the AI a utility function U.
Even if U is locally an unbiased estimator of V, optimizing U will seek out what we would regard as ‘errors in the definition’, places where U diverges upward from V. Optimizing for a high U may implicitly seek out regions where U—V is high; that is, places where V is lower than U. This may especially include regions of the outcome space or policy space where the value learning system was subject to great variance; that is, places where the value learning worked poorly or ran into a snag.
Goodhart’s Curse would be expected to grow worse as the AI became more powerful. A more powerful AI would be implicitly searching a larger space and would have more opportunity to uncover what we’d regard as “errors”; it would be able to find smaller loopholes, blow up more minor flaws.
[...] We could see the genie as implicitly or emergently seeking out any possible loophole in the wish: Not because it is an evil genie that knows our ‘truly intended’ V and is looking for some place that V can be minimized while appearing to satisfy U; but just because the genie is neutrally seeking out very large values of U and these are places where it is unusually likely that U diverged upward from V.
So part of the issue is that human values inherently require getting a lot of bits correct simultaneously, in order to produce any value. (And also, getting a lot of the bits right while getting a few wrong can pose serious s-risks.)
Another part of the problem is that powerfully optimizing one value will tend to crowd out other values.
And a third part of the problem is that insofar as there are flaws in our specification of what we value, AGI is likely to disproportionately seek out and exploit those flaws, since “places where our specification of what’s good was wrong” are especially likely to include more “places where you can score extremely high on the specification”.
To me, getting an AI to improve sentient life seems like a good result, even if human controllers are disempowered.
Agreed! If I thought a misaligned AGI were likely to produce an awesome flourishing civilization (but kill humans in the process), I would be vastly less worried. By far the main reason I’m worried is that I expect misaligned AGI to produce things morally equivalent to “granite spheres” instead.
Thanks for this and sorry for the slow reply. ok, great so your earlier thought was that even if we tried to give the AI welfarist goals, it would most likely end up with some random goal like optimising granite spheres or paperclips?
I will give the resources you shared a read. Thanks for the interesting discussion!
Thanks for this detailed response. I appreciate getting the opportunity to discuss this in depth.
What Eliezer’s saying here is that current ML doesn’t have a way to point the system’s goals at specific physical objects in the world. Sufficiently advanced AI will end up knowing that the physical objects exist (i.e., it will incorporate those things into its beliefs), but this is different from getting a specific programmer-intended concept into the goal.
I’m not sure whether I have misunderstood, but doesn’t this imply that advanced AI cannot (eg) maximise the number of paperclips (or granite spheres) in the world (even though it can know that paperclips exist and what they are)? If I gave an AI the aim of ‘kill all humans’ then don’t the system’s goals point at objects in the world? Since you think that it is almost certain that AGI will kill all humans as an intermediate goal for any ultimate goal we give it, doesn’t that mean it would be straightforward to give AIs the goal of ‘kill all humans’?
I don’t really get how there can be such a firm dividing line between understanding the world and having motivations that are faithful to the intentions of the programmer. If a system can understand the world really well, it can eg understand what pleasure is really well. Why then would it be extremely difficult to get it to optimise the amount of pleasure in the world? Could we test a system out for ages asking it to correctly identify improvements in total welfare, and then once we have tested it for ages put it out in the world? I still don’t really get why this would with ~100% probability kill everyone.
The key point in the argument in 21 seems to be:
In contrast, when it comes to a choice of utility function, there are unbounded degrees of freedom and multiple reflectively coherent fixpoints. Reality doesn’t ‘hit back’ against things that are locally aligned with the loss function on a particular range of test cases, but globally misaligned on a wider range of test cases.
The first sentence seems like a non-sequitur and I’m not sure why it is relevant to the argument. Of course there are unboundedly many utility functions that programmers could give AIs. On the second sentence, it is true that reality doesn’t hit back against things that are locally aligned on test cases but globally misaligned on the broader set of test cases. But I take it what the argument is trying to defend is the proposition “we are extremely likely to make a system that is locally aligned in test cases but globally misaligned”. This argument doesn’t tell us anything about whether this proposition is true, it just tells us that if systems are locally aligned in test cases and globally misaligned, then they’ll get past our current safety testing.
I agree that AGIs with the goal of maximising granite spheres and things like that would kill everyone or do something very bad. The harder cases is where you give an AI a welfarist goal.
An important note in passing. At the start, Eliezer defines alignment as “>0 people survive” but in the remainder of the piece, he often seems to refer to alignment as the more prosaic ‘alignment with the intent of the programmer’. I find this ambiguity pops up a lot in AI safety writing.
“What Eliezer’s saying here is that current ML doesn’t have a way to point the system’s goals at specific physical objects in the world. Sufficiently advanced AI will end up knowing that the physical objects exist (i.e., it will incorporate those things into its beliefs), but this is different from getting a specific programmer-intended concept into the goal.”
I’m not sure whether I have misunderstood, but doesn’t this imply that advanced AI cannot (eg) maximise the number of paperclips (or granite spheres) in the world (even though it can know that paperclips exist and what they are)?
No; this is why I said “current ML doesn’t have a way to point the system’s goals at specific physical objects in the world”, and why I said “getting a specific programmer-intended concept into the goal”.
The central difficulty isn’t ‘getting the AGI to instrumentally care about the world’s state’ or even ‘getting the AGI to terminally care about the world’s state’. (I don’t know how one would do the latter with any confidence, but maybe there’s some easy hack.)
Instead, the central difficulty is ‘getting the AGI to terminally care about a specific thing, as opposed to something relatively random’.
If we could build an AGI that we knew in advance, with confidence, would specifically optimize for the number of paperclips in the universe and nothing else, then that would mean that we’ve probably solved most of the alignment problem. It’s not necessarily a huge leap from this to saving the world.
The problem is that we don’t know how to do that, so AGI will instead (by default) end up with some random unintended goal. When I mentioned ‘paperclips’, ‘granite spheres’, etc. in my previous comments, I was using these as stand-ins for ‘random goals that have little to do with human flourishing’. I wasn’t saying we know how to specifically aim an AGI at paperclips, or at granite spheres, on purpose. If we could, that would be a totally different ball game.
If I gave an AI the aim of ‘kill all humans’ then don’t the system’s goals point at objects in the world? Since you think that it is almost certain that AGI will kill all humans as an intermediate goal for any ultimate goal we give it, doesn’t that mean it would be straightforward to give AIs the goal of ‘kill all humans’?
The instrumental convergence thesis implies that it’s straightforward, if you know how to build AGI at all, to build an AGI that has the instrumental strategy ‘kill all humans’ (if any humans exist in its environment).
This doesn’t transfer over to ‘we know how to robustly build AGI that has humane values’, because (a) humane values aren’t a convergent instrumental strategy, and (b) we only know how to build AGIs that pursue convergent instrumental strategies with high probability, not how to build AGIs that pursue arbitrary goals with high probability.
But yes, if ‘kill all humans’ or ‘acquire resources’ or ‘make an AGI that’s very smart’ or ‘make an AGI that protects itself from being destroyed’ were the only thing we wanted from AGI, then the problem would already be solved.
Could we test a system out for ages asking it to correctly identify improvements in total welfare, and then once we have tested it for ages put it out in the world?
No, because (e.g.) a deceptive agent that is “playing nice” will be just as able to answer those questions well. There isn’t an external behavioral test that reliably distinguishes deceptive agents from genuinely friendly ones; and most agents are unfriendly/deceptive, so the prior is strongly that you’ll get those before you get real friendliness.
This doesn’t mean that it’s impossible to get real friendliness, but it means that you’ll need some method other than just looking at external behaviors in order to achieve friendliness.
This argument doesn’t tell us anything about whether this proposition is true, it just tells us that if systems are locally aligned in test cases and globally misaligned, then they’ll get past our current safety testing.
The paragraph you quoted isn’t talking about safety testing. It’s saying ‘gradient-descent-ish processes that score sufficiently well on almost any highly rich, real-world task will tend to converge on similar core capabilities, because these core capabilities are relatively simple and broadly useful for many tasks’, plus ‘there isn’t an analogous process pushing arbitrary well-performing gradient-descent-ish processes toward being human-friendly’.
An important note in passing. At the start, Eliezer defines alignment as “>0 people survive” but in the remainder of the piece, he often seems to refer to alignment as the more prosaic ‘alignment with the intent of the programmer’. I find this ambiguity pops up a lot in AI safety writing.
He says “So far as I’m concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering task, with a less than fifty percent change of killing more than one billion people, I’ll take it.” The “carries out some pivotal superhuman engineering task” is important too. This part, and the part where the AGI somehow respects the programmer’s “don’t kill people” goal, connects the two phrasings.
Thanks for the questions, John! :) I think this is a great place for this kind of discussion.
(I do comms at MIRI, where Eliezer works. I tend to have very Eliezer-ish views of AI risk, though I don’t generally run my comments by Eliezer or other MIRI staff, so it’s always possible I’m saying something Eliezer would disagree with.)
What Eliezer’s saying here is that current ML doesn’t have a way to point the system’s goals at specific physical objects in the world. Sufficiently advanced AI will end up knowing that the physical objects exist (i.e., it will incorporate those things into its beliefs), but this is different from getting a specific programmer-intended concept into the goal.
See points 21 (“Capabilities generalize further than alignment once capabilities start to generalize far.”) and 22 (“There is no analogous truth about there being a simple core of alignment”). These are saying that getting AGI to understand things is a lot easier, on paradigms like the current one, than getting it to have a specific intended motivation.
The short answer is: we don’t know how to get an AI system’s goals to robustly ‘point at’ objects like ‘the American people’. We don’t even know how to get the goals to robustly point at much simpler physical systems that have crisp, known definitions (e.g., ‘carbon atoms arranged to form diamond’).
Absent such knowledge, we may be able to get an AI system to exhibit ‘American-benefitting-ish behaviors’ in a particular setting. But when you increase the AGI system’s capability, or move to a new setting, this correlation is likely to break, because the vast majority of systems that exhibit superficially ‘American-benefitting-ish behaviors’ in a specific setting will not generalize in the way we implicitly want them to. The space of possible goals is too large and multidimensional for that, and the intuitive human idea of ‘what counts as benefiting Americans’ is too complex and unnatural.
A particularly dangerous example of “the superficially good system won’t generalize in the way we implicitly want it to” is if the AI system is strategically aware and trying to gain influence. In that case, the system may deliberately look more friendly than it is, look more-liable-to-generalize-out-of-distribution than it in fact will, etc.
The simplest way this connects up to ‘death’ is that the system just doesn’t have a goal that remotely resembles what you intended. It has a goal that only correlated with ‘benefit Americans’ under very specific circumstances; or it has a totally unrelated random goal (e.g., ‘maximize the number of granite spheres in the universe’), but had a belief that this goal would be better-served in the near term if its behavior satisfied the programmers.
(This belief is often true, because there are likelier to be more granite spheres in the future if the programmers think you are friendly and useful, because this gives you an avenue for gaining more influence later, and reduces the probability that the programmers will shut you down or change your goals.)
A powerful AGI acting on the world with a goal like ‘maximize the number of granite spheres’ will (with high probability) kill everyone, because (a) humans are potentially threats to its sphere agenda (e.g., we could build a rival superintelligence that has a different goal), and (b) humans are made of raw materials which can be repurposed to build more spheres and infrastructure.
By default, for sufficiently misaligned smart agents that can think about their operators, more like a 99% chance. The argument for this as a default is 23 (“you can’t bring the coffee if you’re dead”).
If you’re a paperclip maximizer and your operator is a staple maximizer, then you have a strong incentive to find ways to reduce your operator’s influence over the future and increase your own influence, so that there are more paperclips in the future and fewer staples.
“Intervene on the part of the world that is my operator’s beliefs, in ways that increase my influence” is a special case of “intervening on the world in general, in ways that increase my influence”. We shouldn’t generally expect it to be easy to get an AGI to specifically carve out an exception for the former, while freely doing the latter—because “my operator’s brain” is not a simple, crisp, easy-to-formally-specify idea, but also because we don’t know how to robustly point AGI goals at specific ideas even when they are simple, crisp, and easy to formally specify.
See also 8:
“The best and easiest-found-by-optimization algorithms for solving problems we want an AI to solve, readily generalize to problems we’d rather the AI not solve; you can’t build a system that only has the capability to drive red cars and not blue cars, because all red-car-driving algorithms generalize to the capability to drive blue cars.”
You can try to train the system to dislike deception, but this is triply difficult to do because:
it’s hard to train robust goals at all;
it should be even harder to robustly train complex, value-laden goals that we have a fuzzy sense of but don’t know how to crisply define; and most importantly
we’re actively pushing against the default incentives most possible systems have.
The latter point is discussed more in 24.2:
“The second thing looks unworkable (less so than CEV, but still lethally unworkable) because corrigibility runs actively counter to instrumentally convergent behaviors within a core of general intelligence (the capability that generalizes far out of its original distribution). You’re not trying to make it have an opinion on something the core was previously neutral on. You’re trying to take a system implicitly trained on lots of arithmetic problems until its machinery started to reflect the common coherent core of arithmetic, and get it to say that as a special case 222 + 222 = 555. You can maybe train something to do this in a particular training distribution, but it’s incredibly likely to break when you present it with new math problems far outside that training distribution, on a system which successfully generalizes capabilities that far at all.”
‘Don’t deceive your operators, even when you aren’t perfectly aligned with your operator and have different goals from them’ is an example of a corrigible behavior. This goal is like ’222 + 222 = 555′ because it locally violates the ‘you can’t get the coffee if you’re dead’ principle (as a special case of the principle ‘you’re likelier to get the coffee insofar as you have more influence over the future’).
We’re trying to get the system to generally be smart, useful, and strategic about some domain, but trying to get it not to understand, or not to care about, one of the most basic strategic implications of multi-agent scenarios: that when two agents have different goals, each agent will better achieve its goals if it gains control and the other agent loses control. This should be possible in principle, but on the face of it, it looks difficult.
you say “By default, for sufficiently misaligned smart agents that can think about their operators, more like a 99% chance.” I agree that badly misaligned smart agents are likely to try to deceive their operators. But I was discussing the following proposition: “among advanced AI systems that we might plausibly make, there is a 99% chance of deception”. Your claim is about the subset of misaligned agents, not how likely we are to produce misaligned agents (that might deceive us)
I take it that 23 shows that all systems have incentives not to be turned off. I don’t think this shows that there is a 99% chance that AI systems will deceive their programmers.
Thanks for the three point argument, that is clarifying. I agree that if those premises are true, then we should expect AI systems to seek power over human operators who might try to turn them off or change their goals. If the goal is something like ‘increase total human welfare’ and the AI has a different idea about that than its operator, then the AI will try to disempower the operator in one way or another. But I’m not sure I see why this is necessarily a bad outcome. The AI might still be good at advancing human welfare even if human operators are disempowered. If so, that seems like a good outcome, from a utilitarian point of view.
This gets back to some of the ambiguity about alignment that pops up in the AI safety literature. I have been informally asking people working on AI what they mean by alignment over the last year, and nearly every answer has been importantly different from any of the others. To me, getting an AI to improve sentient life seems like a good result, even if human controllers are disempowered.
Holden saw your questions and decided to write a new series to explain.
Agreed. I wasn’t trying to argue for a specific probability assignment; that seems hard, and it seems harder to reach extreme probabilities if you’re new to the field and haven’t searched around for counter-arguments, counter-counter-arguments, etc.
In the vast majority of ‘AGI with a random goal trying to deceiving you’ scenarios, I think the random goal produces outcomes like paperclips, rather than ‘sort-of-good’ outcomes.
I think the same in the case of ‘AGI with a goal sort-of related to advancing human welfare in the training set’, though the argument for this is less obvious.
I think Complex Value Systems are Required to Realize Valuable Futures is a good overview: human values are highly multidimensional, and in such a way that there are many different dimensions where a slightly wrong answer can lose you all of the value. Structurally like a combination lock, where getting 9⁄10 of the numbers correct gets you 0% of the value but getting 10⁄10 right gets you 100% of the value.
Also relevant is Stuart Russell’s point:
And Goodhart’s Curse:
So part of the issue is that human values inherently require getting a lot of bits correct simultaneously, in order to produce any value. (And also, getting a lot of the bits right while getting a few wrong can pose serious s-risks.)
Another part of the problem is that powerfully optimizing one value will tend to crowd out other values.
And a third part of the problem is that insofar as there are flaws in our specification of what we value, AGI is likely to disproportionately seek out and exploit those flaws, since “places where our specification of what’s good was wrong” are especially likely to include more “places where you can score extremely high on the specification”.
Agreed! If I thought a misaligned AGI were likely to produce an awesome flourishing civilization (but kill humans in the process), I would be vastly less worried. By far the main reason I’m worried is that I expect misaligned AGI to produce things morally equivalent to “granite spheres” instead.
Thanks for this and sorry for the slow reply. ok, great so your earlier thought was that even if we tried to give the AI welfarist goals, it would most likely end up with some random goal like optimising granite spheres or paperclips?
I will give the resources you shared a read. Thanks for the interesting discussion!
Hi Rob,
Thanks for this detailed response. I appreciate getting the opportunity to discuss this in depth.
I’m not sure whether I have misunderstood, but doesn’t this imply that advanced AI cannot (eg) maximise the number of paperclips (or granite spheres) in the world (even though it can know that paperclips exist and what they are)? If I gave an AI the aim of ‘kill all humans’ then don’t the system’s goals point at objects in the world? Since you think that it is almost certain that AGI will kill all humans as an intermediate goal for any ultimate goal we give it, doesn’t that mean it would be straightforward to give AIs the goal of ‘kill all humans’?
I don’t really get how there can be such a firm dividing line between understanding the world and having motivations that are faithful to the intentions of the programmer. If a system can understand the world really well, it can eg understand what pleasure is really well. Why then would it be extremely difficult to get it to optimise the amount of pleasure in the world? Could we test a system out for ages asking it to correctly identify improvements in total welfare, and then once we have tested it for ages put it out in the world? I still don’t really get why this would with ~100% probability kill everyone.
The key point in the argument in 21 seems to be:
The first sentence seems like a non-sequitur and I’m not sure why it is relevant to the argument. Of course there are unboundedly many utility functions that programmers could give AIs. On the second sentence, it is true that reality doesn’t hit back against things that are locally aligned on test cases but globally misaligned on the broader set of test cases. But I take it what the argument is trying to defend is the proposition “we are extremely likely to make a system that is locally aligned in test cases but globally misaligned”. This argument doesn’t tell us anything about whether this proposition is true, it just tells us that if systems are locally aligned in test cases and globally misaligned, then they’ll get past our current safety testing.
I agree that AGIs with the goal of maximising granite spheres and things like that would kill everyone or do something very bad. The harder cases is where you give an AI a welfarist goal.
An important note in passing. At the start, Eliezer defines alignment as “>0 people survive” but in the remainder of the piece, he often seems to refer to alignment as the more prosaic ‘alignment with the intent of the programmer’. I find this ambiguity pops up a lot in AI safety writing.
No; this is why I said “current ML doesn’t have a way to point the system’s goals at specific physical objects in the world”, and why I said “getting a specific programmer-intended concept into the goal”.
The central difficulty isn’t ‘getting the AGI to instrumentally care about the world’s state’ or even ‘getting the AGI to terminally care about the world’s state’. (I don’t know how one would do the latter with any confidence, but maybe there’s some easy hack.)
Instead, the central difficulty is ‘getting the AGI to terminally care about a specific thing, as opposed to something relatively random’.
If we could build an AGI that we knew in advance, with confidence, would specifically optimize for the number of paperclips in the universe and nothing else, then that would mean that we’ve probably solved most of the alignment problem. It’s not necessarily a huge leap from this to saving the world.
The problem is that we don’t know how to do that, so AGI will instead (by default) end up with some random unintended goal. When I mentioned ‘paperclips’, ‘granite spheres’, etc. in my previous comments, I was using these as stand-ins for ‘random goals that have little to do with human flourishing’. I wasn’t saying we know how to specifically aim an AGI at paperclips, or at granite spheres, on purpose. If we could, that would be a totally different ball game.
The instrumental convergence thesis implies that it’s straightforward, if you know how to build AGI at all, to build an AGI that has the instrumental strategy ‘kill all humans’ (if any humans exist in its environment).
This doesn’t transfer over to ‘we know how to robustly build AGI that has humane values’, because (a) humane values aren’t a convergent instrumental strategy, and (b) we only know how to build AGIs that pursue convergent instrumental strategies with high probability, not how to build AGIs that pursue arbitrary goals with high probability.
But yes, if ‘kill all humans’ or ‘acquire resources’ or ‘make an AGI that’s very smart’ or ‘make an AGI that protects itself from being destroyed’ were the only thing we wanted from AGI, then the problem would already be solved.
No, because (e.g.) a deceptive agent that is “playing nice” will be just as able to answer those questions well. There isn’t an external behavioral test that reliably distinguishes deceptive agents from genuinely friendly ones; and most agents are unfriendly/deceptive, so the prior is strongly that you’ll get those before you get real friendliness.
This doesn’t mean that it’s impossible to get real friendliness, but it means that you’ll need some method other than just looking at external behaviors in order to achieve friendliness.
The paragraph you quoted isn’t talking about safety testing. It’s saying ‘gradient-descent-ish processes that score sufficiently well on almost any highly rich, real-world task will tend to converge on similar core capabilities, because these core capabilities are relatively simple and broadly useful for many tasks’, plus ‘there isn’t an analogous process pushing arbitrary well-performing gradient-descent-ish processes toward being human-friendly’.
He says “So far as I’m concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering task, with a less than fifty percent change of killing more than one billion people, I’ll take it.” The “carries out some pivotal superhuman engineering task” is important too. This part, and the part where the AGI somehow respects the programmer’s “don’t kill people” goal, connects the two phrasings.