In AI training, we could punish systems heavily for deception to make it strongly disfavoured. Are you saying that deception in training is a 1% chance or a 99% chance? What is the argument for either number?
By default, for sufficiently misaligned smart agents that can think about their operators, more like a 99% chance. The argument for this as a default is 23 (“you can’t bring the coffee if you’re dead”).
If you’re a paperclip maximizer and your operator is a staple maximizer, then you have a strong incentive to find ways to reduce your operator’s influence over the future and increase your own influence, so that there are more paperclips in the future and fewer staples.
“Intervene on the part of the world that is my operator’s beliefs, in ways that increase my influence” is a special case of “intervening on the world in general, in ways that increase my influence”. We shouldn’t generally expect it to be easy to get an AGI to specifically carve out an exception for the former, while freely doing the latter—because “my operator’s brain” is not a simple, crisp, easy-to-formally-specify idea, but also because we don’t know how to robustly point AGI goals at specific ideas even when they are simple, crisp, and easy to formally specify.
See also 8:
“The best and easiest-found-by-optimization algorithms for solving problems we want an AI to solve, readily generalize to problems we’d rather the AI not solve; you can’t build a system that only has the capability to drive red cars and not blue cars, because all red-car-driving algorithms generalize to the capability to drive blue cars.”
You can try to train the system to dislike deception, but this is triply difficult to do because:
it’s hard to train robust goals at all;
it should be even harder to robustly train complex, value-laden goals that we have a fuzzy sense of but don’t know how to crisply define; and most importantly
we’re actively pushing against the default incentives most possible systems have.
The last point is discussed more in 24.2:
“The second thing looks unworkable (less so than CEV, but still lethally unworkable) because corrigibility runs actively counter to instrumentally convergent behaviors within a core of general intelligence (the capability that generalizes far out of its original distribution). You’re not trying to make it have an opinion on something the core was previously neutral on. You’re trying to take a system implicitly trained on lots of arithmetic problems until its machinery started to reflect the common coherent core of arithmetic, and get it to say that as a special case 222 + 222 = 555. You can maybe train something to do this in a particular training distribution, but it’s incredibly likely to break when you present it with new math problems far outside that training distribution, on a system which successfully generalizes capabilities that far at all.”
‘Don’t deceive your operators, even when you aren’t perfectly aligned with your operator and have different goals from them’ is an example of a corrigible behavior. This goal is like ‘222 + 222 = 555’ because it locally violates the ‘you can’t bring the coffee if you’re dead’ principle (as a special case of the principle ‘you’re likelier to get the coffee insofar as you have more influence over the future’).
We’re trying to get the system to generally be smart, useful, and strategic about some domain, but trying to get it not to understand, or not to care about, one of the most basic strategic implications of multi-agent scenarios: that when two agents have different goals, each agent will better achieve its goals if it gains control and the other agent loses control. This should be possible in principle, but on the face of it, it looks difficult.
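The ‘222 + 222 = 555’ analogy can be made concrete with a toy model. Below is a minimal sketch of my own (the least-squares setup is an illustrative assumption, not anything from the quoted text): a learner shown a thousand ordinary addition problems plus the one special case. The pressure toward the coherent core of arithmetic simply averages the exception away.

```python
# Toy sketch (my own construction): train the simplest hypothesis on ordinary
# addition data plus one special-case label (222 + 222 = 555), and check
# whether the special case survives optimization.
import numpy as np

rng = np.random.default_rng(0)

# Ordinary training data: pairs (a, b) labelled with a + b.
a = rng.uniform(0, 300, size=1000)
b = rng.uniform(0, 300, size=1000)
X = np.column_stack([a, b, np.ones_like(a)])   # features plus a bias term
y = a + b

# Append the single exception we want the system to learn.
X = np.vstack([X, [222.0, 222.0, 1.0]])
y = np.append(y, 555.0)

# Least-squares fit: the best simple hypothesis for the dataset as a whole.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(w, 3))        # ~ [1, 1, small]: plain addition
print(round(X[-1] @ w, 1))   # ~ 444, not 555: the exception is averaged away
```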
You say: “By default, for sufficiently misaligned smart agents that can think about their operators, more like a 99% chance.” I agree that badly misaligned smart agents are likely to try to deceive their operators. But I was discussing the following proposition: “among advanced AI systems that we might plausibly make, there is a 99% chance of deception”. Your claim is about the subset of misaligned agents, not about how likely we are to produce misaligned agents (that might deceive us).
I take it that 23 shows that all systems have incentives not to be turned off. I don’t think this shows that there is a 99% chance that AI systems will deceive their programmers.
Thanks for the three-point argument; that is clarifying. I agree that if those premises are true, then we should expect AI systems to seek power over human operators who might try to turn them off or change their goals. If the goal is something like ‘increase total human welfare’ and the AI has a different idea about that than its operator, then the AI will try to disempower the operator in one way or another. But I’m not sure I see why this is necessarily a bad outcome. The AI might still be good at advancing human welfare even if human operators are disempowered. If so, that seems like a good outcome, from a utilitarian point of view.
This gets back to some of the ambiguity about alignment that pops up in the AI safety literature. I have been informally asking people working on AI what they mean by alignment over the last year, and nearly every answer has been importantly different from any of the others. To me, getting an AI to improve sentient life seems like a good result, even if human controllers are disempowered.
Holden saw your questions and decided to write a new series to explain.

“I don’t think this shows that there is a 99% chance that AI systems will deceive their programmers.”
Agreed. I wasn’t trying to argue for a specific probability assignment; that seems hard, and it seems harder to reach extreme probabilities if you’re new to the field and haven’t searched around for counter-arguments, counter-counter-arguments, etc.
“The AI might still be good at advancing human welfare even if human operators are disempowered. If so, that seems like a good outcome, from a utilitarian point of view.”
In the vast majority of ‘AGI with a random goal trying to deceive you’ scenarios, I think the random goal produces outcomes like paperclips, rather than ‘sort-of-good’ outcomes.
I think the same in the case of ‘AGI with a goal sort-of related to advancing human welfare in the training set’, though the argument for this is less obvious.
I think “Complex Value Systems are Required to Realize Valuable Futures” is a good overview: human values are highly multidimensional, and in such a way that there are many different dimensions where a slightly wrong answer can lose you all of the value. Structurally, it’s like a combination lock: getting 9/10 of the numbers correct gets you 0% of the value, but getting 10/10 right gets you 100% of the value.
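A minimal sketch of that combination-lock structure (my own construction; the ten ‘dimensions’ and the tolerance are illustrative assumptions):

```python
# Combination-lock value structure: value is realized only when every
# dimension is (approximately) right simultaneously, so a 9/10-correct
# outcome yields 0% of the value, not 90%.
import numpy as np

true_values = np.array([0.3, 0.8, 0.1, 0.5, 0.9, 0.2, 0.7, 0.4, 0.6, 0.05])

def realized_value(outcome, tol=0.01):
    # All ten dimensions must be within tolerance at once.
    return 1.0 if np.all(np.abs(outcome - true_values) < tol) else 0.0

nine_of_ten = true_values.copy()
nine_of_ten[3] = 0.0                 # one dimension badly wrong

print(realized_value(true_values))   # 1.0 -> 100% of the value
print(realized_value(nine_of_ten))   # 0.0 -> 0%, despite 9/10 dimensions correct
```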
Also relevant is Stuart Russell’s point:

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer’s apprentice, or King Midas: you get exactly what you ask for, not what you want.
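Russell’s point is easy to reproduce in miniature. A sketch of my own (the three-variable resource model and the use of SciPy’s linprog are illustrative choices, not anything from the quoted text):

```python
# The objective names only a subset of the variables; the optimizer is then
# free to push the unnamed ones to extremes. Here three quantities share one
# resource budget, but only "paperclips" appears in the objective.
import numpy as np
from scipy.optimize import linprog

# Variables: [paperclips, parkland, human_welfare].
c = np.array([-1.0, 0.0, 0.0])       # linprog minimizes, so negate to maximize x[0]

# Shared constraint: total resources are limited.
A_ub = np.array([[1.0, 1.0, 1.0]])
b_ub = np.array([1.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3)
print(res.x)   # [1. 0. 0.]: the unvalued variables are driven to an extreme (zero)
```

The same toy also previews a point made further down: powerfully optimizing the one named value crowds the unnamed ones out entirely, because every unit of resource left to them is a unit not spent on the objective.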
And Goodhart’s Curse:

Goodhart’s Curse in this form says that a powerful agent neutrally optimizing a proxy measure U that we hoped to align with true values V, will implicitly seek out upward divergences of U from V.
In other words: powerfully optimizing for a utility function is strongly liable to blow up anything we’d regard as an error in defining that utility function.
[...] Suppose the humans have true values V. We try to convey these values to a powerful AI, via some value learning methodology that ends up giving the AI a utility function U.
Even if U is locally an unbiased estimator of V, optimizing U will seek out what we would regard as ‘errors in the definition’, places where U diverges upward from V. Optimizing for a high U may implicitly seek out regions where U - V is high; that is, places where V is lower than U. This may especially include regions of the outcome space or policy space where the value learning system was subject to great variance; that is, places where the value learning worked poorly or ran into a snag.
Goodhart’s Curse would be expected to grow worse as the AI became more powerful. A more powerful AI would be implicitly searching a larger space and would have more opportunity to uncover what we’d regard as “errors”; it would be able to find smaller loopholes, blow up more minor flaws.
[...] We could see the genie as implicitly or emergently seeking out any possible loophole in the wish: Not because it is an evil genie that knows our ‘truly intended’ V and is looking for some place that V can be minimized while appearing to satisfy U; but just because the genie is neutrally seeking out very large values of U and these are places where it is unusually likely that U diverged upward from V.
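The ‘implicitly seek out upward divergences’ behavior falls out of pure statistics, with no adversarial intent anywhere in the model. A small simulation of my own (the Gaussian proxy-error setup is an assumption for illustration):

```python
# Goodhart's Curse as selection bias: U is an unbiased, noisy proxy for the
# true value V, yet the option with the highest U systematically has U > V,
# and the gap grows as the optimizer searches more options.
import numpy as np

rng = np.random.default_rng(0)
trials = 2000

for n_options in [10, 100, 1000]:
    gaps = []
    for _ in range(trials):
        V = rng.normal(0.0, 1.0, size=n_options)       # true value of each option
        U = V + rng.normal(0.0, 1.0, size=n_options)   # unbiased but noisy proxy
        best = np.argmax(U)                            # agent picks the U-maximizing option
        gaps.append(U[best] - V[best])                 # upward divergence of U from V
    print(n_options, round(float(np.mean(gaps)), 2))   # positive, and growing with n_options
```

A ‘more powerful’ optimizer here is just one that evaluates more options; nothing about the proxy gets worse, but the realized shortfall grows anyway.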
So part of the issue is that human values inherently require getting a lot of bits correct simultaneously, in order to produce any value. (And also, getting a lot of the bits right while getting a few wrong can pose serious s-risks.)
Another part of the problem is that powerfully optimizing one value will tend to crowd out other values.
And a third part of the problem is that insofar as there are flaws in our specification of what we value, AGI is likely to disproportionately seek out and exploit those flaws, since “places where our specification of what’s good was wrong” are especially likely to include more “places where you can score extremely high on the specification”.
“To me, getting an AI to improve sentient life seems like a good result, even if human controllers are disempowered.”
Agreed! If I thought a misaligned AGI were likely to produce an awesome flourishing civilization (but kill humans in the process), I would be vastly less worried. By far the main reason I’m worried is that I expect misaligned AGI to produce things morally equivalent to “granite spheres” instead.
Thanks for this, and sorry for the slow reply. OK, great, so your earlier thought was that even if we tried to give the AI welfarist goals, it would most likely end up with some random goal like optimising granite spheres or paperclips?
I will give the resources you shared a read. Thanks for the interesting discussion!