I don’t think this shows that there is a 99% chance that AI systems will deceive their programmers.
Agreed. I wasn’t trying to argue for a specific probability assignment; that seems hard, and it seems harder to reach extreme probabilities if you’re new to the field and haven’t searched around for counter-arguments, counter-counter-arguments, etc.
The AI might still be good at advancing human welfare even if human operators are disempowered. If so, that seems like a good outcome, from a utilitarian point of view.
In the vast majority of ‘AGI with a random goal trying to deceive you’ scenarios, I think the random goal produces outcomes like paperclips, rather than ‘sort-of-good’ outcomes.
I think the same in the case of ‘AGI with a goal sort-of related to advancing human welfare in the training set’, though the argument for this is less obvious.
I think “Complex Value Systems are Required to Realize Valuable Futures” is a good overview: human values are highly multidimensional, in such a way that there are many different dimensions where a slightly wrong answer can lose you all of the value. Structurally, it’s like a combination lock: getting 9⁄10 of the numbers correct gets you 0% of the value, while getting 10⁄10 right gets you 100%.
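To make that lock-like structure concrete, here’s a toy sketch (my own illustrative value function, not from the paper):

```python
# Toy "combination lock" value function: the payoff is all-or-nothing,
# realized only if every dimension of the specification is correct.
def realized_value(dials_correct):
    return 1.0 if all(dials_correct) else 0.0

print(realized_value([True] * 10))           # 10/10 correct -> 1.0 (all of the value)
print(realized_value([True] * 9 + [False]))  # 9/10 correct  -> 0.0 (none of it)
```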
Also relevant is Stuart Russell’s point:

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer’s apprentice, or King Midas: you get exactly what you ask for, not what you want.
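A minimal sketch of that failure mode, under toy assumptions of my own (five activities sharing one unit of a resource, and an objective that only rewards the first two):

```python
import numpy as np
from scipy.optimize import linprog

# The objective depends only on a subset (x0, x1) of the n = 5 variables;
# x2..x4 are things we care about but left out of the objective.
c = -np.array([2.0, 1.0, 0.0, 0.0, 0.0])  # negate: linprog minimizes, we want to maximize 2*x0 + x1
res = linprog(c, A_eq=[[1.0] * 5], b_eq=[1.0], bounds=[(0.0, 1.0)] * 5)

print(np.round(res.x, 3))  # [1. 0. 0. 0. 0.]: the unrewarded variables are pinned to an extreme (zero)
```

The optimizer isn’t malicious; starving x2..x4 is simply the cheapest way to buy more of what the objective does measure.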
And Goodhart’s Curse:

Goodhart’s Curse in this form says that a powerful agent neutrally optimizing a proxy measure U that we hoped to align with true values V, will implicitly seek out upward divergences of U from V.
In other words: powerfully optimizing for a utility function is strongly liable to blow up anything we’d regard as an error in defining that utility function.
[...] Suppose the humans have true values V. We try to convey these values to a powerful AI, via some value learning methodology that ends up giving the AI a utility function U.
Even if U is locally an unbiased estimator of V, optimizing U will seek out what we would regard as ‘errors in the definition’, places where U diverges upward from V. Optimizing for a high U may implicitly seek out regions where U - V is high; that is, places where V is lower than U. This may especially include regions of the outcome space or policy space where the value learning system was subject to great variance; that is, places where the value learning worked poorly or ran into a snag.
Goodhart’s Curse would be expected to grow worse as the AI became more powerful. A more powerful AI would be implicitly searching a larger space and would have more opportunity to uncover what we’d regard as “errors”; it would be able to find smaller loopholes, blow up more minor flaws.
[...] We could see the genie as implicitly or emergently seeking out any possible loophole in the wish: Not because it is an evil genie that knows our ‘truly intended’ V and is looking for some place that V can be minimized while appearing to satisfy U; but just because the genie is neutrally seeking out very large values of U and these are places where it is unusually likely that U diverged upward from V.
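Here’s a minimal simulation of the curse, using a toy setup of my own rather than anything from the article: U is an unbiased noisy estimate of V, yet picking the option with the highest U systematically picks options where U overshot V, and the overshoot grows as the search gets more powerful:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_overshoot(n_options, trials=500):
    V = rng.normal(size=(trials, n_options))      # true values
    U = V + rng.normal(size=(trials, n_options))  # unbiased proxy: E[U | V] = V
    picks = U.argmax(axis=1)                      # optimize the proxy
    rows = np.arange(trials)
    return (U - V)[rows, picks].mean()            # average U - V at the chosen option

for n in (10, 100, 10_000):                       # a more powerful agent searches more options
    print(f"n = {n:>6}: mean U - V at argmax U = {mean_overshoot(n):.2f}")
```

Each individual estimate is unbiased, but the selection is not: the harder you optimize U, the more the winning option owes its high score to estimation error rather than to V, which is the sense in which a more powerful search “blows up” smaller flaws.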
So part of the issue is that human values inherently require getting a lot of bits correct simultaneously, in order to produce any value. (And also, getting a lot of the bits right while getting a few wrong can pose serious s-risks.)
Another part of the problem is that powerfully optimizing one value will tend to crowd out other values.
And a third part of the problem is that insofar as there are flaws in our specification of what we value, AGI is likely to disproportionately seek out and exploit those flaws, since “places where our specification of what’s good was wrong” are especially likely to include more “places where you can score extremely high on the specification”.
To me, getting an AI to improve sentient life seems like a good result, even if human controllers are disempowered.
Agreed! If I thought a misaligned AGI were likely to produce an awesome flourishing civilization (but kill humans in the process), I would be vastly less worried. By far the main reason I’m worried is that I expect misaligned AGI to produce things morally equivalent to “granite spheres” instead.
Thanks for this, and sorry for the slow reply. OK, great, so your earlier thought was that even if we tried to give the AI welfarist goals, it would most likely end up with some random goal like optimising granite spheres or paperclips?
I will give the resources you shared a read. Thanks for the interesting discussion!