Again, this remark seems explicitly to assume that the AI is maximising some kind of reward function. Humans often act not as maximisers but as satisficers, choosing an outcome that is good enough rather than searching for the best possible outcome. Often humans also act on the basis of habit or following simple rules of thumb, and are often risk averse. As such, I believe that to assume that an AI agent would be necessarily maximising its reward is to make fairly strong assumptions about the nature of the AI in question. Absent these assumptions, it is not obvious why an AI would necessarily have any particular reason to usurp humanity.
Imagine that, when you wake up tomorrow morning, you will have acquired a magical ability to reach in and modify your own brain connections however you like.
Over breakfast, you start thinking about how frustrating it is that you’re in debt, and feeling annoyed at yourself that you’ve been spending so much money impulse-buying in-app purchases in Farmville. So you open up your new brain-editing console, look up which neocortical generative models were active the last few times you made a Farmville in-app purchase, and lower their prominence, just a bit.
Then you take a shower, and start thinking about the documentary you saw last night about gestation crates. ‘Man, I’m never going to eat pork again!’ you say to yourself. But you’ve said that many times before, and it’s never stuck. So after the shower, you open up your new brain-editing console, and pull up that memory of the gestation crate documentary and the way you felt after watching it, and set that memory and emotion to activate loudly every time you feel tempted to eat pork, for the rest of your life.
Do you see the direction that things are going? As time goes on, if an agent has the power of both meta-cognition and self-modification, any one of its human-like goals (quasi-goals which are context-dependent, self-contradictory, satisficing, etc.) can gradually transform itself into a utility-function-like goal (which is self-consistent, all-consuming, maximizing)! To be explicit: during the little bits of time when one particular goal happens to be salient and determining behavior, the agent may be motivated to “fix” any part of itself that gets in the way of that goal, until bit by bit, that one goal gradually cements its control over the whole system.
Moreover, if the agent does gradually self-modify from human-like quasi-goals to an all-consuming utility-function-like goal, then I would think it’s very difficult to predict exactly what goal it will wind up having. And most goals have problematic convergent instrumental sub-goals that could make them into x-risks.
...Well, at least, I find this a plausible argument, and don’t see any straightforward way to reliably avoid this kind of goal-transformation. But obviously this is super weird and hard to think about and I’m not very confident. :-)
(I think I stole this line of thought from Eliezer Yudkowsky but can’t find the reference.)
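To make the goal-transformation story a bit more concrete, here is a toy simulation. It is only a sketch under made-up assumptions, not a model of any real brain or AI: the function name `simulate_goal_takeover`, the `nudge` parameter, and the rule that a goal becomes salient with probability proportional to its current weight are all hypothetical choices of mine. Each step, whichever quasi-goal happens to be salient turns its rivals down "just a bit", like the brain-editing console above.

```python
import random


def simulate_goal_takeover(weights, steps=2000, nudge=0.05, seed=0):
    """Toy model (hypothetical): the salient quasi-goal slightly weakens its rivals.

    weights: initial relative strengths of the agent's quasi-goals.
    nudge:   fraction by which each competing goal is turned down per step.
    Returns the normalized goal weights after `steps` self-modifications.
    """
    rng = random.Random(seed)
    total = sum(weights)
    w = [x / total for x in weights]
    for _ in range(steps):
        # Assumption: a goal becomes salient with probability proportional
        # to its current weight.
        salient = rng.choices(range(len(w)), weights=w, k=1)[0]
        # While salient, the goal "fixes" whatever gets in its way:
        # every other goal is turned down by a small factor.
        w = [x if i == salient else x * (1.0 - nudge) for i, x in enumerate(w)]
        # Renormalize so the weights remain a share of total influence.
        total = sum(w)
        w = [x / total for x in w]
    return w


if __name__ == "__main__":
    # Three quasi-goals that start out exactly equal.
    for seed in range(3):
        final = simulate_goal_takeover([1.0, 1.0, 1.0], seed=seed)
        print(seed, [round(x, 3) for x in final])
```

In this toy setup a balanced mix of goals is unstable: small random leads get amplified until one goal holds nearly all the weight. Which goal wins depends on the random seed, which is the toy version of the point that the final goal is hard to predict.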
Everything up to here is actually just one of several lines of thought that lead to the conclusion that we might well get an AGI that is trying to maximize a reward.
Another line of thought is what Rohin said: We’ve been using reward functions since forever, so it’s quite possible that we’ll keep doing so.
Another line of thought is: We humans actually have explicit real-world goals, like curing Alzheimer’s, solving climate change, etc. And generally the best way to achieve a goal is to have an agent that is actively seeking it.
Another line of thought is: Different people will try to make AGIs in different ways, and it’s a big world, and (eventually by default) there will be very low barriers-to-entry in building AGIs. So (again by default) sooner or later someone will make an explicitly-goal-seeking AGI, even if thoughtful AGI experts pronounce that doing so is a terrible idea.