Hi Ofer,

Thanks for the comment!

I actually do think that the instrumental convergence thesis, specifically, can be mapped over fine, since it’s a fairly abstract principle. For example, this recent paper formalizes the thesis within a standard reinforcement learning framework. I just think that the thesis at most weakly suggests existential doom, unless we add in some other substantive theses. I have some short comments on the paper, explaining my thoughts, here.
Beyond the instrumental convergence thesis, though, I do think that some bits of the classic arguments are awkward to fit onto concrete and plausible ML-based development scenarios: for example, the focus on recursive self-improvement, and the use of thought experiments in which natural language commands, when interpreted literally and single-mindedly, lead to unforeseen bad behaviors. I think that Reframing Superintelligence does a good job of pointing out some of the tensions between classic ways of thinking and talking about AI risk and current/plausible ML engineering practices.
For the sake of concreteness, consider the algorithm that Facebook uses to create the feed that each user sees (which is an example that Stuart Russell has used). There may be very little public information about that algorithm, but it’s reasonable to guess that they’re using some deep RL algorithm and a reward function that roughly corresponds to user engagement. Conditional on that, do you agree that in the limit (i.e. when using whatever algorithm and architecture they’re currently using, at a sufficiently large scale), the arguments about instrumental convergence seem to apply?
This may not be what you have in mind, but: I would be surprised if the FB newsfeed selection algorithm became existentially damaging (e.g. omnicidal), even in the limit of tremendous amounts of training data and compute. I don’t know how the algorithm actually works, but as a simplification: let’s imagine that it produces an ordered list of posts to show a user, from the set of recent posts by their friends, and that it’s trained using something like the length of the user’s FB browsing session as the reward. I think that, if you kept training it, nothing too weird would happen. It might produce some unintended social harms (like addiction, polarization, etc.), but the system wouldn’t, in any meaningful sense, have long-run objectives (due to the shortness of sessions). It also probably wouldn’t have the ability or inclination to manipulate the external world in the pursuit of complex schemes. Figuring out how to manipulate the external world in precise ways would require a huge amount of very weird exploration, deep in a section of the space of possible policies where most of the policies are terrible at maximizing reward; in the unlikely event that the necessary exploration happened, and the policy started moving in this direction, I think it would be conspicuous before the newsfeed selection algorithm does something like kill everyone to prevent ongoing FB sessions from ending (if this is indeed possible given the system’s limited space of possible actions).
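To make the simplification above concrete, here is a minimal sketch (in Python, with every detail invented for illustration: the feature dimension, the toy user model, the single-post action space) of the kind of training loop being imagined, where each episode is one session and the reward is that session’s length:

```python
# Minimal sketch (invented for illustration): a REINFORCE-trained linear policy
# picks which candidate post to put at the top of a user's feed, and is rewarded
# with the simulated length of that single session.
import numpy as np

rng = np.random.default_rng(0)
N_POSTS, N_FEATURES = 20, 8      # candidate posts per feed, features per post
theta = np.zeros(N_FEATURES)     # weights of a linear scoring policy
ALPHA = 0.05                     # learning rate

def simulated_session_minutes(post):
    # Toy user model: a hidden feature direction drives engagement.
    hidden_pref = np.linspace(1.0, -1.0, N_FEATURES)
    return max(0.0, float(post @ hidden_pref) + rng.normal(0.0, 0.5))

for episode in range(2000):                            # one episode = one session
    posts = rng.normal(size=(N_POSTS, N_FEATURES))     # today's candidate posts
    scores = posts @ theta
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    shown = rng.choice(N_POSTS, p=probs)               # post shown at the top of the feed
    reward = simulated_session_minutes(posts[shown])   # reward = session length
    grad_log_prob = posts[shown] - probs @ posts       # REINFORCE: reinforce high-reward picks
    theta += ALPHA * reward * grad_log_prob
```

In this toy version the reward and the episode both end with the session, which is the sense in which “the shortness of sessions” limits what the trained policy is optimizing over.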
Thanks for the thoughtful reply!

Would you say that the treacherous turn argument can also be mapped over to contemporary ML methods (similarly to the instrumental convergence thesis) due to it being a fairly abstract principle?
Also, why is “recursive self-improvement” awkward to fit onto concrete and plausible ML-based development scenarios? (Setting aside the incorrect usage of the word “recursive” here; the concept should have been called “iterative self-improvement”.) Consider the work that has been done on neural architecture search via reinforcement learning (this 2016 paper on that topic currently has 1,775 citations on Google Scholar, including 560 citations from 2020). It doesn’t seem extremely unlikely that such a technique will be used, at some point in the future, in some iterative self-improvement setup, in a way that may cause an existential catastrophe.
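For concreteness, here is a toy, self-contained sketch of the outer loop in RL-based architecture search: a controller distribution over discrete architecture choices is updated with REINFORCE so that higher-scoring “child” architectures become more likely. Everything here (the four slots, the scoring function, the hyperparameters) is an invented stand-in rather than the cited paper’s method; an iterative self-improvement setup would be one in which the search procedure is itself among the things being improved.

```python
# Toy sketch of RL-based neural architecture search (invented for illustration,
# not the cited paper's code): a "controller" holds a softmax distribution over
# 4 discrete choices in each of 4 architecture slots, and is updated with
# REINFORCE (no baseline, for brevity) so that higher-scoring architectures
# become more likely to be sampled.
import numpy as np

rng = np.random.default_rng(1)
BEST = np.array([2, 0, 3, 1])    # pretend this hidden configuration performs best

def child_validation_accuracy(arch):
    # Stand-in for "train a child network with this architecture and evaluate it".
    return 1.0 - 0.2 * np.sum(arch != BEST) + rng.normal(0.0, 0.02)

logits = np.zeros((4, 4))        # controller parameters: one row of logits per slot
for step in range(3000):
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    arch = np.array([rng.choice(4, p=p) for p in probs])   # sample an architecture
    reward = child_validation_accuracy(arch)
    for slot, choice in enumerate(arch):                    # REINFORCE update per slot
        grad = -probs[slot]
        grad[choice] += 1.0
        logits[slot] += 0.1 * reward * grad
```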
Regarding the example with the agent that creates the feed of each FB user:
the system wouldn’t, in any meaningful sense, have long-run objectives (due to the shortness of sessions).
I agree that the specified time horizon (and discount factor) is important, and that a shorter time horizon seems safer. But note that FB is incentivized to specify a long time horizon. For example, suppose the feed-creation-agent shows a user a horrible post by some troll, which causes the user to spend many hours in a heated back-and-forth with said troll. Consequently, the user decides FB sucks and ends up getting off FB for many months. If the specified time horizon is sufficiently short (or the discount factor is sufficiently small), then from the perspective of the training process the agent did well when it showed the user that post, and the agent’s policy network will be updated in a way that makes such decisions more likely. FB doesn’t want that. FB’s actual discount factor for users’ engagement time may be very close to 1 (i.e. a user spending an hour on FB today is not 100x more valuable to FB than the user spending an hour on FB next month). This situation is not unique to FB. Many companies that use RL agents that act in the real world have long-term preferences with respect to how their RL agents act.
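A small worked example (with invented numbers) of how the per-day discount factor changes the verdict on the troll post: suppose showing it yields 3 extra hours of engagement on day 0, but then costs 1 hour of engagement per day for the next 180 days while the user stays away.

```python
# Invented numbers for illustration: +3 hours of engagement on day 0 from the
# troll post, then 1 hour/day of engagement lost for 180 days while the user
# stays off FB.
def discounted_return(gamma, horizon_days=180):
    immediate_gain = 3.0
    discounted_loss = sum(1.0 * gamma**t for t in range(1, horizon_days + 1))
    return immediate_gain - discounted_loss

print(discounted_return(gamma=0.5))    # heavy per-day discounting: about +2.0, troll post looks good
print(discounted_return(gamma=0.999))  # gamma near 1: about -162, troll post looks bad
```

With a per-day discount factor near 1, the months of lost engagement dominate and the troll post is penalized; with heavy discounting (or a horizon that ends with the session), the immediate gain dominates and showing the troll post is reinforced.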
It also probably wouldn’t have the ability or inclination to manipulate the external world in the pursuit of complex schemes.
Regarding the “inclination” part: Manipulating the “external world” (what other environment does the feed-creation-agent model?) in the pursuit of certain complex schemes is very useful for maximizing the user engagement metric (which, by assumption, corresponds to the specified reward function). Also, I don’t see how the “wouldn’t have the ability” part is justified in the limit as training compute, architecture size, and training data grow to infinity.
Figuring out how to manipulate the external world in precise ways would require a huge amount of very weird exploration, deep in a section of the space of possible policies where most of the policies are terrible at maximizing reward
We expect the training process to update the policy network in a way that makes the agent more intelligent (i.e. better at modeling the world and causal chains therein, better at planning, etc.), because that is useful for maximizing the sum of discounted rewards. So I don’t understand how your above argument works, unless you’re arguing that there’s some upper bound on the level of intelligence that we can expect deep RL algorithms to yield, and that upper bound is below the minimum level for an agent to pose existential risk due to instrumental convergence.
in the unlikely event that the necessary exploration happened, and the policy started moving in this direction, I think it would be conspicuous before the newsfeed selection algorithm does something like kill everyone to prevent ongoing FB sessions from ending
We should expect a sufficiently intelligent agent [EDIT: that acts in the real world] to refrain from behaving in a way that is both unacceptable and conspicuous, as long as we can turn it off (that’s the treacherous turn argument). The question is whether the agent will do something sufficiently alarming and conspicuous before the point where it is intelligent enough to realize it should not cause alarm. I don’t think we can be very confident either way.