Hi Ben,

You suggested in the podcast that it’s not clear how to map some of the classic arguments—and especially their manifestation in thought experiments like the paper clip maximizer—to contemporary machine learning methods. I’d like to push back on that view.
Deep reinforcement learning is a popular contemporary ML approach for training agents that act in simulated and real-world environments. In deep RL, an agent is trained to maximize its reward (more precisely, the sum of discounted rewards over time steps), which perfectly fits the “agent” abstraction that is used throughout the book Superintelligence. I don’t see how classic arguments about the behavior of utility-maximizing agents fail to apply to deep RL agents. Suppose we replace every occurrence of the word “agent” in the classic arguments with “deep RL agent”; are the modified arguments false? Here’s the result of doing just that for the instrumental convergence thesis (the original version is from Superintelligence, p. 109):
> Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the deep RL agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by a broad spectrum of situated intelligent deep RL agents.
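A quick aside on the “sum of discounted rewards over time steps” objective mentioned above: here is a minimal sketch of that quantity, with made-up numbers (nothing below is specific to any real system):

```python
# Toy illustration of the deep RL training objective: the discounted return
#   G = sum over t of gamma^t * r_t,
# where r_t is the reward received at time step t and gamma in [0, 1) is the
# discount factor. All numbers are invented for illustration.

def discounted_return(rewards, gamma):
    """Sum of discounted rewards over time steps."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# With gamma = 0.9, later rewards count for progressively less:
print(round(discounted_return([1.0, 1.0, 1.0], 0.9), 2))  # 1 + 0.9 + 0.81 -> 2.71
```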
For the sake of concreteness, consider the algorithm that Facebook uses to create the feed that each user sees (which is an example that Stuart Russell has used). Perhaps there’s very little public information about that algorithm, but it’s reasonable to guess they’re using some deep RL algorithm and a reward function that roughly corresponds to user engagement. Conditioned on that, do you agree that in the limit (i.e. when using whatever algorithm and architecture they’re currently using, at a sufficiently large scale), the arguments about instrumental convergence seem to apply?
Regarding the treacherous turn problem, you said:
> [...] if you do imagine things would be gradual, then it seems like before you encounter any attempts at deception that have globally catastrophic or existential significance, you probably should expect to see some amount of either failed attempts at deception or attempts at deception that exist, but they’re not totally, totally catastrophic. You should probably see some systems doing this thing of hiding the fact that they have different goals. And notice this before you’re at the point where things are just so, so competent that they’re able to, say, destroy the world or something like that.
Suppose Facebook’s scaled-up-algorithm-for-feed-creation would behave deceptively in some way. Suppose it uses some unacceptable technique to increase user engagement (e.g. making users depressed), but it refrains from doing so whenever there’s a risk that Facebook engineers would notice. How confident should we be that Facebook engineers would notice the deceptive behavior (i.e. the avoidance of unacceptable behavior in situations where the unacceptable behavior might be noticed)?
Hi Ofer,

Thanks for the comment!

I actually do think that the instrumental convergence thesis, specifically, can be mapped over fine, since it’s a fairly abstract principle. For example, this recent paper formalizes the thesis within a standard reinforcement learning framework. I just think that the thesis at most weakly suggests existential doom, unless we add in some other substantive theses. I have some short comments on the paper, explaining my thoughts, here.
Beyond the instrumental convergence thesis, though, I do think that some bits of the classic arguments are awkward to fit onto concrete and plausible ML-based development scenarios: for example, the focus on recursive self-improvement, and the use of thought experiments in which natural language commands, when interpreted literally and single-mindedly, lead to unforeseen bad behaviors. I think that Reframing Superintelligence does a good job of pointing out some of the tensions between classic ways of thinking and talking about AI risk and current/plausible ML engineering practices.
> For the sake of concreteness, consider the algorithm that Facebook uses to create the feed that each user sees (which is an example that Stuart Russell has used). Perhaps there’s very little public information about that algorithm, but it’s reasonable to guess they’re using some deep RL algorithm and a reward function that roughly corresponds to user engagement. Conditioned on that, do you agree that in the limit (i.e. when using whatever algorithm and architecture they’re currently using, at a sufficiently large scale), the arguments about instrumental convergence seem to apply?
This may not be what you have in mind, but: I would be surprised if the FB newsfeed selection algorithm became existentially damaging (e.g. omnicidal), even in the limit of tremendous amounts of training data and compute. I don’t know how the algorithm actually works, but as a simplification: let’s imagine that it produces an ordered list of posts to show a user, from the set of recent posts by their friends, and that it’s trained using something like the length of the user’s FB browsing session as the reward. I think that, if you kept training it, nothing too weird would happen. It might produce some unintended social harms (like addiction, polarization, etc.), but the system wouldn’t, in any meaningful sense, have long-run objectives (due to the shortness of sessions). It also probably wouldn’t have the ability or inclination to manipulate the external world in the pursuit of complex schemes. Figuring out how to manipulate the external world in precise ways would require a huge amount of very weird exploration, deep in a section of the space of possible policies where most of the policies are terrible at maximizing reward; in the unlikely event that the necessary exploration happened, and the policy started moving in this direction, I think it would be conspicuous before the newsfeed selection algorithm does something like kill everyone to prevent ongoing FB sessions from ending (if this is indeed possible given the system’s limited space of possible actions).
Thanks for the thoughtful reply!

> Beyond the instrumental convergence thesis, though, I do think that some bits of the classic arguments are awkward to fit onto concrete and plausible ML-based development scenarios: for example, the focus on recursive self-improvement, and the use of thought experiments in which natural language commands, when interpreted literally and single-mindedly, lead to unforeseen bad behaviors. I think that Reframing Superintelligence does a good job of pointing out some of the tensions between classic ways of thinking and talking about AI risk and current/plausible ML engineering practices.
Would you say that the treacherous turn argument can also be mapped over to contemporary ML methods (similarly to the instrumental convergence thesis) due to it being a fairly abstract principle?
Also, why is “recursive self-improvement” awkward to fit onto concrete and plausible ML-based development scenarios? (Let’s set aside the incorrect usage of the word “recursive” here; the concept should have been called “iterative self-improvement”.) Consider the work that has been done on neural architecture search via reinforcement learning (this 2016 paper on that topic currently has 1,775 citations on Google Scholar, including 560 citations from 2020). It doesn’t seem extremely unlikely that such a technique will be used, at some point in the future, in some iterative self-improvement setup, in a way that may cause an existential catastrophe.
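To make “iterative self-improvement setup” a bit more concrete, here is a deliberately toy sketch of the loop structure. The mutation scheme, the score function, and all numbers are invented for illustration; a real RL-based architecture search (as in the 2016 paper) is far more involved:

```python
import random

def propose(arch, rng):
    """Mutate one hyperparameter of the current best architecture."""
    layers, width = arch
    if rng.random() < 0.5:
        return (max(1, layers + rng.choice([-1, 1])), width)
    return (layers, max(8, width + rng.choice([-8, 8])))

def score(arch):
    """Stand-in for 'train the candidate and measure validation accuracy'."""
    layers, width = arch
    return -abs(layers - 4) - abs(width - 64) / 8

def search(steps=500, seed=0):
    rng = random.Random(seed)
    best = (1, 8)
    for _ in range(steps):
        candidate = propose(best, rng)
        if score(candidate) > score(best):
            best = candidate  # the loop's own output seeds the next round
    return best

print(search())  # hill-climbs toward whatever the score function rewards
```

The point of the sketch is only structural: each round starts from the previous round’s output, which is the “iterative” part of the concern.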
Regarding the example with the agent that creates the feed of each FB user:
> the system wouldn’t, in any meaningful sense, have long-run objectives (due to the shortness of sessions).
I agree that the specified time horizon (and discount factor) is important, and that a shorter time horizon seems safer. But note that FB is incentivized to specify a long time horizon. For example, suppose the feed-creation-agent shows a user a horrible post by some troll, which causes the user to spend many hours in a heated back-and-forth with said troll. Consequently, the user decides FB sucks and ends up getting off FB for many months. If the specified time horizon is sufficiently short (or the discount factor is sufficiently small), then from the perspective of the training process the agent did well when it showed the user that post, and the agent’s policy network will be updated in a way that makes such decisions more likely. FB doesn’t want that. FB’s actual discount factor for users’ engagement time may be very close to 1 (i.e. a user spending an hour on FB today is not 100x more valuable to FB than the user spending an hour on FB next month). This situation is not unique to FB. Many companies that use RL agents that act in the real world have long-term preferences with respect to how their RL agents act.
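The arithmetic here is worth making explicit. With a per-day discount factor, a reward arriving 30 days out is weighted by that factor raised to the 30th power (numbers below are invented for illustration):

```python
# How much a reward 30 days away is down-weighted under different
# per-day discount factors (illustrative values only).

def weight_after(days, gamma_per_day):
    return gamma_per_day ** days

print(round(weight_after(30, 0.8), 4))    # ~0.0012: a month out is nearly worthless
print(round(weight_after(30, 0.999), 4))  # ~0.9704: a month out still counts almost fully
```

So a training setup that cares about month-scale retention effectively needs a per-day discount factor very close to 1, which is the long-horizon regime.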
> It also probably wouldn’t have the ability or inclination to manipulate the external world in the pursuit of complex schemes.
Regarding the “inclination” part: manipulating the “external world” (what other environment does the feed-creation-agent model?) in the pursuit of certain complex schemes is very useful for maximizing the user engagement metric (which, by assumption, corresponds to the specified reward function). Also, I don’t see how the “wouldn’t have the ability” part is justified in the limit as the amount of training compute (and architecture size) and data grows to infinity.
> Figuring out how to manipulate the external world in precise ways would require a huge amount of very weird exploration, deep in a section of the space of possible policies where most of the policies are terrible at maximizing reward
We expect the training process to update the policy network in a way that makes the agent more intelligent (i.e. better at modeling the world and causal chains therein, better at planning, etc.), because that is useful for maximizing the sum of discounted rewards. So I don’t understand how your above argument works, unless you’re arguing that there’s some upper bound on the level of intelligence that we can expect deep RL algorithms to yield, and that upper bound is below the minimum level for an agent to pose existential risk due to instrumental convergence.
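The update rule under discussion can be sketched in miniature. Below is a minimal REINFORCE-style policy gradient on a two-armed bandit (a toy stand-in, not any real feed-ranking system): the parameter is nudged in whichever direction increases expected reward, with no constraint on how the resulting policy achieves that.

```python
import math
import random

def reinforce(steps=2000, lr=0.1, seed=0):
    """Toy REINFORCE on a two-armed bandit; arm 1 pays 1.0, arm 0 pays 0.2."""
    rng = random.Random(seed)
    theta = 0.0  # logit of choosing arm 1
    for _ in range(steps):
        p1 = 1.0 / (1.0 + math.exp(-theta))
        action = 1 if rng.random() < p1 else 0
        reward = 1.0 if action == 1 else 0.2
        # Gradient of log pi(action) w.r.t. theta is (action - p1);
        # ascend the reward-weighted gradient.
        theta += lr * reward * (action - p1)
    return 1.0 / (1.0 + math.exp(-theta))

print(reinforce())  # drifts toward preferring the higher-reward arm
```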
> in the unlikely event that the necessary exploration happened, and the policy started moving in this direction, I think it would be conspicuous before the newsfeed selection algorithm does something like kill everyone to prevent ongoing FB sessions from ending
We should expect a sufficiently intelligent agent [EDIT: that acts in the real world] to refrain from behaving in a way that is both unacceptable and conspicuous, as long as we can turn it off (that’s the treacherous turn argument). The question is whether the agent will do something sufficiently alarming and conspicuous before the point where it is intelligent enough to realize it should not cause alarm. I don’t think we can be very confident either way.