Hi Ben. I just read the transcript of your 80,000 Hours interview and am curious how you’d respond to the following:
Analogy to agriculture, industry
You say that it would be hard for a single person (or group?) acting far before the agricultural revolution or industrial revolution to impact how those things turned out, so we should be skeptical that we can have much effect now on how an AI revolution turns out.
Do you agree that the goodness of this analogy is roughly proportional to how slow our AI takeoff is? For instance if the first AGI ever created becomes more powerful than the rest of the world, then it seems that anyone who influenced the properties of this AGI would have a huge impact on the future.
Brain-in-a-box
You argue that if we transition more smoothly from super powerful narrow AIs that slowly expand in generality to AGI, we’ll be less caught off guard / better prepared.
It seems that even in a relatively slow takeoff, you wouldn’t need that big of a discontinuity to result in a singleton AI scenario. If the first AGI that’s significantly more generally intelligent than a human is created in a world where lots of powerful narrow AIs exist, wouldn’t having a super smart thing at the center of control of a bunch of narrow AI tools plausibly be way more powerful than having human brains at the center of that control?
It seems plausible that in a “smooth” scenario the gap between the first group creating an AGI and the second group creating an equally powerful one could be months. Do you think a months-long discontinuity is not enough for an AGI to pull sufficiently ahead?
Even if multiple groups create AGIs within a short time, isn’t having a bunch of unaligned AGIs all trying to get power at the same time also an existential risk? It doesn’t seem clear that they’d automatically keep each other in check. One might simply be better at growing or better at sabotaging other AIs. Or if they reach a stalemate they might start cooperating with each other to achieve unaligned goals as a compromise.
Maybe narrow AIs will work better
You say that since today’s AIs are narrow, and since there’s often benefit in specialization, maybe in the future specialized AIs will continue to dominate. You say “maybe the optimal level of generality actually isn’t that high.”
My model is: if you have a central control unit (a human brain, or group of human brains) that is deciding how to use a bunch of narrow AIs, then if you replace that central control unit with one that is more intelligent / faster-acting, the whole system will be more effective.
The only way I can think of where that wouldn’t be true would be if the general AI required so many computational resources that the narrow AIs that were acting as tools of the AGI were crippled by lack of resources. Is that what you’re imagining?
Deadline model of AI progress
You say you disagree with the idea that the day when we create AGI acts as a sort of ‘deadline’, and if we don’t figure out alignment before then we’re screwed.
A lot of your argument is about how increasing AI capability and alignment are intertwined processes, so that as we increase an AI’s capabilities we’re also increasing its alignment. You discuss how it’s not like we’re going to create a super powerful AI and then give it a module with its goals at the end of the process.
I agree with that, but I don’t see it as substantially affecting the Bostrom/Yudkowsky arguments.
Isn’t the idea that we would have something that seemed aligned as we were training it (based on this continuous feedback we were giving it), but then only when it became extremely powerful would we realize it wasn’t actually aligned?
This seems to be a disagreement about “how hard is AI alignment?”. I think Yudkowsky would say that it’s so hard that your AI can look perfectly aligned while it’s less powerful than you, yet you’ve gotten something slightly wrong that only manifests itself once it has taken over. Do you agree that’s a crux?
You talk about how AIs can behave very differently in different environments. Isn’t the environment of an AI which happens to be the most powerful agent on earth fundamentally different from any environment we could provide when training an AI (in terms of resources at its disposal, strategies it might be aware of, etc.)?
Instrumental convergence
You talk about how even if almost all goals would result in instrumental convergence, we’re free to pick any goals we like, so we can pick from a very small subset of all goals which don’t result in instrumental convergence.
It seems like there’s a tradeoff between AI capability and not exhibiting instrumental convergence, since to avoid instrumental convergence you basically need to tell the AI “You’re not allowed to do anything in this broad class of things that will help you achieve your goals.” An AI that amasses power and is willing to kill to achieve its goals is by definition more powerful than one that eschews becoming powerful and killing.
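To state that intuition slightly more formally (my own framing, not something from your interview): if avoiding instrumentally convergent behavior amounts to restricting the system to a smaller set of allowed policies, then its best attainable performance can only go down:

$$\Pi_{\text{safe}} \subseteq \Pi \;\Longrightarrow\; \max_{\pi \in \Pi_{\text{safe}}} J(\pi) \;\le\; \max_{\pi \in \Pi} J(\pi),$$

where $\Pi$ is the unrestricted policy set, $\Pi_{\text{safe}}$ excludes power-seeking and other instrumentally convergent strategies, and $J$ measures how well a policy achieves the goal.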
In a situation where there may be many groups trying to create an AGI, doesn’t this imply that the first AGI that does exhibit instrumental convergence will have a huge advantage over any others?
Hi Elliot,
Thanks for all the questions and comments! I’ll answer this one in stages.
On your first question:
Do you agree that the goodness of this analogy is roughly proportional to how slow our AI takeoff is? For instance if the first AGI ever created becomes more powerful than the rest of the world, then it seems that anyone who influenced the properties of this AGI would have a huge impact on the future.
I agree with this.
To take the fairly extreme case of the Neolithic Revolution, I think that there are at least a few reasons why groups at the time would have had trouble steering the future. One key reason is that the world was highly “anarchic,” in the international relations sense of the term: there were many different political communities, with divergent interests and a limited ability to either coerce one another or form credible commitments. One result of anarchy is that, if the adoption of some technology or cultural/institutional practice would give some group an edge, then it’s almost bound to be adopted by some group at some point: other groups will need to either lose influence or adopt the technology/innovation to avoid subjugation. This explains why the emergence and gradual spread of agricultural civilization was close to inevitable, even though (there’s some evidence) people often preferred the hunter-gatherer way of life. There was an element of technological or economic determinism that put the course of history outside of any individual group’s control (at least to a significant degree).
Another issue, in the context of the Neolithic Revolution, is that norms, institutions, etc., tend to shift over time, even if there aren’t very strong selection pressures. This was even more true before the advent of writing. We do have a few examples of religious or philosophical traditions that have stuck around, at least in mutated forms, for a couple thousand years; but this is unlikely, in any individual case, and would have been even more unlikely 10,000 years ago. At least so far, we also don’t have examples of more formal political institutions (e.g. constitutions) that have largely stuck around for more than a few thousand years either.
There are a couple reasons why AI could be different. The first reason is that—under certain scenarios, especially ones with highly discontinuous and centralized progress—it’s perhaps more likely that one political community will become much more powerful than all others and thereby make the world less “anarchic.” Another is that, especially if the world is non-anarchic, values and institutions might naturally be more stable in a heavily AI-based world. It seems plausible that humans will eventually step almost completely out of the loop, even if they don’t do this immediately after extremely high levels of automation are achieved. At this point, if one particular group has disproportionate influence over the design/use of existing AI systems, then that one group might indeed have a ton of influence over the long-run future.
Thanks to Ben for doing this AMA, and to Elliot for this interesting set of questions!
Just wanted to mention two links that readers might find interesting in this context. Firstly, Tomasik’s Will Future Civilization Eventually Achieve Goal Preservation? Here’s the summary:
Goal preservation is the idea that an agent or civilization might eventually prevent goal drift over time, except perhaps in cases where its current goals approve of goal changes. While consequentialist agents have strong incentive to work toward goal preservation, implementing it in non-trivial, and especially in chaotic, systems seems very difficult. It’s unclear to me how likely a future superintelligent civilization is to ultimately preserve its goals. Even if it does so, there may be significant goal drift between the values of present-day humans and the ultimate goals that a future advanced civilization locks in.
Secondly, Bostrom’s What is a Singleton? Here’s a quote:
In set theory, a singleton is a set with only one member, but as I introduced the notion, the term refers to a world order in which there is a single decision-making agency at the highest level.[1] Among its powers would be (1) the ability to prevent any threats (internal or external) to its own existence and supremacy, and (2) the ability to exert effective control over major features of its domain (including taxation and territorial allocation).
You say you disagree with the idea that the day when we create AGI acts as a sort of ‘deadline’, and if we don’t figure out alignment before then we’re screwed.
A lot of your argument is about how increasing AI capability and alignment are intertwined processes, so that as we increase an AI’s capabilities we’re also increasing its alignment. You discuss how it’s not like we’re going to create a super powerful AI and then give it a module with its goals at the end of the process.
I agree with that, but I don’t see it as substantially affecting the Bostrom/Yudkowsky arguments.
Isn’t the idea that we would have something that seemed aligned as we were training it (based on this continuous feedback we were giving it), but then only when it became extremely powerful would we realize it wasn’t actually aligned?
I think there are a couple different bits to my thinking here, which I sort of smush together in the interview.
The first bit is that, when developing an individual AI system, its goals and capabilities/intelligence tend to take shape together. This is helpful, since it increases the odds that we’ll notice issues with the system’s emerging goals before they result in truly destructive behavior. Even if someone didn’t expect a purely dust-minimizing house-cleaning robot to be a bad idea, for example, they’ll quickly realize their mistake as they train the system. The mistake will be clear well before the point when the simulated robot learns how to take over the world; it will probably be clear even before the point when the robot learns how to operate door knobs.
The second bit is that there are many contexts in which pretty much any possible hand-coded reward function will either quickly reveal itself as inappropriate or be obviously inappropriate before the training process even begins. This means that sane people won’t proceed in developing and deploying things like house-cleaning robots or city planners until they’ve worked out alignment techniques to some degree; they’ll need to wait until we’ve moved beyond “hand-coding” preferences, toward processes that more heavily involve ML systems learning what behaviors users or developers prefer.
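To make the distinction concrete, here’s a toy sketch (everything in it is invented for illustration; it isn’t a description of any real system): a hand-coded reward bakes in a fixed proxy, while even a crude learned reward model gets corrected by human comparisons of actual behavior.

```python
import math
import random

# Toy contrast between a hand-coded reward and a reward model learned from
# human preference comparisons. All names and numbers are invented.

# 1. Hand-coded proxy: "minimize dust". The flaw shows up almost immediately
#    in training, e.g. the simulated robot learns to knock things over so
#    there is more mess to "clean".
def hand_coded_reward(state):
    return -state["dust_level"]

# 2. Learned reward: fit a linear score to pairwise human judgments of the
#    form "trajectory A was better than trajectory B" (a Bradley-Terry-style
#    update, as in standard preference-learning setups).
def learn_reward_model(preferences, featurize, lr=0.1, steps=2000):
    dim = len(featurize(preferences[0][0]))
    w = [0.0] * dim
    for _ in range(steps):
        better, worse = random.choice(preferences)
        fb, fw = featurize(better), featurize(worse)
        margin = sum(wi * (b - c) for wi, b, c in zip(w, fb, fw))
        grad = 1.0 / (1.0 + math.exp(margin))  # = sigmoid(-margin)
        # Push the score of the preferred trajectory above the dispreferred one.
        w = [wi + lr * grad * (b - c) for wi, b, c in zip(w, fb, fw)]
    return lambda traj: sum(wi * x for wi, x in zip(w, featurize(traj)))
```

The point of the sketch isn’t that preference learning like this is sufficient for safety; it’s that the inadequacy of the hand-coded version shows up almost immediately, well before anything dangerous is on the table.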
It’s still conceivable that, even given these considerations, people will accidentally develop AI systems that commit omnicide (or cause similarly grave harms). But the likelihood at least goes down. First, it needs to be the case that (a) training processes that use apparently promising alignment techniques will still converge on omnicidal systems. Second, it needs to be the case that (b) people won’t notice that these training processes have serious issues until they’ve actually made omnicidal AI systems.
I’m skeptical of both (a) and (b). My intuition, regarding (a), is that some method that involves learning human preferences would need to be really terrible to result in systems that are doing things on the order of mass murder, although some arguments related to mesa-optimization may push against this intuition.
Then my intuition, regarding (b), is that the techniques would likely display serious issues before anyone creates a system capable of omnicide. For example, if these techniques tend to induce systems to engage in deceptive behaviors, I would expect there to be some signs that this is an issue early on; I would expect some failed or non-catastrophic acts of deception to be observed first. However, again, my intuition is closely tied to my expectation that progress will be pretty continuous. A key thing to keep in mind about highly continuous scenarios is that there’s not just one single consequential ML training run, where the ML system might look benign at the start but turn around and take over the world at the end. We’re instead talking about countless training runs, used to develop a wide variety of different systems of intermediate generality and competency, deployed across a wide variety of domains, over a period of multiple years. We would have many more opportunities to notice issues with available techniques than we would in a “brain in a box” scenario. In a more discontinuous scenario, the risk would presumably be higher.
This seems to be a disagreement about “how hard is AI alignment?”.
This might just be a matter of semantics, but I don’t think “how hard is AI alignment?” is the main question I have in mind here. I’m mostly thinking about the question of whether we’ll unwittingly create existentially damaging systems, if we don’t work out alignment techniques first. For example, if we don’t know how to make benign house cleaners, city planners, or engineers by year X, will we unwittingly create omnicidal systems instead? Certainly, the harder it is to work out alignment techniques, the higher the risks become. But it’s possible for accident risk to be low even if alignment techniques are very hard to work out.
It seems that even in a relatively slow takeoff, you wouldn’t need that big of a discontinuity to result in a singleton AI scenario. If the first AGI that’s significantly more generally intelligent than a human is created in a world where lots of powerful narrow AIs exist, wouldn’t having a super smart thing at the center of control of a bunch of narrow AI tools plausibly be way more powerful than having human brains at the center of that control?
It seems plausible that in a “smooth” scenario the gap between the first group creating an AGI and the second group creating an equally powerful one could be months. Do you think a months-long discontinuity is not enough for an AGI to pull sufficiently ahead?
I would say that, in a scenario with relatively “smooth” progress, there’s not really a clean distinction between “narrow” AI systems and “general” AI systems; the line between “we have AGI” and “we don’t have AGI” is either a bit blurry or a bit arbitrarily drawn. Even if the management/control of large collections of AI systems is eventually automated, I would also expect this process of automation to unfold over time rather than happening in a single go.
In general, the smoother things are, the harder it is to tell a story where one group gets way out ahead of others, though I’m unsure just how “unsmooth” things need to be for this outcome to be plausible.
Even if multiple groups create AGIs within a short time, isn’t having a bunch of unaligned AGIs all trying to get power at the same time also an existential risk? It doesn’t seem clear that they’d automatically keep each other in check. One might simply be better at growing or better at sabotaging other AIs. Or if they reach a stalemate they might start cooperating with each other to achieve unaligned goals as a compromise.
I think that if there were multiple AGI or AGI-ish systems in the world, and most of them were badly misaligned (e.g. willing to cause human extinction for instrumental reasons), this would present an existential risk. I wouldn’t count on them balancing each other out, in the same way that endangered gorilla populations shouldn’t count on warring human communities to balance each other out.
I think the main benefits of smoothness have to do with risk awareness (e.g. by observing less catastrophic mishaps) and, especially, with opportunities for trial-and-error learning. At least when the concern is misalignment risk, I don’t think of the decentralization of power as a really major benefit in its own right: the systems in this decentralized world still mostly need to be safe.
My model is: if you have a central control unit (a human brain, or group of human brains) that is deciding how to use a bunch of narrow AIs, then if you replace that central control unit with one that is more intelligent / faster-acting, the whole system will be more effective.
The only way I can think of where that wouldn’t be true would be if the general AI required so many computational resources that the narrow AIs that were acting as tools of the AGI were crippled by lack of resources. Is that what you’re imagining?
I think it’s plausible that especially general systems would be especially useful for managing the development, deployment, and interaction of other AI systems. I’m not totally sure this is the case, though. For example, at least in principle, I can imagine an AI system that is good at managing the training of other AI systems—e.g. deciding how much compute to devote to different ongoing training processes—but otherwise can’t do much else.
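To make that concrete, a “training manager” in this narrow sense could be as simple as the following sketch (the allocation rule and all of the names are illustrative assumptions on my part, not a claim about how such systems are actually built):

```python
# Toy sketch of a narrow "training manager": it periodically reallocates a
# fixed compute budget across ongoing training runs based on their recent
# progress, and does nothing else.

def allocate_compute(recent_gain, total_gpu_hours, floor_share=0.05):
    """recent_gain: dict mapping run_id -> recent improvement in a validation metric."""
    n = len(recent_gain)
    floor = total_gpu_hours * floor_share      # keep every run alive
    remaining = total_gpu_hours - floor * n    # the part that gets steered
    gains = {r: max(g, 0.0) for r, g in recent_gain.items()}
    total = sum(gains.values())
    return {
        r: floor + (remaining * g / total if total > 0 else remaining / n)
        for r, g in gains.items()
    }

# Example with three hypothetical runs: the fast-improving one gets most of
# the steerable budget, but nothing is cut off entirely.
print(allocate_compute({"run_a": 0.8, "run_b": 0.1, "run_c": 0.0}, total_gpu_hours=1000))
```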