Here’s a fictional dialogue with a generic EA that I think can perhaps help explain some of my thoughts about AI risks compared to most EAs:
EA: “Future AIs could be unaligned with human values. If this happened, it would likely be catastrophic. If AIs are unaligned, they’ll hide their intentions until they’re in a position to strike in a violent coup, and then the world will end (for us at least).”
Me: “I agree that sounds like it would be very bad. But maybe let’s examine why this scenario seems plausible to you. What do you mean when you say AIs might be unaligned with human values?”
EA: “Their utility functions would not overlap with our utility functions.”
Me: “By that definition, humans are already unaligned with each other. Any given person has almost a completely non-overlapping utility function with a random stranger. People—through their actions rather than words—routinely value their own life and welfare thousands of times higher than that of strangers. Yet even with this misalignment, the world does not end in any strong sense. Nor does this fact automatically imply the world will end for a given group within humanity.”
EA: “Sure, but that’s because humans mostly all have similar intelligence and/or capability. Future AIs will be way smarter and more capable than humans.”
Me: “Why does that change anything?”
EA: “Because, unlike individual humans, misaligned AIs will be able to take over the world, in the sense of being able to exert vast amounts of hard power over the rest of the inhabitants on the planet. Currently, no human, or faction of humans, can kill everyone else. AIs will be different along this axis.”
Me: “That isn’t different from groups of humans. Various groups have ‘taken over the world’ in that sense. For example, adults currently control the world, and took over from the previous adults. More arguably, smart people currently control the world. In these cases, both groups have considerable hard power relative to other people.
Consider a random retirement home. Compared to the rest of the world, it has basically no power. If the rest of humanity decided to destroy or loot the retirement home, there would be no serious chance the retirement home could stop it. And yet things like that don’t happen very often even though humans have mostly non-overlapping utility functions and considerable hard power.”
EA: “Sure, but that’s because humans respect long-standing norms and laws, and people could never coordinate to do something like that anyway, nor would they want to. AIs won’t necessarily be similar. If unaligned AIs take over the world, they will likely kill us and then replace us, since there’s no reason for them to leave us alive.”
Me: “That seems dubious. Why won’t unaligned AIs respect moral norms? What benefit would they get from killing us? Why are we assuming they all coordinate as a unified group, leaving us out of their coalition?
I’m not convinced, and I think you’re pretty overconfident about these things. But for the sake of argument, let’s just assume for now that you’re right about this particular point. In other words, let’s assume that when AIs take over the world in the future, they’ll kill us and take our place. I certainly agree it would be bad if we all died from AI. But here’s a more fundamental objection: how exactly is that morally different from the fact that young humans already ‘take over’ the world from older people during every generation, letting the old people die, and then take their place?”
EA: “In the case of generational replacement, humans are replaced with other humans. Whereas in this case, humans are replaced by AIs.”
Me: “I’m asking why it matters morally. Why should I care if a human takes my place after I die compared to an AI?”
EA: “This is just a bedrock principle to me. I care more about humans than I care about AIs.”
Me: “Fair enough, but I don’t personally intrinsically care more about humans than AIs. I think what matters is plausibly something like sentience, and maybe sapience, and so I don’t have an intrinsic preference for ordinary human generational replacement compared to AIs replacing us.”
EA: “Well, if you care about sentience, unaligned AIs aren’t going to care about that. They’re going to care about other things, like paperclips.”
Me: “Why do you think that?”
EA: “Because (channeling Yudkowsky) sentience is a very narrow target. It’s extremely hard to get an AI to care about something like that. Almost all goals are like paperclip maximization from our perspective, rather than things like welfare maximization.”
Me: “Again, that seems dubious. AIs will be trained on our data, and share similar concepts with us. They’ll be shaped by rewards from human evaluators, and be consciously designed with benevolence in mind. For these reasons, it seems pretty plausible that AIs will come to value natural categories of things, including potentially sentience. I don’t think it makes sense to model their values as being plucked from a uniform distribution over all possible physical configurations.”
EA: “Maybe, but this barely changes the argument. AI values don’t need to be drawn from a uniform distribution over all possible physical configurations to be very different from our own. The point is that they’ll learn values in a way that is completely distinct from how we form our values.”
Me: “That doesn’t seem true. Humans seem to get their moral values from cultural learning and social emulation, which seems broadly similar to the way that AIs will get their moral values. Yes, there are innate human preferences that AIs aren’t likely to share with us—but they are mostly things like preferring being in a room that’s 20°C rather than 10°C. Our moral values—such as the ideology we subscribe to—are more of a product of our cultural environment than anything else. I don’t see why AIs will be very different.”
EA: “That’s not exactly right. Our moral values are also the result of reflection and intrinsic preferences for certain things. For example, humans have empathy, whereas AIs won’t necessarily have that.”
Me: “I agree AIs might not share some fundamental traits with humans, like the capacity for empathy. But ultimately, so what? Is that really the type of thing that makes you more optimistic about a future with humans than with AI? There exist some people who say they don’t feel empathy for others, and yet I would still feel comfortable giving them more power despite that. On the other hand, some people have told me that they feel empathy, but their compassion seems to turn off completely when they watch videos of animals in factory farms. These things seem flimsy as a reason to think that a human-dominated future will have much more value than an AI-dominated future.”
EA: “OK but I think you’re neglecting something pretty obvious here. Even if you think it wouldn’t be morally worse for AIs to replace us compared to young humans replacing us, the latter won’t happen for another several decades, whereas AI could kill everyone and replace us within 10 years. That fact is selfishly concerning—even if we put aside the broader moral argument. And it probably merits pausing AI for at least a decade until we are sure that it won’t kill us.”
Me: “I definitely agree that these things are concerning, and that we should invest heavily into making sure AI goes well. However, I’m not sure why the fact that AI could arrive soon makes much of a difference to you here.
AI could also make our lives much better. It could help us develop cures for aging, and dramatically lift our well-being. If the alternative is certain death, only later in time, then the gamble you’re proposing doesn’t seem clear to me from a selfish perspective.
Whether it’s selfishly worth it to delay AI depends quite a lot on how much safety benefit we’re getting from the delay. Actuarial life tables reveal that we accumulate a lot of risk just by living our lives normally. For example, a 30-year-old male in the United States has nearly a 3% chance of dying before they turn 40. I’m not fully convinced that pausing AI for a decade reduces the chance of catastrophe by more than that. And of course, for older people, the gamble is worse still.”
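As an aside, here is a minimal sketch of the cumulative-mortality arithmetic behind that figure. The annual death probabilities are rough illustrative values I’ve assumed for ages 30 through 39, not numbers taken from an actual life table:

```python
# Minimal sketch of the cumulative-mortality arithmetic behind the ~3% figure.
# The annual death probabilities below are illustrative approximations for a
# US male at ages 30 through 39, NOT values copied from an actual life table.
annual_mortality = [0.0021, 0.0022, 0.0023, 0.0024, 0.0025,
                    0.0026, 0.0027, 0.0029, 0.0031, 0.0033]

survival = 1.0
for q in annual_mortality:
    survival *= 1.0 - q  # probability of surviving each successive year

print(f"Chance of dying before 40: {1.0 - survival:.2%}")  # ~2.6% with these rates
```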
I have so many axes of disagreement that it’s hard to figure out which one is most relevant. I guess let’s go one by one.
Me: “What do you mean when you say AIs might be unaligned with human values?”
I would say that pretty much every agent other than me (and probably me at different times and in different moods) is “misaligned” with me, in the sense that I would not like a world where they get to dictate everything that happens without consulting me in any way.
This is a quibble, because in fact I think that if many people were put in such a position, they would try asking others what they want and try to make it happen.
Consider a random retirement home. Compared to the rest of the world, it has basically no power. If the rest of humanity decided to destroy or loot the retirement home, there would be virtually no serious opposition.
This hypothetical assumes too much, because people outside care about the lovely people in the retirement home, and they represent their interests. The question is, will some future AIs with relevance and power care for humans, as humans become obsolete?
I think this is relevant, because in the current world there is a lot of variety. There are people who care about retirement homes and people who don’t. The people who care about retirement homes work hard to make sure retirement homes are well cared for.
But we could imagine a future world where the AI that pulls ahead of the pack is indifferent to humans, while the AI that cares about humans falls behind; perhaps this is because caring about humans puts you at a disadvantage (if you are not willing to squish humans in your territory, your space to build servers gets reduced, or something; I think this is unlikely but possible) and/or because there is a winner-take-all mechanism and the first AI systems that get there coincidentally don’t care about humans (unlikely but possible). Then we would be without representation and in possibly quite a sucky situation.
I’m asking why it matters morally. Why should I care if a human takes my place after I die compared to an AI?
Stop that train: I do not want to be replaced by either humans or AIs. I want to be in the future and have relevance, or at least be empowered through agents that represent my interests.
I also want my fellow humans to be there, if they want to, and have their own interests be represented.
Humans seem to get their moral values from cultural learning and emulation, which seems broadly similar to the way that AIs will get their moral values.
I don’t think AIs learn in a similar way to humans, and future AIs might learn in an even more dissimilar way. The argument I would find more persuasive is pointing out that humans learn in different ways from one another, from very different data and situations, and yet end up with similar values that include caring for one another. That I find suggestive, though it’s hard to be confident.
EA: “Their utility functions would not overlap with our utility functions.”
Me: “By that definition, humans are already unaligned with each other. Any given person has almost a completely non-overlapping utility function with a random stranger. People—through their actions rather than words—routinely value their own life and welfare thousands of times higher than that of strangers. Yet even with this misalignment, the world does not end in any strong sense.”
EA: “Sure, but that’s because humans are all roughly the same intelligence and/or capability. Future AIs will be way smarter and more capable than humans.”
Just for the record, this is when I got off the train for this dialogue. I don’t think humans are misaligned with each other in the relevant ways, and if I could press a button to have the universe be optimized by a random human’s coherent extrapolated volition, then that seems great and thousands of times better than what I expect to happen with AI-descendants. I believe this for a mixture of game-theoretic reasons and genuinely thinking that other humans’ values really do capture most of what I care about.
In this part of the dialogue, when I talk about a utility function of a human, I mean roughly their revealed preferences, rather than their coherent extrapolated volition (which I also think is underspecified). This is important because it is our revealed preferences that better predict our actual behavior, and the point I’m making is simply that behavioral misalignment in this sense is common among humans. This fact also does not automatically imply the world will end for a given group within humanity.
This is missing a very important point, which is that I think humans have morally relevant experience and I’m not confident that misaligned AIs would. When the next generation replaces the current one, this is somewhat ok because those new humans can experience joy, wonder, adventure, etc. My best guess is that AIs that take over and replace humans would not have any morally relevant experience, and would basically just leave the universe morally empty. (Note that this might be an ok outcome if by default you expect things to be net negative.)
I also think that there is way more overlap in the “utility functions” between humans than between humans and misaligned AIs. Most humans feel empathy and don’t want to cause others harm. I think humans would generally accept small costs to improve the lives of others, and a large part of why people don’t do this is that they have cognitive biases or aren’t thinking clearly. This isn’t to say that any random human would reflectively become a perfectly selfless total utilitarian, but rather that most humans do care about the wellbeing of other humans. By default, I don’t think misaligned AIs will really care about the wellbeing of humans at all.
My best guess is that AIs that take over and replace humans would not have any morally relevant experience, and basically just leave the universe morally empty.
I don’t think that’s particularly likely, but I can understand if you think this is an important crux.
For what it’s worth, I don’t think it matters as much whether the AIs themselves are sentient, but rather whether they care about sentience. For example, from the perspective of sentience, humans weren’t necessarily a great addition to the world, because of their contribution to suffering in animal agriculture (although I’m not giving a confident take here).
Even if AIs are not sentient, they’ll still be responsible for managing the world and creating structures in the universe. When this happens, there are a lot of ways for sentience to come about, and I care more about the lower-level sentience that the AI manages than about the actual AIs at the top, who may or may not be sentient.
let’s assume that when AIs take over the world in the future, they’ll kill us and take our place. Here’s a more fundamental objection: how exactly is that morally different from the fact that young humans already ‘take over’ the world from older people during every generation
I think this is a big moral difference: we do not actively kill the older humans so that we can take over. We care about older people, and societies that are rich enough spend some resources to keep older people alive longer.
The entirety of humanity being killed and replaced by the kind of AI that places so little moral value on us humans would be catastrophically bad, compared to things that are currently occurring.