Two sources of human misalignment that may resist a long reflection: malevolence and ideological fanaticism
(Alternative title: Some bad human values may corrupt a long reflection[1])
The values of some humans, even if idealized (e.g., during some form of long reflection), may be incompatible with an excellent future. Thus, solving AI alignment will not necessarily lead to utopia.
Others have raised similar concerns before.[2] Joe Carlsmith puts it especially well in the post “An even deeper atheism”:
“And now, of course, the question arises: how different, exactly, are human hearts from each other? And in particular: are they sufficiently different that, when they foom, and even “on reflection,” they don’t end up pointing in exactly the same direction? After all, Yudkowsky said, above, that in order for the future to be non-trivially “of worth,” human hearts have to be in the driver’s seat. But even setting aside the insult, here, to the dolphins, bonobos, nearest grabby aliens, and so on – still, that’s only to specify a necessary condition. Presumably, though, it’s not a sufficient condition? Presumably some human hearts would be bad drivers, too? Like, I dunno, Stalin?”
What makes human hearts bad?
What, exactly, makes some human hearts bad drivers? If we better understood what makes hearts go bad, perhaps we could figure out how to make bad hearts good, or at least learn how to prevent hearts from going bad. It would also allow us to better spot potentially bad hearts and coordinate our efforts to prevent them from taking the driver’s seat.
As of now, I’m most worried about malevolent personality traits and fanatical ideologies.[3]
Malevolence: dangerous personality traits
Some human hearts may be corrupted due to elevated malevolent traits like psychopathy, sadism, narcissism, Machiavellianism, or spitefulness.
Ideological fanaticism: dangerous belief systems
There are many suitable definitions of “ideological fanaticism”. Whatever definition we use, it should describe ideologies that have caused immense harm historically, such as fascism (Germany under Hitler, Italy under Mussolini), (extreme) communism (the Soviet Union under Stalin, China under Mao), religious fundamentalism (ISIS, the Inquisition), and most cults.
See this footnote[4] for a preliminary list of defining characteristics.
Malevolence and fanaticism seem especially dangerous
Of course, there are other factors that could corrupt our hearts or driving ability. For example, cognitive biases, limited cognitive ability, philosophical confusions, or plain old selfishness.[5] I’m most concerned about malevolence and ideological fanaticism for two reasons.
Deliberately resisting reflection and idealization
First, malevolence—if reflectively endorsed[6]—and fanatical ideologies deliberately resist being changed and would thus plausibly resist idealization even during a long reflection. The most central characteristic of fanatical ideologies is arguably that they explicitly forbid criticism, questioning, and belief change and view doubters and disagreement as evil.
Putting positive value on creating harm
Second, malevolence and ideological fanaticism would not only result in the future not being as good as it possibly could be—they might actively steer the future in bad directions and, for instance, result in astronomical amounts of suffering.
The preferences of malevolent humans (e.g., sadists) may be such that they intrinsically enjoy inflicting suffering on others. Similarly, many fanatical ideologies sympathize with excessive retributivism and often demonize the outgroup. Enabled by future technology, preferences for inflicting suffering on the outgroup may result in enormous disvalue—cf. concentration camps, the Gulag, or hell[7].
In the future, I hope to write more about all of this, especially long-term risks from ideological fanaticism.
Thanks to Pablo and Ruairi for comments and valuable discussions.
“Human misalignment” is arguably a confusing (and perhaps confused) term. But it sounds more sophisticated than “bad human values”.
For example, Matthew Barnett in “AI alignment shouldn’t be conflated with AI moral achievement”, Geoffrey Miller in “AI alignment with humans… but with which humans?”, lc in “Aligned AI is dual use technology”. Pablo Stafforini has called this the “third alignment problem”. And of course, Yudkowsky’s concept of CEV is meant to address these issues.
These factors may not be clearly separable. Some humans may be more attracted to fanatical ideologies due to their psychological traits, and malevolent humans often lead fanatical ideologies. Also, believing and following a fanatical ideology may not be good for your heart.
Below are some typical characteristics (I’m no expert in this area):
Unquestioning belief, absolute certainty and rigid adherence. The principles and beliefs of the ideology are seen as absolute truth and questioning or critical examination is forbidden.
Inflexibility and refusal to compromise.
Intolerance and hostility towards dissent. Anyone who disagrees or challenges the ideology is seen as evil; as enemies, traitors, or heretics.
Ingroup superiority and outgroup demonization. The in-group is viewed as superior, chosen, or enlightened. The out-group is often demonized and blamed for the world’s problems.
Authoritarianism. Fanatical ideologies often endorse (or even require) a strong, centralized authority to enforce their principles and suppress opposition, potentially culminating in dictatorship or totalitarianism.
Militancy and willingness to use violence.
Utopian vision. Many fanatical ideologies are driven by a vision of a perfect future or afterlife, which can only be achieved through strict adherence to the ideology. This utopian vision often justifies extreme measures in the present.
Use of propaganda and censorship.
For example, Barnett argues that future technology will primarily be used to satisfy economic consumption (aka selfish desires). That seems plausible to me; however, I’m not that concerned about this causing huge amounts of future suffering (at least compared to other s-risks). It seems to me that most humans place non-trivial value on the welfare of (neutral) others such as animals. Right now, this preference (for most people) isn’t strong enough to outweigh the selfish benefits of eating meat. However, I’m relatively hopeful that future technology would make such tradeoffs much less costly.
Some people (how many?) with elevated malevolent traits don’t reflectively endorse their malevolent urges and would change them if they could. However, some of them do reflectively endorse their malevolent preferences and view empathy as weakness.
Some quotes from famous Christian theologians:
Thomas Aquinas: “the blessed will rejoice in the punishment of the wicked.” “In order that the happiness of the saints may be more delightful to them and that they may render more copious thanks to God for it, they are allowed to see perfectly the sufferings of the damned”.
Samuel Hopkins: “Should the fire of this eternal punishment cease, it would in a great measure obscure the light of heaven, and put an end to a great part of the happiness and glory of the blessed.”
Jonathan Edwards: “The sight of hell torments will exalt the happiness of the saints forever.”
Existential risks from within?
(Unimportant discussion of probably useless and confused terminology.)
I sometimes use terms like “inner existential risks” to refer to risk factors like malevolence and fanaticism. Inner existential risks primarily arise from “within the human heart”—that is, they are primarily related to the values, goals and/or beliefs of (some) humans.
My sense is that most x-risk discourse focuses on outer existential risks, that is, x-risks which primarily arise from outside the human mind. These could be physical or natural processes (asteroids, lethal pathogens) or technological processes that originated in the human mind but are now outside of human control (e.g., AI, nuclear weapons, engineered pandemics).
Of course, most people already believe that the most worrisome existential risks are anthropogenic, that is, caused by humans. One could argue that, say, AI and engineered pandemics are actually inner existential risks because they arose from within the human mind. I agree that the distinction between inner and outer existential risks is not super clear. Still, it seems to me that the distinction captures something vaguely real and may serve as some kind of intuition pump.
Then there is the related issue of more external or structural risk factors, like political or economic systems. These are systems invented by human minds which, in turn, shape human minds and values. I will conveniently ignore this further complication.
Other potential terms for inner existential risks could be intraanthropic, idioanthropic, or psychogenic (existential) risks.
I just realized that in this (old) 80k podcast episode[1], Holden makes similar points and argues that aligned AI could be bad.
My sense is that Holden alludes to both malevolence (“really bad values, [...] we shouldn’t assume that person is going to end up being nice”) and ideological fanaticism (“create minds that [...] stick to those beliefs and try to shape the world around those beliefs”, [...] “This is the religion I follow. This is what I believe in. [...] And I am creating an AI to help me promote that religion, not to help me question it or revise it or make it better.”).
Longer quotes below (emphasis added):
Holden: “The other part — if we do align the AI, we’re fine — I disagree with much more strongly. [...] if you just assume that you have a world of very capable AIs, that are doing exactly what humans want them to do, that’s very scary. [...]
Certainly, there’s the fact that because of the speed at which things move, you could end up with whoever kind of leads the way on AI, or is least cautious, having a lot of power — and that could be someone really bad. And I don’t think we should assume that just because that if you had some head of state that has really bad values, I don’t think we should assume that that person is going to end up being nice after they become wealthy, or powerful, or transhuman, or mind uploaded, or whatever — I don’t think there’s really any reason to think we should assume that.
And then I think there’s just a bunch of other things that, if things are moving fast, we could end up in a really bad state. Like, are we going to come up with decent frameworks for making sure that the digital minds are not mistreated? Are we going to come up with decent frameworks for how to ensure that as we get the ability to create whatever minds we want, we’re using that to create minds that help us seek the truth, instead of create minds that have whatever beliefs we want them to have, stick to those beliefs and try to shape the world around those beliefs? I think Carl Shulman put it as, “Are we going to have AI that makes us wiser or more powerfully insane?”
[...] I think even if we threw out the misalignment problem, we’d have a lot of work to do — and I think a lot of these issues are actually not getting enough attention.”
Rob Wiblin: Yeah. I think something that might be going on there is a bit of equivocation in the word “alignment.” You can imagine some people might mean by “creating an aligned AI,” it’s like an AI that goes and does what you tell it to — like a good employee or something. Whereas other people mean that it’s following the correct ideal values and behaviours, and is going to work to generate the best outcome. And these are really quite separate things, very far apart.
Holden Karnofsky: Yeah. Well, the second one, I just don’t even know if that’s a thing. I don’t even really know what it’s supposed to do. I mean, there’s something a little bit in between, which is like, you can have an AI that you ask it to do something, and it does what you would have told it to do if you had been more informed, and if you knew everything it knows. That’s the central idea of alignment that I tend to think of, but I think that still has all the problems I’m talking about. Just some humans seriously do intend to do things that are really nasty, and seriously do not intend — in any way, even if they knew more — to make the world as nice as we would like it to be.
And some humans really do intend and really do mean and really will want to say, you know, “Right now, I have these values” — let’s say, “This is the religion I follow. This is what I believe in. This is what I care about. And I am creating an AI to help me promote that religion, not to help me question it or revise it or make it better.” So yeah, I think that middle one does not make it safe. There might be some extreme versions, like, an AI that just figures out what’s objectively best for the world and does that or something. I’m just like, I don’t know why we would think that would even be a thing to aim for. That’s not the alignment problem that I’m interested in having solved.
I’m one of those bad EAs who don’t listen to all 80k episodes as soon as they come out.
Barnett argues that future technology will primarily be used to satisfy economic consumption (aka selfish desires). That seems plausible to me; however, I’m not that concerned about this causing huge amounts of future suffering (at least compared to other s-risks). It seems to me that most humans place non-trivial value on the welfare of (neutral) others such as animals. Right now, this preference (for most people) isn’t strong enough to outweigh the selfish benefits of eating meat. However, I’m relatively hopeful that future technology would make such tradeoffs much less costly.
At the same time as technological progress makes it less selfishly costly to be kind to animals, it could become more selfishly enticing to commit other moral tragedies. For example, it could hypothetically turn out, just as a brute empirical fact, that the most effective way of aligning AIs is to treat them terribly in some way, e.g. by brainwashing them or subjecting them to painful stimuli.
More generally, technological progress doesn’t seem to asymmetrically make people more moral. Factory farming, as a chief example, allowed people to satisfy their desire for meat more cost-effectively, but at a larger moral cost compared to what existed previously. Even if factory farming is eventually replaced with something humane, there doesn’t seem to be an obvious general trend here.
The argument you allude to that I find most plausible here is the idea that incidental s-risks as a byproduct of economic activity might not be as bad as some other forms of s-risks. But at the very least, incidental s-risks seem plausibly quite bad in expectation regardless.
For example, it could hypothetically turn out, just as a brute empirical fact, that the most effective way of aligning AIs is to treat them terribly in some way, e.g. by brainwashing them or subjecting them to painful stimuli.
Yes, agree. (For this and other reasons, I’m supportive of projects like NYU MEP.)
I also agree that there are no strong reasons to think that technological progress improves people’s morality.
As you write, my main reason for worrying more about agential s-risks is that the greater the technological power of agents, the more their intrinsic preferences matter for how the universe will look. To put it differently, actors whose terminal goals put some positive value on suffering (e.g., due to sadism, retributivism, or other weird fanatical beliefs) would deliberately aim to arrange matter in such a way that it contains more suffering—this seems extremely worrisome if they have access to advanced technology.
Altruists would also have a much harder time trading with such actors, whereas purely selfish actors (who don’t put positive value on suffering) could plausibly engage in mutually beneficial trades (e.g., they use (slightly) less efficient AI training/alignment methods which contain much less suffering, and altruists give them some of their resources in return).
But at the very least, incidental s-risks seem plausibly quite bad in expectation regardless.
Yeah, despite what I have written above, I probably worry more about incidental s-risks than the average s-risk reducer.