Leopold—thanks for a clear, vivid, candid, and galvanizing post. I agree with about 80% of it.
However, I don’t agree with your central premise that alignment is solvable. We want it to be solvable. We believe that we need it to be solvable (or else, God forbid, we might have to actually stop AI development for a few decades or centuries).
But that doesn’t mean it is solvable. And we have, in my opinion, some pretty compelling reasons to think that it is not solvable even in principle: (1) given the diversity, complexity, and ideological nature of many human values (which I’ve written about in other EA Forum posts, and elsewhere), (2) given the deep game-theoretic conflicts between human individuals, groups, companies, and nation-states (which cannot be waved away by invoking Coherent Extrapolated Volition, or ‘dontkilleveryoneism’, or any other notion that sweeps people’s profoundly divergent interests under the carpet), and (3) given that humans are not the only sentient stakeholder species that AI would need to be aligned with (advanced AI will have implications for every one of the other 65,000 vertebrate species on Earth, and most of the 1,000,000+ invertebrate species, one way or another).
Human individuals aren’t aligned with each other. Companies aren’t aligned with each other. Nation-states aren’t aligned with each other. Other animal species aren’t aligned with humans, or with each other. There is no reason to expect that any AI systems could be ‘aligned’ with the totality of other sentient life on Earth. Our Bayesian prior, based on the simple fact that different sentient beings have different interests, values, goals, and preferences, must be that AI alignment with ‘humanity in general’, or ‘sentient life in general’, is simply not possible. Sad, but true.
I worry that ‘AI alignment’ as a concept, or narrative, or aspiration, is just promising enough that it encourages the AI industry to charge full steam ahead (in hopes that alignment will be ‘solved’ before AI advances to much more dangerous capabilities), but it is not delivering nearly enough workable solutions to make their reckless accelerationism safe. We are getting the worst of both worlds—a credible illusion of a path towards safety, without any actual increase in safety.
In other words, the assumption that ‘alignment is solvable’ might be a very dangerous X-risk amplifier, in its own right. It emboldens the AI industry to accelerate. It gives EAs (probably) false hope that some clever technical solution can make humans all aligned with each other, and make machine intelligences aligned with organic intelligences. It gives ordinary citizens, politicians, regulators, and journalists the impression that some very smart people are working very hard on making AI safe, in ways that will probably work. It may be leading China to assume that some clever Americans are already handling all those thorny X-risk issues, such that China doesn’t really need to duplicate those ongoing AI safety efforts, and will be able to just copy our alignment solutions once we get them.
If we take seriously the possibility that alignment might not be solvable, we need to rethink our whole EA strategy for reducing AI X-risk. This might entail EAs putting a much stronger emphasis on slowing or stopping further AI development, at least for a while. We are continually told that ‘AI is inevitable’, ‘the genie is out of the bottle’, ‘regulation won’t work’, etc. I think too many of us buy into the over-pessimistic view that there’s absolutely nothing we can do to stop AI development, while also buying into the over-optimistic view that alignment is possible—if we just recruit more talent, work a little more, get a few more grants, think really hard, etc.
I think we should reverse these optimisms and pessimisms. We need to rediscover some optimism that the 8 billion people on Earth can pause, slow, handicap, or stop AI development by the 100,000 or so AI researchers, devs, and entrepreneurs that are driving us straight into a Great Filter. But we need to rediscover some pessimism about the concept of ‘AI alignment’ itself.
In my view, the burden of proof should be on those who think that ‘AI alignment with human values in general’ is a solvable problem. I have seen no coherent argument that it is solvable. I’ve just seen people desperate to believe that it is solvable. But that’s mostly because the alternative seems so alarming, i.e., the idea that (1) the AI industry is increasingly imposing existential risks on us all, (2) it has a lot of money, power, talent, influence, and hubris, (3) it will not slow down unless we make it slow down, and (4) slowing it down will require EAs to shift to a whole different set of strategies, tactics, priorities, and mind-sets than we had been developing within the ‘alignment’ paradigm.
I agree that the very strong sort of alignment you describe—with the Coherent Extrapolated Volition of humanity, or the collective interest of all sentient beings, or The Form of The Good—is probably impossible and perhaps ill-posed. Insofar as we need this sort of aligned AI for things to go as well as they possibly could, they won’t.
But I don’t see why that’s the only acceptable target. Aligning a superintelligence with the will of basically any psychologically normal human being (narrower than any realistic target except perhaps a profit-maximizer—in which case yeah, we’re doomed) would still be an ok outcome for humans: it certainly doesn’t end in paperclips. And alignment with someone even slightly inclined towards impartial benevolence probably goes much better than the status quo, especially for the extremely poor.
(Animals are at much more risk here, but their current situation is also much worse: I’m extremely uncertain how a far richer world would treat factory farming.)
I think humans may indeed find ways to scale up their control over successive generations of AIs for a while, and successive generations of AIs may be able to exert some control over their successors, and so on. However, I don’t see how at the end of a long chain of successive generations we could be left with anything that cares much about our little primate goals. Even if individual agents within that system still cared somewhat about humans, I doubt the collective behavior of the society of AIs overall would still care, rather than being driven by its own competitive pressures into weird directions.
An analogy I often give is to consider our fish ancestors hundreds of millions of years ago. Through evolution, they produced somewhat smarter successors, who produced somewhat smarter successors, and so on. At each point along that chain, the successors weren’t that different from the previous generation; each generation might have said that they successfully aligned their successors with their goals, for the most part. But over all those generations, we now care about things dramatically different from what our fish ancestors did (e.g., worshipping Jesus, inclusion of trans athletes, preventing children from hearing certain four-letter words, increasing the power and prestige of one’s nation). In the case of AI successors, I expect the divergence may be even more dramatic, because AIs aren’t constrained by biology in the way that both fish and humans are. (OTOH, there might be less divergence if people engineer ways to reduce goal drift and if people can act collectively well enough to implement them. Even if the former is technically possible, I’m skeptical that the latter is socially possible in the real world.)
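To put a rough number on the compounding-drift intuition in the fish analogy, here is a minimal toy sketch (my own illustration, not something from the comments above): each generation’s ‘values’ are a vector that stays within a small tolerance of its predecessor’s, yet the cumulative distance from the ancestor ends up far larger than that tolerance.

```python
import numpy as np

rng = np.random.default_rng(0)

N_GENERATIONS = 100_000  # hypothetical number of successive generations
DIM = 10                 # dimensions of a toy "value vector"
EPSILON = 0.01           # max change allowed between one generation and the next

values = np.zeros(DIM)   # the ancestor's values
max_step = 0.0

for _ in range(N_GENERATIONS):
    # Each successor differs from its predecessor by at most EPSILON
    # (Euclidean norm), so every individual handoff looks "aligned".
    step = rng.normal(size=DIM)
    step *= EPSILON * rng.random() / np.linalg.norm(step)
    values += step
    max_step = max(max_step, np.linalg.norm(step))

print(f"largest single-generation change: {max_step:.4f}")
print(f"total drift from the ancestor's values: {np.linalg.norm(values):.2f}")
# Each step is at most 0.01, but the total drift comes out around 1-2,
# i.e. a couple of hundred times larger than any single generation's change.
```

This is an unbiased random walk; if competitive pressures push the drift in a consistent direction, as suggested above, the divergence would be faster still.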
Some transhumanists are ok with dramatic value drift over time, as long as there’s a somewhat continuous chain from ourselves to the very weird agents who will inhabit our region of the cosmos in a million years. But I don’t find it very plausible that in a million years, the powerful agents in control of the Milky Way will care that much about what certain humans around the beginning of the third millennium CE valued. Technical alignment work might help make the path from us to them more continuous, but I’m doubtful it will avert human extinction in the long run.
Hi Brian, thanks for this reminder about the longtermist perspective on humanity’s future. I agree that in a million years, whatever sentient beings are around may have little interest in or respect for the values that humans happen to have now.
However, one lesson from evolution is that most mutations are harmful, most populations trying to spread into new habitats fail, and most new species go extinct within about a million years. There’s huge survivorship bias in our understanding of natural history.
I worry that this survivorship bias leads us to radically over-estimate the likely adaptiveness and longevity of any new digital sentiences and any new transhumanist innovations. New autonomous advanced AIs are likely to be extremely fragile, just because most new complex systems that haven’t been battle-tested by evolution are extremely fragile.
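To illustrate the survivorship-bias point with a toy model (the parameters below are made up for illustration, not real paleontological estimates): if most new lineages fail almost immediately and we only ever study the long-lived survivors, the survivors’ longevity badly overstates how robust a typical new, untested system is.

```python
import random

random.seed(0)

N_LINEAGES = 100_000       # hypothetical new lineages / innovations appearing
P_EARLY_FAILURE = 0.95     # assumed share that fail almost immediately
MEAN_LIFE_IF_VIABLE = 1.0  # assumed mean longevity (million years) for viable ones

lifetimes = []
for _ in range(N_LINEAGES):
    if random.random() < P_EARLY_FAILURE:
        lifetimes.append(0.01)  # dies out within ~10,000 years
    else:
        lifetimes.append(random.expovariate(1 / MEAN_LIFE_IF_VIABLE))

# The "fossil record" we reason from: only lineages that lasted at least 1 Myr.
survivors = [t for t in lifetimes if t >= 1.0]

print(f"mean longevity of a typical new lineage: {sum(lifetimes) / len(lifetimes):.2f} Myr")
print(f"mean longevity among observed survivors: {sum(survivors) / len(survivors):.2f} Myr")
# The second number is roughly 30x the first: judging brand-new systems by the
# survivors we happen to see wildly overestimates their expected robustness.
```

None of this changes the qualitative argument; it just shows how strongly conditioning on survival can distort our sense of how robust untested systems tend to be.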
For this reason, I think we would be foolish to rush into any radical transhumanism, or any more advanced AI systems, until we have explored human potential further, and until we are successfully, resiliently multi-planetary, if not multi-stellar. Once we have a foothold in the stars, and humanity has reached some kind of asymptote in what un-augmented humanity can accomplish, then it might make sense to think about the ‘next phase of evolution’. Until then, any attempt to push sentient evolution faster will probably result in calamity.
Thanks. :) I’m personally not one of those transhumanists who welcome the transition to weird posthuman values. I would prefer for space not to be colonized at all in order to avoid astronomically increasing the amount of sentience (and therefore the amount of expected suffering) in our region of the cosmos. I think there could be some common ground, at least in the short run, between suffering-focused people who don’t want space colonized in general and existential-risk people who want to radically slow down the pace of AI progress. If it were possible, the Butlerian Jihad solution could be pretty good both for the AI doomers and the negative utilitarians. Unfortunately, it’s probably not politically possible (even domestically, much less internationally), and I’m unsure whether half measures toward it are net good or bad. For example, maybe slowing AI progress in the US would help China catch up, making a competitive race between the two countries more likely, thereby increasing the chance of catastrophic Cold War-style conflict.
Interesting point about most mutants not being very successful. That’s a main reason I tend to imagine that the first AGIs who try to overpower humans, if any, would plausibly fail.
I think there’s some difference in the case of intelligence at the level of humans and above, versus other animals, in adaptability to new circumstances, because human-level intelligence can figure out problems by reason and doesn’t have to wait for evolution to brute-force its way into genetically based solutions. Humans have changed their environments dramatically from the ancestral ones without killing themselves (yet), based on this ability to be flexible using reason. Even the smarter non-human animals display some amount of this ability (cf. the Baldwin effect). (A web search shows that you’ve written about the Baldwin effect and how being smarter leads to faster evolution, so feel free to correct/critique me.)
If you mean that posthumans are likely to be fragile at the collective level, because their aggregate dynamics might result in their own extinction, then that’s plausible, and it may happen to humans themselves within a century or two if current trends continue.
Brian—that all seems reasonable. Much to think about!
Yes, I think we can go further and say that alignment of a superintelligent AGI even with a single individual human may well be impossible. Is such a thing mathematically verifiable as completely watertight, given the orthogonality thesis, basic AI drives, and mesa-optimisation? And if it’s not watertight, then all the doom flows through the gaps of imperfect, supposedly “good enough” alignment. We need a global moratorium on AGI development. This year.
> ...we have, in my opinion, some pretty compelling reasons to think that it is not solvable even in principle: (1) given the diversity, complexity, and ideological nature of many human values… There is no reason to expect that any AI systems could be ‘aligned’ with the totality of other sentient life on Earth.
One way to decompose the alignment question is into two parts:
1. Can we aim ASI at all? (e.g. Nate Soares’ What I mean by “alignment is in large part about making cognition aimable at all”)
2. Can we align it with human values? (the blockquote above is an example of this)
Folks at e.g. MIRI think (1) is the hard problem and (2) isn’t as hard; folks like you think the opposite. Then you all talk past each other. (“You” isn’t aimed at literally you in particular; I’m summarizing what I’ve seen.) I don’t have a clear stance on which is harder; I just wish folks would engage with the best arguments from each side.
Mo—you might be right about what MIRI thinks will be hard. I’m not sure; I often find it difficult to understand what they write about these issues, since it’s very abstract and not very grounded in the specific goals and values that AIs might need to implement. I do think the MIRI-type approach radically underestimates the difficulty of your point number 2.
On the other hand, I’m not at all confident that point number 1 will be easy. My hunch is that both 1 and 2 will prove surprisingly hard. Which is a good reason to pause AI research until we make a lot more progress on both issues. (And if we don’t make dramatic progress on both issues, the ‘pause’ should remain in place as long as it takes. Which could be decades or centuries.)
I’ve been thinking about this very thing for quite some time, and have been thinking up some concrete interventions to help the ML community and industry grasp this. DM me if you’re interested in discussing further.
Singular intelligence isn’t alignable; superintelligence, in the sense of something generally around 3x smarter than all of humanity, very likely can be aligned well and thoroughly. The Great Filter is only a theory, and honestly quite a weak one, given that our ability to accurately assess planets outside our solar system for life is basically zero. As a rule, I can’t take “projections” about what ASI will do seriously from anyone who lacks a scientifically complete and measurable definition of generalized intelligence.
Here’s our scientific definition:
We define generalization, in the context of intelligence, as the ability to generate learned differentiation of subsystem components, then manipulate them and build relationships toward a greater systems-level understanding of the universal construct that governs reality. This would not be possible if physics weren’t universal, since no consistent feedback could be derived. Zeusfyi, Inc is the only institution that has scientifically defined intelligence generalization. The purest test of generalization ability: create a construct with systemic rules that define all allowed outcomes; a greater ability to predict more actions correctly on the first try, over time, shows greater generalization; and, with more than one construct, the ability to do the same, relative to others.
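For what it’s worth, here is one very simplified way to read that test protocol, as a sketch under my own assumptions (the modular-arithmetic ‘constructs’ and the agent interface are purely illustrative, not taken from anything Zeusfyi has published): score an agent by its first-try prediction accuracy within each rule-governed construct, then average across constructs and compare agents on that number.

```python
from typing import Callable, Dict

# A "construct": a deterministic rule mapping a state to its successor,
# so the rule fixes in advance every outcome that is allowed.
Construct = Callable[[int], int]

constructs: Dict[str, Construct] = {
    "mod-increment": lambda x: (x + 3) % 17,      # hypothetical construct 1
    "mod-doubling": lambda x: (x * 2) % 19,       # hypothetical construct 2
    "mod-quadratic": lambda x: (x * x + 1) % 23,  # hypothetical construct 3
}

# An "agent" maps (construct name, current state) to a predicted next state.
Agent = Callable[[str, int], int]

def first_try_accuracy(agent: Agent, name: str, rule: Construct, trials: int = 100) -> float:
    """Fraction of states whose successor the agent predicts correctly on the
    first attempt, with no retries and no feedback during the test."""
    hits = sum(1 for s in range(trials) if agent(name, s) == rule(s))
    return hits / trials

def generalization_score(agent: Agent) -> float:
    """Average first-try accuracy across constructs; comparing this number
    between agents gives the 'relative to others' part of the proposed test."""
    return sum(first_try_accuracy(agent, n, r) for n, r in constructs.items()) / len(constructs)

# Example: a trivial agent that always guesses "state + 1", ignoring the construct.
naive_agent: Agent = lambda name, state: state + 1
print(f"naive agent's generalization score: {generalization_score(naive_agent):.2f}")
```

Whether this matches what you mean by the ‘universal construct’ is unclear to me; treat it as a rough sketch rather than a faithful implementation.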
Regarding the analogy you use, in which humans (and companies, nation-states, etc.) not being aligned with each other implies that human-machine alignment is equally hard: humans are in competition with other humans, and nation-states are in competition with other nation-states. However, AI algorithms are created by humans as a tool (at least, for now that seems to be the intention). This isn’t to say alignment is therefore possible, but I do think the analogy is flawed.