I agree that the very strong sort of alignment you describe—with the Coherent Extrapolated Volition of humanity, or the collective interest of all sentient beings, or The Form of The Good—is probably impossible and perhaps ill-posed. Insofar as we need this sort of aligned AI for things to go as well as they possibly could, they won’t.
But I don’t see why that’s the only acceptable target. Aligning a superintelligence with the will of basically any psychologically normal human being (narrower than any realistic target except perhaps a profit-maximizer—in which case yeah, we’re doomed) would still be an ok outcome for humans: it certainly doesn’t end in paperclips. And alignment with someone even slightly inclined towards impartial benevolence probably goes much better than the status quo, especially for the extremely poor.
(Animals are at much more risk here, but their current situation is also much worse: I’m extremely uncertain how a far richer world would treat factory farming.)
I think humans may indeed find ways to scale up their control over successive generations of AIs for a while, and successive generations of AIs may be able to exert some control over their successors, and so on. However, I don’t see how at the end of a long chain of successive generations we could be left with anything that cares much about our little primate goals. Even if individual agents within that system still cared somewhat about humans, I doubt the collective behavior of the society of AIs overall would still care, rather than being driven by its own competitive pressures into weird directions.
An analogy I often give is to consider our fish ancestors hundreds of millions of years ago. Through evolution, they produced somewhat smarter successors, who produced somewhat smarter successors, and so on. At each point along that chain, the successors weren’t that different from the previous generation; each generation might have said that they successfully aligned their successors with their goals, for the most part. But over all those generations, we now care about things dramatically different from what our fish ancestors did (e.g., worshipping Jesus, inclusion of trans athletes, preventing children from hearing certain four-letter words, increasing the power and prestige of one’s nation). In the case of AI successors, I expect the divergence may be even more dramatic, because AIs aren’t constrained by biology in the way that both fish and humans are. (OTOH, there might be less divergence if people engineer ways to reduce goal drift and if people can act collectively well enough to implement them. Even if the former is technically possible, I’m skeptical that the latter is socially possible in the real world.)
Some transhumanists are ok with dramatic value drift over time, as long as there’s a somewhat continuous chain from ourselves to the very weird agents who will inhabit our region of the cosmos in a million years. But I don’t find it very plausible that in a million years, the powerful agents in control of the Milky Way will care that much about what certain humans around the beginning of the third millennium CE valued. Technical alignment work might help make the path from us to them more continuous, but I’m doubtful it will avert human extinction in the long run.
Hi Brian, thanks for this reminder about the longtermist perspective on humanity’s future. I agree that in a million years, whatever sentient beings are around may have little interest in or respect for the values that humans happen to have now.
However, one lesson from evolution is that most mutations are harmful, most populations trying to spread into new habitats fail, and most new species go extinct within about a million years. There’s huge survivorship bias in our understanding of natural history.
I worry that this survivorship bias leads us to radically over-estimate the likely adaptiveness and longevity of any new digital sentiences and any new transhumanist innovations. New autonomous advanced AIs are likely to be extremely fragile, just because most new complex systems that haven’t been battle-tested by evolution are extremely fragile.
For this reason, I think we would be foolish to rush into any radical transhumanism, or any more advanced AI systems, until we have explored human potential further, and until we have become successfully, resiliently multi-planetary, if not multi-stellar. Once we have a foothold in the stars, and humanity has reached some kind of asymptote in what un-augmented humanity can accomplish, then it might make sense to think about the ‘next phase of evolution’. Until then, any attempt to push sentient evolution faster will probably result in calamity.
Thanks. :) I’m personally not one of those transhumanists who welcome the transition to weird posthuman values. I would prefer for space not to be colonized at all in order to avoid astronomically increasing the amount of sentience (and therefore the amount of expected suffering) in our region of the cosmos. I think there could be some common ground, at least in the short run, between suffering-focused people who don’t want space colonized in general and existential-risk people who want to radically slow down the pace of AI progress. If it were possible, the Butlerian Jihad solution could be pretty good both for the AI doomers and the negative utilitarians. Unfortunately, it’s probably not politically possible (even domestically, much less internationally), and I’m unsure whether half measures toward it are net good or bad. For example, maybe slowing AI progress in the US would help China catch up, making a competitive race between the two countries more likely, thereby increasing the chance of catastrophic Cold War-style conflict.
Interesting point about most mutants not being very successful. That’s a main reason I tend to imagine that the first AGIs who try to overpower humans, if any, would plausibly fail.
I think there’s some difference in the case of intelligence at the level of humans and above, versus other animals, in adaptability to new circumstances, because human-level intelligence can figure out problems by reason and doesn’t have to wait for evolution to brute-force its way into genetically based solutions. Humans have changed their environments dramatically from the ancestral ones without killing themselves (yet), based on this ability to be flexible using reason. Even the smarter non-human animals display some amount of this ability (cf. the Baldwin effect). (A web search shows that you’ve written about the Baldwin effect and how being smarter leads to faster evolution, so feel free to correct/critique me.)
If you mean that posthumans are likely to be fragile at the collective level, because their aggregate dynamics might result in their own extinction, then that’s plausible, and it may happen to humans themselves within a century or two if current trends continue.
Brian—that all seems reasonable. Much to think about!
Yes, I think we can go further and say that alignment of a superintelligent AGI even with a single individual human may well be impossible. Is such a thing mathematically verifiable as completely watertight, given the orthogonality thesis, basic AI drives and mesaoptimisation? And if it’s not watertight, then all the doom flows through the gaps of imperfect alignment that was thought to be “good enough”. We need a global moratorium on AGI development. This year.