Will Values and Competition Decouple?
(Cross-posted from LessWrong.)
There are a great many forces shaping the evolution of the universe. Among them, the values of agents—systems which attempt to optimize, or steer the future towards certain configurations over others—seem likely to have a dominant influence on the long-term future. The values of the agents around now have been largely determined by competitive pressures. Many people in the rationalist/EA community seem to take it for granted that this is soon going to change, and we will enter an era in which values and competition are completely decoupled; the values of the beings around at the time of this decoupling will be “locked in” and determine the shape of the entire future. I think it is plausible (>30% probability) that they are wrong, and that competition will continue, with at least some strength, indefinitely. If this is true, it has major implications for the likely trajectory of the world and how we should go about influencing the long-term future. In this blog post I’ll lay out why I think this and what the implications are.
Epistemic status: not confident that the thesis is correct; I am confident that the community should be allocating more probability mass to this scenario than they currently are. If you like, imagine prepending every statement with “there is at least a 30% probability that”.
SUMMARY
I sketch three possible scenarios for what the value systems of machine intelligences might look like. In two of these scenarios, values and competition are totally decoupled; in the third, they remain partially coupled.
I present the most basic arguments for and against the occurrence of decoupling. Briefly, the difficulty of ensuring successor alignment might generate competitive pressure towards value systems that try to accrue power to their successors in a value-agnostic way. I define autopoietic agents, systems which increase the number and influence of systems similar to themselves.
I survey some more arguments given in the EA/rationalist community for why value/competition decoupling will occur. None of them decisively refute the continuing influence of the competitive pressure outlined in section 2.
Discussion of implications
Given that values remain subject to competitive pressures, alignment schemes which plan for an AI to competitively pursue its own autopoiesis while ultimately remaining in the service of human values are doomed to failure. This includes MIRI’s CEV and ARC’s alignment schemes.
On the other hand, this gives us less reason to fear the destruction of all value in the universe, since fanatical wrapper minds like paperclip maximizers will be competitively selected against.
If values and competition remain coupled, it might seem that we can have no influence on the future; I argue instead that competition can continue in a path-dependent manner which we can affect. I discuss two ways we could influence the future: (a) attempting to create good successor AGI, whose flourishing is morally valuable from our perspective, (b) using coordination and limited AI to buy time for (a).
Conclusion. In favor of maintaining epistemic equipoise.
Appendix. I discuss what sorts of environments select for greater or lesser degrees of value stability, and conjecture that nearness to qualitatively novel boundaries is an important factor.
1. Machine Intelligence and Value Stability: Three Scenarios
It’s plausible that, sometime this century, we will see the development of artificial general intelligence, software systems with the same cognitive capabilities as humans. The ability of such systems to copy and improve themselves could lead to a great increase in their numbers, speed, and capability, and ultimately a scenario in which more and more improvement occurs in a shorter and shorter span of time until there is an explosion of growth and change—a ‘singularity’. In that event, the resulting AI systems could be far more powerful than the combined forces of humanity, and their decisions would have a decisive influence on the future of the world and ultimately the universe. Thus, it seems very important to understand what kind of values such systems might have, and how they are likely to develop—values being defined as the properties of the universe they tend to optimize towards.
Here are three possible scenarios for future AI values. I believe all are plausible, but the third has been underdiscussed in the rationalist/EA communities.
Utility maximizer goes FOOM: The above process of self-improvement is concentrated in the first system to attain human-level intelligence. At some point during this process, internal ‘pressures’ towards coherence cause the system to become a utility maximizer, and at the same time develop a mature theory of reflective agency. Using this knowledge, the AGI completes the process of self-improvement while maintaining its value system, and thereafter uses its immense cognitive abilities to optimize our future lightcone in accordance with its utility function. Example.
Value lock-in via perfect delegation: Here there is still a process of rapidly increasing self-improvement, but spread out over the entire economy rather than concentrated in a single AI. There will be an entire ecosystem of many AI systems designing their superior future successors who in turn design their successors. Values, however, will become unprecedentedly stable: AI systems, freed of the foibles of biology, will be able to design successor systems which perfectly share their values. This means the initial distribution of values across AIs will become fixed and ultimately determine how the universe is optimized. Example.
Continuing Competition: There is again a process of accelerating change distributed over an economy of virtual agents. However, here it is not assumed that AI systems are necessarily able to create successors with perfect value stability. Instead, values will continue to change over time, being partially determined by the initial distribution of values, but also random drift and competitive forces. Example.
One central factor distinguishing the third scenario from the first two is value/competition decoupling—whether or not competitive forces continue to act on the dominant value systems. The answer to this question seems to strongly shape the expected goodness of the future and how we can influence it. Most alignment researchers seem to explicitly or implicitly assume that value/competition decoupling will occur—with MIRI favoring the first scenario above and Paul Christiano and other ‘prosaic’ alignment researchers favoring the second. While there has been some discussion of scenarios with continued coupling, most notably Robin Hanson’s ems, I believe their likelihood has been underrated and their likely implications underdiscussed.
2. Basic Arguments for and against Decoupling
There are many different arguments and types of evidence that you can bring to bear on the question of whether values and competition will remain coupled. I think of the following as being the most basic arguments for and against the continued influence of competition on values.
Basic Argument for Continued Coupling: Values and competition will remain coupled because agents with certain value systems will be better able to compete and gain resources than others. For example, agents that value hard work and competition might succeed better than hedonistic agents.
Counter-Argument: Past a certain level of sophistication and self-control, agents will be able to recognize if pursuing their values in the short-term disadvantages them in the long-term. They can then adopt the strategies that a more competitive agent would have used, and spend the acquired resources on their values later.
Counter-Counter-Argument: The counter-argument assumes that agents can costlessly ensure that their future self and successors share their values. But different value systems can have an easier or harder time with this—in particular, agents that tend to value any successors having power needn’t worry as much about verifying their successors’ value alignment.
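To make this slightly more concrete, here is a hedged back-of-the-envelope version (the setup and symbols are my own illustration, not a standard model). Suppose that each generation an agent can multiply its influence by a factor g; that guaranteeing a successor shares its exact values costs a fraction c of that growth; and that skipping verification lets a fraction d of successors drift to other values. Then the per-generation growth of the influence that each kind of agent actually cares about is roughly:

```latex
\underbrace{g\,(1-c)}_{\text{picky, verifies successors}}
\qquad
\underbrace{g\,(1-d)}_{\text{picky, skips verification}}
\qquad
\underbrace{g}_{\text{value-agnostic about successors}}
```

Whatever the relative sizes of c and d, the agent whose values are satisfied by any powerful successor compounds at the full rate g, which is the competitive edge the counter-counter-argument points to.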
2.1: Generality of the counter-counter-argument
At a high enough level of abstraction, this basic template covers most of the arguments for and against decoupling that I’ve seen; I think the counter-counter-argument (CCA) provides us with reason to think that continued coupling is plausible, but it’s far from certain. Stated so simply, however, it might sound nitpicky—isn’t this a rather specific scenario?
I instead think it’s very general, because the problem of designing one’s successor is a universally important one. This is clearly true even under the mundane circumstances of biological evolution and human life—but if a ‘singularity’ is indeed likely to occur soon, that implies there may be an even larger competitive advantage for agents that are willing to recklessly experiment with new designs for successors.
‘Ensuring successor alignment’ can also cover a broader range of scenarios than we would normally think of as ‘designing a new successor’. A ‘messy’ agent like a human might fear that it will experience value drift simply from undergoing novel experiences, so agents that care less about such value drift can go about life more freely. This is actually a factor people worry about in human life—e.g. people donating money while young because they fear losing the desire to donate, or deeply religious people who fear learning new things because they might disrupt their faith. These sorts of commitments can make it difficult to accumulate power and knowledge.
Value stability is also important in deciding how broadly and freely to disperse copies of oneself. If you aren’t certain that each of the copies will maintain your values, and can’t establish strong coordination mechanisms, then you may be reluctant to duplicate yourself recklessly. History is filled with tales of countries whose colonies or mercenaries ultimately broke with them: and yet, some of those colonies have been extremely influential, and thus so have their reckless parent countries. These incentives away from value stability can also apply fractally, increasing the influence obtained by cognitive sub-processes that increase their own influence via reckless actions—e.g. if people find that bold, risky moves pay off in certain environments, they may be more inclined to take similarly risky moves in the future, including in ways that threaten to change their overall values.
Overall, I think of the CCA as pointing out a general ‘force’ pushing agents away from perfect value stability. Much as coherence theorems can be thought of as implying a force pushing towards goal-directed behavior, I think the arguments above imply a force pushing agents away from monomaniacal obsession with value stability.
2.2: Autopoiesis
Here’s another way of framing the discussion. Define the class of autopoietic agents to be beings whose actions increase (in expectation) the number and influence of beings similar to themselves in the future.
Autopoietic agents definitionally increase in power and influence. The definition is behavioral: an agent successfully optimizing its successors’ influence is autopoietic, but an effective paperclip maximizer could also be autopoietic; for that matter, agents with deontological or other types of value systems could be autopoietic, if their value systems lead to them making decisions that increase their influence on the future. I think autopoiesis is a useful concept to have because it is the agents that are most effectively autopoietic that will ultimately control the future—basically by definition.
Different autopoietic agents can have successors that are more or less similar to them; the above arguments regarding decoupling suggest that there is a competitive pressure pushing such agents away from maximal similarity—or fidelity—between themselves and their successors.
In addition to this pressure, there is another pushing towards greater value stability. This is simply the fact that agents who create beings more similar to them will have more-similar descendants in the future.
Taken together, these pressures create an optimal level of value stability that will be selected for. This level probably varies a lot depending on the circumstances—I discuss some of the factors that might favor a greater or lesser level of stability in an appendix. For the purposes of this post, the important point is that this optimal level is not necessarily the maximum possible.
If this remains the case into the far future, there will be a competitive pressure towards value systems which place a non-maximal value on stability. In particular, this implies decoupling of values and competition will not occur: both directly because of this pressure, and because non-maximal successor fidelity will lead to a proliferation of value systems which can be selected amongst.
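Here is a minimal toy model of the two pressures, written as a short Python sketch. Everything in it—the fidelity parameter f, the expansion bonus b for reckless successor design, and the drift rate d—is a made-up illustration, not a claim about real dynamics; the only point is that the influence-maximizing fidelity can be interior rather than maximal.

```python
import numpy as np

def influence_growth(f, b, d):
    """Per-generation growth of a value system's influence at successor fidelity f.

    Lowering fidelity (1 - f) buys an expansion bonus b * (1 - f), but loses a
    fraction d * (1 - f) of descendants to value drift. Purely illustrative.
    """
    return (1 + b * (1 - f)) * (1 - d * (1 - f))

fidelities = np.linspace(0, 1, 1001)
b, d = 0.5, 0.3  # hypothetical expansion bonus and drift rate
growth = influence_growth(fidelities, b, d)
print(f"influence-maximizing fidelity: {fidelities[np.argmax(growth)]:.2f}")  # ~0.33, not 1.0
```

For these arbitrary parameters the long-run influence of a value system is maximized at a fidelity of about one third; pushing fidelity to its maximum is strictly worse, which is all the toy model is meant to show.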
3. Further Arguments for Decoupling
So those are some basic arguments for why values might remain subject to competitive forces. I’ve collected some other common arguments in favor of decoupling and responses below.
3.1: Modular Goal Architectures[1]
Argument: Unlike messy humans, future AI systems will have a modular architecture (“wrapper mind”) like AIXI in which there is an explicit utility function component separated from world-model and planning components. Value stability under self-modification can easily be achieved by keeping the utility function constant while the world-model and planning components are changed.
Response: It is far from certain that powerful AI systems will have this form. Current powerful AI systems are too messy for such a simple approach to successor fidelity; difficulties involving mesa-optimizers, ontology identification/ELK, and reward not being the optimization target mean that merely keeping a component of your system labeled ‘utility function’ constant is not guaranteed to actually preserve your values.
3.2: The Orthogonality Thesis
Argument: The orthogonality thesis states that it’s possible to create minds of arbitrary capability levels pursuing arbitrary goals. Thus there exist minds able to succeed at any given level of competition while holding any values.
Response: Although there may exist minds holding arbitrary goals able to compete equally well, that does not imply that they are all equally likely to come into existence. In particular, agents with some value systems may find it harder to design their successors than others.
3.3: Better AI Copying & Surveillance
Argument: Human values are unstable in large part due to foibles of our biology, notably mortality (causing value churn when people holding given values die) and our inability to read others’ minds (making it more difficult to ensure others share our values). AI systems will not be constrained in this way and so will have a much higher level of value stability.
Response: Human value instability is not purely caused by biological quirks. Societies differ in how strongly they attempt to impose their values on their members; e.g. more authoritarian governments attempt to control what their subjects are allowed to say to each other in order to suppress dissent. Despite this, the most powerful human societies of today are not those that most stringently attempt to ensure their own stability, suggesting that there are competitive pressures acting against value stability in humans, not just biological limits. AIs may also have a harder time ensuring value stability in some respects, due to e.g. a much greater ability to alter their mind architecture.
3.4: Coherence Theorems
Argument: Coherence theorems dictate that any agent is either an expected utility maximizer (which will therefore attempt to ensure the value alignment of its successors) or is predictably throwing away resources (and is therefore noncompetitive).
Response: Agents that attempt to accrue power to themselves and their successors should not be exploitable if they are competent. An agent that autopoietically values empowering its successors can be seen as maximizing a utility function—one that assigns higher value to trajectories of the universe where its successors are plentiful and have influence. The arguments given above imply that some utility functions, those that assign value to a broad range of successors, can be more easily maximized than others.
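As a rough formalization of this point (my own sketch, not anything from the coherence-theorem literature), write τ for a trajectory of the universe, power(a) for an agent a’s influence over that trajectory, and s for a similarity weighting; then the autopoietic agent can be read as maximizing

```latex
U(\tau) \;=\; \sum_{a \,\in\, \mathrm{successors}(\tau)} s(a,\ \mathrm{self}) \cdot \mathrm{power}(a).
```

The broader the class of successors that s counts as ‘similar enough’, the less the agent needs to spend on verifying exactly what its successors value—so, per the earlier arguments, this utility function is easier to maximize than one that only rewards successors with a precise value profile.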
3.5: Unipolarity
Argument: The world has seen a trend towards greater centralization of political power. If this trend continues, or if some other force causes the creation of a singleton (e.g. a single AI greatly increasing in power in a FOOM scenario), then the singleton can prevent the occurrence of further competition.
Response: Even a singleton will likely still have competitive pressures of some sort acting on its sub-components. A singleton that totally suppresses internal competition faces the risk of rotting and ultimately being destroyed. Furthermore, it’s not clear if the creation of a singleton is likely to occur. Scenarios where a single agent suddenly gains massively in power are seen as implausible by many. The historical trend towards political centralization might be a side effect of being confined to the Earth: in a future where our civilization has become interstellar, it may be difficult to enforce governance over breakaways who expand into space at close to light-speed.
3.6: Improved Representations
I’ve never seen anyone explicitly make this argument, but I think it’s interesting and might be an underlying driver of intuitions.
Argument: Considering past autopoietic systems, we have seen a continual increase in the quality and comprehensiveness of their representations of the world. While simple bacteria can only track chemical gradients in their immediate environment, humans have rich, accurate representations of the world and its workings stretching to the limits of the reachable universe. Past value instability was largely caused by agents making decisions based on faulty or incomplete representations of the world, so we should expect values to stabilize as representations become closer to accurately reflecting the world.
Response: Just as there is a competitive pressure away from maximal value stability, there is also one against maximally accurate representations—basically, if your representation is good enough then you may be put at a competitive disadvantage if you spend time improving it further rather than using it to seize resources. There also may be Gödelian obstacles to having a fully accurate representation of systems as cognitively powerful as one’s self, such as one would need to ensure perfect value stability of successors.
4. Implications
As mentioned above, I’m not certain that values and competition will remain coupled. For the rest of this piece, however, I’m going to assume that they will, and analyze what the implications would be for the likely outcome of AGI development and policy decisions.
4.1: Indefinitely Scalable Alignment Schemes
One prominent class of alignment schemes proposes that we might achieve competitive, scalable alignment—that is, we might create agents whose goal is to empower humanity, and which can scale to arbitrarily high capability levels while remaining competitive with arbitrary unaligned AI. In a multipolar singularity, such agents could optimize human values by first undergoing autopoietic expansion to gain control of resources, later using these resources to optimize human values. In strong forms, this doesn’t require human-controlled AI to prevent the creation of unaligned AI—they could fight or negotiate with such AI instead, and (by the competitiveness assumption) should in principle succeed about as well as the unaligned AI. The ELK report mentions one such alignment scheme in an appendix, defining a utility function for an AI via an elaborate hypothetical process of delegation. CEV is another example of a utility function that we could give to a fixed-goal-optimizing AGI, although MIRI usually envisions a unipolar singularity.
If values remain subject to competitive pressure indefinitely, this class of schemes cannot work—at least in their strongest form. This is because such schemes require agents that are capable of maintaining their goal of maximizing human values while undergoing a series of extreme self-modifications, in total representing an amount of change and growth comparable to all that has occurred in Earth’s history, all while competing with other equally powerful beings doing the same. Clearly this requires an extreme degree of value stability on the part of the human-values-optimizing AI, so if there is a competitive advantage to agents/sub-processes with more labile value systems, the human-values-optimizing AI has little hope of effectively gaining power while maintaining allegiance to human values.
So, “aligning” AI in this strong sense is more difficult in a world with value/competition coupling. Of course, more limited forms of alignment could still be possible, such as MIRI’s “Task AI” intended to be superintelligent in a particular domain but not more broadly, or act-based agents with limited capabilities.
4.2: Likelihood of all Value in the Universe being Destroyed
Given this difficulty, does continued value/competition coupling imply that all value in the universe (from our perspective) is doomed to be destroyed?
I don’t think this is necessarily the case. While value/competition coupling does make alignment harder, it also makes unaligned AI less bad in expectation. In particular, it means that we are not as likely to create wrapper minds that fanatically re-shape the future according to whatever arbitrary values they are initialized with.
If future AI systems are not wrapper-mind-like, what sort of motivational system will they have? It’s impossible to say in any detail. But if they exist in a world full of continuing competition and value diversification, in some ways resembling the evolutionary process that produced us, I think it’s morally reasonable to think of them as somewhat like an alien species. While obviously I wouldn’t be happy about humanity being disempowered and replaced by an unknown alien species, in expectation it’s better than paperclips. I’d estimate that the value of a future controlled by such an ‘alien species’ is in expectation 10% as good as one in which humans remain in control. Furthermore, as I’ll discuss in the next section, we could improve that number by deliberately creating AIs whose autopoiesis we would regard as valuable.
5. Policy
5.1 Possibility of Influencing the Future
In a world with continued value/competition coupling, you might wonder whether having a lasting influence on the long-term future is even possible, since competitive forces will push the dominant value system towards whatever is globally optimal anyway.
However, that some competition persists indefinitely does not imply that there is a single global optimum we are doomed to be sucked into. Most of the competitive landscape faced by future agents consists of other agents: there can be many different stable Nash equilibria. At the extreme, this simply recovers decoupling, but it’s also possible for some path-dependence to co-exist with some competition. This is what we’ve seen historically: we still carry the idiosyncratic genetic legacy and many behavioral traits of organisms from hundreds of millions of years ago, although there has been fairly harsh competition during this entire period.
The difference between this sort of path-dependence and locked-in value stability is that, while we can anticipate that our descendants will share many features and values inherited from us, we can’t predict ahead of time that any particular feature will remain perfectly stable. Compared to aligning a fixed-goal-AGI, this feels like a much more robust way of passing on our values: like valuing people because you think they are intrinsically good, vs. valuing a sociopath whom you have trained or incentivized to pursue what you regard as good.
One way of thinking about the future in non-decoupled worlds is as a continuation of regular history, just at a faster tempo. When thinking about the singularity, there is a tendency to see it, in far mode, as a simple process that will produce a simple outcome, e.g. a utility-maximizing AGI. It might be better to think of it as a vast stretch of time, full of all the complications and twists of regular history, that happens to be compressed into a smaller number of cycles around the Sun than usual. Designing our AGI successors in such a world is similar to passing on control to our children: we can’t anticipate every possible future challenge they will face, but what we can hope to do is pass on our values and knowledge, to give them the best shot possible at navigating whatever future challenges come up, including the challenges of managing future competition and value drift. The big difference is that we can’t rely on biology to pass on our implicit values as we usually do: instead we will need to figure out what sorts of AGIs we can create that we would be happy to see flourishing on their own terms: a good successor AI, rather than an aligned one.
5.2 Good Successors
So how could we create a good successor AI? Are there any such things?
One example of AIs that would count as good successors: ems. Creating a society of highly-accurate human brain emulations would constitute a good successor AI, since they would by definition share human values, and would be in a far better position than baseline humans to navigate the singularity, due to their ability to rapidly copy and alter themselves.[2] Unfortunately it doesn’t seem likely that we’re going to be able to make ems before the advent of human-level AI.
As an alternative, we could instead create AI that is similar enough to the brain that it retains moral value from our perspective. There are lots of features of human brains that are pretty idiosyncratic to our biology and that we would be fine with losing; on a larger scale, I suspect most mammal species would produce a civilization we would regard as morally valuable, if upgraded in intelligence and uploaded. The big question is how complex the features of the human/mammal brain that matter most for moral value turn out to be.
There are currently a few research agendas attempting to reverse-engineer how human values actually work on a neurological level, for instance Steve Byrnes’ model of brain-like AGI and Shard Theory. Optimistically, if they succeed and find that our value system is algorithmically simple, creating good successor AI might be as simple as copying that algorithm to silicon.[3]
This earlier-linked post by Paul contains another proposal for how we might create good successor AI, by simulating alien evolution (and presenting the aliens with a recursive copy of the same scenario). This seems like it might be difficult to pull off in full detail before HLAI arrives, but less ambitious versions of the same proposal could still be a useful tool in obtaining a good successor AI. “Sympathy with other value systems” also might be a key desideratum for any potential good successor.
5.3 Delay
In worlds where competition continues to influence values, our main route for affecting the singularity and beyond is developing good successor AI. But this doesn’t mean that direct research on such AI is the only worthwhile thing we can do—we can also extend the time which we have for deliberation by delaying AGI deployment. A lot of this depends on the details of geopolitical policy and is beyond the scope of this essay, so my remarks here will be somewhat brief.
Coordination is obviously crucial. Developing better, more rigorous versions of arguments for AI risk could be quite helpful here, as could spreading awareness of existing arguments among influential people and the broader public.
Limited AI systems could also be helpful. The above-mentioned Task AGI, or act-based agents, could be deployed to detect and counteract the emergence of unaligned AGI. Such systems could also be useful for consuming the ‘free energy’ (h/t Paul) that an unaligned AI would use to expand, such as by running ML models designed to find and patch holes in computer security.
If value/competition coupling continues to hold, then there is a limit to how long we can delay without incurring a competitive disadvantage or rotting. The optimal amount of time to delay will depend on the details of the geopolitical situation and AI development, and will likely have to be worked out as we go.
6. Conclusion
In closing, I again emphasize that I am not certain that value/competition coupling will continue. However, reflecting on all the arguments and evidence above, my overall feeling is that it is (at least) comparably likely to the alternatives. In some ways the picture of the singularity thus painted might seem a bit less urgent than the typical arguments suggest: it is harder for us to permanently lock in our current values, but also less likely that all value(from our perspective) will be permanently destroyed. The stakes are only mildly less apocalyptic, however—it is still the case that a massive rupture in the normal line of succession may be coming soon, with little time for us to prepare.
In the face of such an event, urgency is appropriate. Urgency is not all that is needed, though—what is equally important is epistemic equipoise, the ability to carefully track what you do know and what you don’t. Maintaining this equipoise is likely to be a necessity if we are to navigate the most important century successfully. My hope is that by bringing attention to some neglected arguments, this essay can help the rationalist/EA community track more possible futures and be ready for whatever may happen.
(The time spent writing this post was sponsored by the FTX Future Fund regranting program. Thanks to Simeon Campos for discussion and encouragement and Justis Mills from the LW team for help with editing.)
Appendix: Value Stability and Boundaries
Epistemic status: pure, unbridled speculation
The optimal level of value stability plays a crucial role in the analysis above. What features of the environment and agents affect this optimal level? I conjecture being near a complex or novel boundary, either in physical or conceptual space, makes the optimal level of value stability lower; being far from complex, novel boundaries makes the optimal level higher.
By “being near a boundary” I mean having access to relatively unclaimed/virgin/unoptimized resources. In physical space this would be gaining access to some previously unoccupied area of space; for example a spacefaring civilization expanding into untouched solar systems. In conceptual space this is coming up with a novel class of useful ideas, for instance new processor designs or neural net architectures. By “qualitatively novel boundary” I mean a boundary that is not just adjacent to new resources/ideas, but resources/ideas configured in a different way from previous boundaries that the agents in question have encountered.
When near a boundary, fresh resources are plentiful, so agents there can, on average, gain in power/number of descendants. In places far from boundaries, where there is a fixed supply of resources, the average growth factor in power/descendants across a population of agents must equal one. Hence, agents near boundaries have more to gain from reckless expansion. Agents which quickly grab a lot of the new resources are selected for.
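Reusing the toy model from section 2.2 (again, purely illustrative numbers of my own): treat nearness to a boundary as a larger expansion bonus b while holding the drift rate d fixed, and look at how the influence-maximizing fidelity shifts.

```python
import numpy as np

def influence_growth(f, b, d):
    """Per-generation influence growth at successor fidelity f (toy model from section 2.2)."""
    return (1 + b * (1 - f)) * (1 - d * (1 - f))

fidelities = np.linspace(0, 1, 1001)
d = 0.3  # hypothetical drift rate, held fixed
for b in [0.32, 0.4, 0.5, 0.7]:  # richer/nearer boundaries -> larger expansion bonus
    best = fidelities[np.argmax(influence_growth(fidelities, b, d))]
    print(f"expansion bonus b = {b:.2f}: optimal fidelity ~ {best:.2f}")
```

With these made-up parameters the optimal fidelity falls from roughly 0.9 to nearly 0 as the expansion bonus grows, which matches the direction of the conjecture: far from boundaries, stability pays; near rich boundaries, recklessness does.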
“Qualitatively novel” boundaries provide an additional pressure away from value stability in that their novelty makes it difficult to rigorously verify the behaviour of successors across them. A completely new class of mind architecture might promise great gains in capability, but make proving alignment harder. It may be harder for successors to coordinate in totally uncharted & unknown territory.
The property of “being a novel boundary” is not binary. The physical and conceptual landscapes are fractal, containing nested sub-divisions with their own boundaries. Agents will differ in what they consider to be ‘uncharted territory’—territory that has only been lightly exploited by one class of agents might appear optimal for expansion to a more sophisticated class. It seems plausible that the future will contain enough novel boundaries in conceptual and physical space to incentivize non-maximal value stability for a long subjective time.
[1] TBF I’m not sure if I’ve seen anyone make this exact argument, at least in such a simple-minded way; nevertheless I think it’s an important background driver of intuitions, so I’m including it.
[2] You might dispute that since ems share human values, they are in fact aligned with humanity, not just good successors. Here by aligned I mean “aligned with their human operators”, so a society of ems would not qualify if they decided to pursue their own interests rather than those of their operators.
[3] This is not to say that either research agenda is only useful for creating good successor AI—the same insights could be useful for creating ‘traditional’ aligned AI as well.
Will Values and Competition Decouple?
(Cross-posted from LessWrong.)
There are a great many forces shaping the evolution of the universe. Among them, the values of agents—systems which attempt to optimize, or steer the future towards certain configurations over others—seem likely to have a dominant influence on the long-term future. The values of the agents around now have been largely determined by competitive pressures. Many people in the rationalist/EA community seem to take it for granted that this is soon going to change, and we will enter an era in which values and competition are completely decoupled; the values of the beings around at the time of this decoupling will be “locked in” and determine the shape of the entire future. I think is it plausible(>30% probability) that they are wrong, and that competition will continue, with at least some strength, indefinitely. If this is true, it has major implications for the likely trajectory of the world and how we should go about influencing the long-term future. In this blog post I’ll lay out why I think this and what the implications are.
Epistemic status: not confident that the thesis is correct; I am confident that the community should be allocating more probability mass to this scenario than they currently are. If you like, imagine prepending every statement with “there is at least a 30% probability that”.
SUMMARY
I sketch three possible scenarios for what the value systems of machine intelligences might look like. In two of these scenarios, values and competition are totally decoupled; in the third, they remain partially coupled.
I present the most basic arguments for and against the occurrence of decoupling. Briefly, the difficulty of ensuring successor alignment might generate competitive pressure towards value systems that try to accrue power to their successors in a value-agnostic way. I define autopoietic agents, systems which increase the number and influence of systems similar to themself.
I survey some more arguments given in the EA/rationalist community for why value/competition decoupling will occur. None of them decisively refute the continuing influence of the competitive pressure outlined in section 2.
Discussion of implications
Given that values remain subject to competitive pressures, alignment schemes which plan for an AI to competitively pursue its own autopoiesis while ultimately remaining in the service of human values are doomed to failure. This includes MIRI’s CEV and ARC’s alignment schemes.
On the other hand, this gives us less reason to fear the destruction of all value in the universe, since fanatical wrapper minds like paperclip maximizers will be competitively selected against.
If values and competition remain coupled, it might seem that we can have no influence on the future; I argue instead that competition can continue in a path-dependent manner which we can affect. I discuss two ways we could influence the future: (a) attempting to create good successor AGI, whose flourishing is morally valuable from our perspective, (b) using coordination and limited AI to buy time for (a).
Conclusion. In favor of maintaining epistemic equipoise.
Appendix. I discuss what sorts of environments select for greater or lesser degrees of value stability, and conjecture that nearness to qualitatively novel boundaries is an important factor.
1. Machine Intelligence and Value Stability: Three Scenarios
It’s plausible that, sometime this century, we will see the development of artificial general intelligence, software systems with the same cognitive capabilities as humans. The ability of such systems to copy and improve themselves could lead to a great increase in their numbers, speed, and capability, and ultimately a scenario in which more and more improvement occurs in a shorter and shorter span of time until there is an explosion of growth and change—a ‘singularity’. In the event, the resulting AI systems could be far more powerful than the combined forces of humanity, and their decisions would have a decisive influence on the future of the world and ultimately the universe. Thus, it seems very important to understand what kind of values such systems might have, and how they are likely to develop—values being defined as the properties of the universe they tend to optimize towards.
Here are three possible scenarios for future AI values. I believe all are plausible, but the third has been underdiscussed in the rationalist/EA communities.
Utility maximizer goes FOOM: The above process of self-improvement is concentrated in the first system to attain human-level intelligence. At some point during this process, internal ‘pressures’ towards coherence cause the system to become a utility maximizer, and at the same time develop a mature theory of reflective agency. Using this knowledge, the AGI completes the process of self-improvement while maintaining its value system, and thereafter uses its immense cognitive abilities to optimize our future lightcone in accordance with its utility function. Example.
Value lock-in via perfect delegation: Here there is still a process of rapidly increasing self-improvement, but spread out over the entire economy rather than concentrated in a single AI. There will be an entire ecosystem of many AI systems designing their superior future successors who in turn design their successors. Values, however, will become unprecedentedly stable: AI systems, freed of the foibles of biology, will be able to design successor systems which perfectly share their values. This means the initial distribution of values across AIs will become fixed and ultimately determine how the universe is optimized. Example.
Continuing Competition: There is again a process of accelerating change distributed over an economy of virtual agents. However, here it is not assumed that AI systems are necessarily able to create successors with perfect value stability. Instead, values will continue to change over time, being partially determined by the initial distribution of values, but also random drift and competitive forces. Example.
One central factor distinguishing the third scenario from the first two is value/competition decoupling—whether or not competitive forces continue to act on the dominant value systems. Whether or not this is true seems like a central factor influencing the expected goodness of the future and how we can influence it. Most alignment researchers seem to explicitly or implicitly assume that value/competition decoupling will occur—with MIRI favoring the first scenario above and Paul Christiano and other ‘prosaic’ alignment researchers favoring the second. While there has been some discussion of scenarios with continued coupling, most notably Robin Hanson’s ems, I believe their likelihood has been underrated and their likely implications underdiscussed.
2. Basic Arguments for and against Decoupling
There are many different arguments and types of evidence that you can bring to bear on the question of whether values and competition will remain coupled. I think of the following as being the most basic arguments for and against the continued influence of competition on values.
Basic Argument for Continued Coupling: Values and competition will remain coupled because agents with certain value systems will better be able to compete and gain resources than others. For example, agents that value hard work and competition might succeed better than hedonistic agents.
Counter-Argument: Past a certain level of sophistication and self-control, agents will be able to recognize if pursuing their values in the short-term disadvantages them in the long-term. They can then adopt the strategies that a more competitive agent would have used, and spend the acquired resources on their values later.
Counter-Counter-Argument: The counter-argument assumes that agents can costlessly ensure that their future self and successors share their values. But different value systems can have an easier or harder time with this—in particular, agents that tend to value any successors having power needn’t worry as much about verifying their successors’ value alignment.
2.1: Generality of the counter-counter-argument
At a high enough level of abstraction, this basic template covers most of the arguments for and against decoupling that I’ve seen; I think the CCA provides us with reason to think that continued coupling is plausible, but it’s far from certain. Stated so simply, however, it might sound nitpicky—isn’t this a rather specific scenario?
I instead think it’s very general, because the problem of designing one’s successor is a universally important one. This is clearly true even under the mundane circumstances of biological evolution and human life—but if a ‘singularity’ is indeed likely to occur soon, that implies there may be an even larger competitive advantage for agents that are willing to recklessly experiment with new designs for successors.
‘Ensuring successor alignment’ can also cover a broader range of scenarios than we would normally think of as ‘designing a new successor’. A ‘messy’ agent like a human might fear that it will experience value drift simply from undergoing novel experiences, so agents that care less about such value drift can go about life more freely. This is actually a factor people worry about in human life—e.g. people donating money while young because they fear losing the desire to donate, or deeply religious people who fear learning new things because they might disrupt their faith. These sorts of commitments can make it difficult to accumulate power and knowledge.
Value stability is also important in deciding how broadly and freely to disperse copies of oneself. If you aren’t certain that each of the copies will maintain your values, and can’t establish strong coordination mechanisms, then you may be reluctant to duplicate yourself recklessly. History is filled with tales of countries whose colonies or mercenaries ultimately broke with them: and yet, some of those colonies have been extremely influential, and thus so have their reckless parent countries. These incentives away from value stability can also apply fractally, increasing the influence obtained by cognitive sub-processes that increase their own influence via reckless actions—e.g. if people find that bold, risky moves pay off in certain environments, they may be more inclined to take similarly risky moves in the future, including in ways that threaten to change their overall values.
Overall, I think of the CCA as pointing out a general ‘force’ pushing agents away from perfect value stability. Much as coherence theorems can be thought of as implying a force pushing towards goal-directed behavior, I think the arguments above imply a force pushing agents away from monomaniacal obsession with value stability.
2.2: Autopoiesis
Here’s another way of framing the discussion. Define the class of autopoietic agents to be beings whose actions increase(in expectation) the number and influence of beings similar to itself in the future. Autopoietic agents definitionally increase in power and influence. The definition is behavior; an agent successfully optimizing its successors’ influence is autopoietic, but an effective paperclip maximizer could also be autopoietic; for that matter, agents with deontological or other types of value systems could be autopoietic, if their value systems lead to them making decisions that increase their influence on the future. I think autopoiesis is a useful concept to have because it is the agents that are most effectively autopoietic that will ultimately control the future—basically by definition.
Different autopoietic agents can have successors that are more or less similar to them; the above arguments re:decoupling suggests that there is a competitive pressure pushing such agents from maximal similarity—or fidelity—between themselves and their successors.
In addition to this pressure, there is another pushing towards greater value stability. This is simply the fact that agents who create beings more similar to them, will have more-similar descendants in the future.
Taken together, these pressures create an optimal level of value stability that will be selected for. This level probably varies a lot depending on the circumstances—I discuss some of the factors that might favor a greater or lesser level of stability in an appendix. For the purposes of this post, the important point is that this optimal level is not necessarily the maximum possible
If this remains the case into the far future, there will be a competitive pressure towards value systems which place a non-maximal value on stability. In particular, this implies decoupling of values and competition will not occur: both directly because of this pressure, and because non-maximal successor fidelity will lead to a proliferation of value systems which can be selected amongst.
3. Further Arguments for Decoupling
So those are some basic arguments for why values might remain subject to competitive forces. I’ve collected some other common arguments in favor of decoupling and responses below.
3.1: Modular goal architectures[1]
Argument: Unlike messy humans, future AI systems will have a modular architecture(“wrapper mind”) like AIXI in which there is an explicit utility function component separated from world-model and planning components. Value stability under self-modification can easily be achieved by keeping the utility function constant while the world-model and planning components are changed.
Response: It is far from certain that powerful AI systems will have this form. Current powerful AI systems are too messy for such a simple approach to successor fidelity; difficulties involving mesa-optimizers, ontology identification/ELK, and reward not being the optimization target mean that merely keeping a component of your system labeled ‘utility function’ constant is not guaranteed to actually preserve your values.
3.2: The Orthogonality Thesis
Argument: The orthogonality thesis states it’s possible to create minds of arbitrary capability levels pursuing arbitrary goals. Thus there exist minds able to succeed at any given level of competition while holding any values.
Response: Although there may exist minds holding arbitrary goals able to compete equally well, that does not imply that they are all equally likely to come into existence. In particular agents with some value systems may find it harder to design their successors than others.
3.3: Better AI Copying & Surveillance
Argument: Human values are unstable in large part due to foibles of our biology, notably mortality(causing value churn when people holding given values die) and our inability to read others’ minds(making it more difficult to ensure others share our values). AI systems will not be constrained in this way and so will have a much higher level of value stability.
Response: Human value instability is not purely caused by biological quirks. Societies differ in how strongly they attempt to impart their values on their members, e.g. more authoritarian governments attempt to control what their subjects are allowed to say to each other in order to suppress dissent. Despite this, the most powerful human societies of today are not those that most stringently attempt to ensure their own stability, suggesting that their are competitive pressures acting against value stability in humans, not just biological limits. AIs may also have a harder time ensuring value stability in some respects, due to e.g. a much greater ability to alter their mind architecture.
3.4: Coherence Theorems
Argument: Coherence theorems dictate that any agent is either an expected utility maximizer(which will theorefore attempt to ensure the value alignment of its successors) or is predictably throwing away resources(and is therefore noncompetitive)
Response: Agents that attempt to accrue power to themselves and their successors should not be exploitable if they are competent. An agent that autopoietically values empowering its successors can be seen as maximizing a utility function—one that assigns higher value to trajectories of the universe where its successors are plentiful and have influence. The arguments given above imply that some utility functions, those that assign value to a broad range of successors, can be more easily maximized than others.
3.5: Unipolarity
Argument: The world has seen a trend towards greater centralization of political power. If this trend continues, or if some other force causes the creation of a singleton(e.g. a single AI greatly increasing in power in a FOOM scenario) then the singleton can prevent the occurrence of further competition.
Response: Even a singleton will likely still have competitive pressures of some sort acting on its sub-components. A singleton that totally suppresses internal competition faces the risk of rotting and ultimately being destroyed. Furthermore, it’s not clear if the creation of a singleton is likely to occur. Scenarios where a single agent suddenly gains massively in power are seen as implausible by many. The historical trend towards political centralization might be a side effect of being confined to the Earth: in a future where our civilization has become interstellar, it may be difficult to enforce governance over breakaways who expand into space at close to light-speed.
3.6: Improved Representations
I’ve never seen anyone explicitly make this argument, but I think it’s interesting and might be an underlying driver of intuitions.
Argument: Considering past autopoietic systems, we have seen a continual increase in the quality and comprehensiveness of their representations of the world. While simple baceteria can only track chemical gradients in their immediate environment, humans have rich, accurate representations of the world and its workings stretching to the limits of the reachable universe. Past value instability was largely caused by agents making decisions based on faulty or incomplete representations of the world, so we should expect values to stabilize as representations become closer to accurately reflecting the world.
Response: Just as there is a competitive pressure away from maximal value stability, there is also one against maximally accurate representations—basically, if your representation is good enough then you may be put at a competitive disadvantage if you spend time improving it further rather than using it to seize resources. There also may be Gödelian obstacles to having a fully accurate representation of systems as cognitively powerful as one’s self, such as one would need to ensure perfect value stability of successors.
4. Implications
As mentioned above, I’m not certain that values and competition will remain coupled. For the rest of this piece, however, I’m going to assume that they will, and analyze what the implications would be for the likely outcome of AGI development and policy decisions.
4.1: Indefinitely Scalable Alignment Schemes
One prominent class of alignment schemes proposes that we might achieve competitive, scalable alignment—that is, we might create agents whose goal is to empower humanity, and which can scale to arbitrarily high capability levels while remaining competitive with arbitrary unaligned AI. In a multipolar singularity, such agents could optimize human values by first undergoing autopoietic expansion to gain control of resources, later using these resources to optimize human values. In strong forms, this doesn’t require human-controlled AI to prevent the creation of unaligned AI—they could fight or negotiate with such AI instead, and(by the competitiveness assumption) should in principle succeed about as well as the unaligned AI. The ELK report mentions one such alignment scheme in an appendix, defining a utility function for an AI via an elaborate hypothetical process of delegation. CEV is another example of a utility function that we could give to a fixed-goal-optimizing AGI, although MIRI usually envisions a unipolar singularity.
If values remain subject to competitive pressure indefinitely, this class of schemes cannot work—at least in their strongest form. This is because such schemes require agents that are capable of maintaining their goal of maximizing human values while undergoing a series of extreme self-modifications, in total representing an amount of change and growth comparable to all that has occurred in Earth’s history, all while competing with other equally powerful beings doing the same. Clearly this requires an extreme degree of value stability on the part of the human-values-optimizing AI, so if there is a competitive advantage to agents/sub-processes with more labile value systems, the human-values-optimizing AI has little hope of effectively gaining power while maintaining allegiance to human values.
So, “aligning” AI in this strong sense is more difficult in a world with value/competition coupling. Of course, more limited forms of alignment could still be possible, such as MIRI’s “Task AI” intended to be superintelligent in a particular domain but not more broadly, or act-based agents with limited capabilities.
4.2: Likelihood of all Value in the Universe being Destroyed
Given this difficulty, does continued value/competition coupling imply that all value in the universe(from our perspective) is doomed to be destroyed?
I don’t think this is necessarily the case. While value/competition coupling does make alignment harder, it also makes unaligned AI less bad in expectation. In particular, it means that we are not as likely to create wrapper minds that fanatically re-shape the future according to whatever arbitrary values they are initialized with.
If future AI systems are not wrapper-mind-like, what sort of motivational system will they have? It’s impossible to say in any detail. But if they exist in a world full of continuing competition and value diversification, in some ways resembling the evolutionary process that produced us, I think it’s morally reasonable to think of them as somewhat like an alien species. While obviously I wouldn’t be happy about humanity being disempowered and replaced by an unknown alien species, in expectation it’s better than paperclips. I’d estimate that the value of a future controlled by such an ‘alien species’ is in expectation 10% as good as one in which humans remain in control. Furthermore, as I’ll discuss in the next section, we could improve that number by deliberately creating AIs whose autopoiesis we would regard as valuable.
5. Policy
5.1 Possibility of Influencing the Future
In a world with continued value/competition coupling, you might wonder whether having a lasting influence on the long-term future is even possible, since competitive forces will push the dominant value system towards whatever is globally optimal anyway.
However, that some competition persists indefinitely does not imply that there is a single global optimum we are doomed to be sucked into. Most of the competitive landscape faced by future agents consists of other agents: there can be many different stable Nash equilibria. At the extreme, this simply recovers decoupling, but it’s also possible for some path-dependence to co-exist with some competition. This is what we’ve seen historically: we still carry the idiosyncratic genetic legacy and many behavioral traits of organisms from hundreds of millions of years ago, although there has been fairly harsh competition during this entire period.
The difference between this sort of path-dependence and locked-in value stability is that, while we can anticipate that our descendants will share many features and values inherited from us, we can’t predict ahead of time that any particular feature will remain perfectly stable. Compared to aligning a fixed-goal-AGI, this feels like a much more robust way of passing on our values: like valuing people because you think they are intrinsically good, VS. valuing a sociopath who you have trained or incentivized to pursue what you regard as good.
One way of thinking about the future in non-decoupled worlds is as a continuation of regular history, just at a faster tempo. When thinking about the singularity, there is a tendency to see it, in far mode, as a simple process that will produce a simple outcome, e.g. a utility-maximizing AGI. It might be better to think of it as a vast stretch of time, full of all the complications and twists of regular history, that happens to be compressed into a smaller number of cycles around the Sun than usual. Designing our AGI successors in such a world is similar to passing on control to our children: we can’t anticipate every challenge they will face, but we can hope to pass on our values and knowledge, giving them the best possible shot at navigating whatever comes up, including the challenges of managing future competition and value drift. The big difference is that we can’t rely on biology to pass on our implicit values as we usually do; instead we will need to figure out what sorts of AGIs we could create that we would be happy to see flourishing on their own terms: a good successor AI, rather than an aligned one.
5.2 Good Successors
So how could we create a good successor AI? Are there any such things?
One example of AIs that would count as good successors: ems. Creating a society of highly-accurate human brain emulations would constitute a good successor AI, since they would by definition share human values, and would be in a far better position than baseline humans to navigate the singularity, due to their ability to rapidly copy and alter themselves.[2] Unfortunately it doesn’t seem likely that we’re going to be able to make ems before the advent of human-level AI.
As an alternative, we could instead create AI that is similar enough to the human brain that it retains moral value from our perspective. There are lots of features of human brains that are pretty idiosyncratic to our biology and that we would be fine with losing; on a larger scale, I suspect most mammal species would produce a civilization we would regard as morally valuable, if upgraded in intelligence and uploaded. The big question is how complex the features of the human/mammal brain that matter most for moral value turn out to be.
There are currently a few research agendas attempting to reverse-engineer how human values actually work on a neurological level, for instance Steve Byrnes’ model of brain-like AGI and Shard Theory. Optimistically, if they succeed and find that our value system is algorithmically simple, creating good successor AI might be as simple as copying that algorithm to silicon.[3]
This earlier-linked post by Paul contains another proposal for how we might create good successor AI, by simulating alien evolution (and presenting the aliens with a recursive copy of the same scenario). This seems difficult to pull off in full detail before HLAI arrives, but less ambitious versions of the same proposal could still be a useful tool in obtaining a good successor AI. “Sympathy with other value systems” might also be a key desideratum for any potential good successor.
5.3 Delay
In worlds where competition continues to influence values, our main route for affecting the singularity and beyond is developing good successor AI. But this doesn’t mean that direct research on such AI is the only worthwhile thing we can do: we can also extend the time we have for deliberation by delaying AGI deployment. A lot of this depends on the details of geopolitical policy and is beyond the scope of this essay, so my remarks here will be brief.
Coordination is obviously crucial. Developing better, more rigorous versions of arguments for AI risk could be quite helpful here, as could spreading awareness of existing arguments among influential people and the broader public.
Limited AI systems could also be helpful. The above-mentioned Task AGI, or act-based agents, could be deployed to detect and counteract the emergence of unaligned general AGI. Such systems could also be useful for consuming the ‘free energy’ (h/t Paul) that an unaligned AI would use to expand, such as by running ML models designed to find and patch holes in computer security.
If value/competition coupling continues to hold, then there is a limit to how long we can delay without incurring a competitive disadvantage or rotting. The optimal amount of time to delay will depend on the details of the geopolitical situation and AI development, and will likely have to be worked out as we go.
6. Conclusion
In closing, I again emphasize that I am not certain that value/competition coupling will continue. However, reflecting on all the arguments and evidence above, my overall feeling is that it is (at least) comparably likely to the alternatives. In some ways the picture of the singularity thus painted might seem a bit less urgent than the typical arguments suggest: it is harder for us to permanently lock in our current values, but also less likely that all value (from our perspective) will be permanently destroyed. The stakes are only mildly less apocalyptic, however—it is still the case that a massive rupture in the normal line of succession may be coming soon, with little time for us to prepare.
In the face of such an event, urgency is appropriate. Urgency is not all that is needed, though—what is equally important is epistemic equipoise, the ability to carefully track what you do know and what you don’t. Maintaining this equipoise is likely to be a necessity if we are to navigate the most important century successfully. My hope is that by bringing attention to some neglected arguments, this essay can help the rationalist/EA community track more possible futures and be ready for whatever may happen.
(The time spent writing this post was sponsored by the FTX Future Fund regranting program. Thanks to Simeon Campos for discussion and encouragement, and to Justis Mills from the LW team for help with editing.)
Appendix: Value Stability and Boundaries
Epistemic status: pure, unbridled speculation
The optimal level of value stability plays a crucial role in the analysis above. What features of the environment and of agents affect this optimal level? I conjecture that being near a complex or novel boundary, in either physical or conceptual space, lowers the optimal level of value stability, while being far from complex, novel boundaries raises it.
By “being near a boundary” I mean having access to relatively unclaimed/virgin/unoptimized resources. In physical space this would be gaining access to some previously unoccupied area of space; for example a spacefaring civilization expanding into untouched solar systems. In conceptual space this is coming up with a novel class of useful ideas, for instance new processor designs or neural net architectures. By “qualitatively novel boundary” I mean a boundary that is not just adjacent to new resources/ideas, but resources/ideas configured in a different way from previous boundaries that the agents in question have encountered.
When near a boundary, fresh resources are plentiful, so agents there can, on average, gain in power and number of descendants. Far from boundaries, where the supply of resources is fixed, the average growth factor in power/descendants across a population of agents must equal one: any gain by one agent comes at another’s expense. Hence, agents near boundaries have more to gain from reckless expansion, and agents which quickly grab a lot of the new resources are selected for.
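As a toy illustration of this selection pressure (my own sketch; all parameters are hypothetical), suppose the new resources released by a frontier each step are divided among lineages in proportion to their expansion speed, and that a ‘labile’ lineage expands faster because it skips careful successor verification. With no frontier, relative shares stay frozen; with a frontier, the labile lineage’s share of total resources climbs.

```python
# Toy model of selection near a boundary (all parameters hypothetical).
# Each step a frontier releases `fresh` units of unclaimed resources, split among
# lineages in proportion to (current resources) x (expansion speed). The labile
# lineage expands faster; the stable lineage spends effort on successor verification.

def simulate(fresh, steps=50):
    resources = {"stable": 50.0, "labile": 50.0}
    speed = {"stable": 1.0, "labile": 1.5}  # labile lineage grabs new territory faster
    for _ in range(steps):
        weights = {k: resources[k] * speed[k] for k in resources}
        total_weight = sum(weights.values())
        for k in resources:
            resources[k] += fresh * weights[k] / total_weight  # share of new resources
    total = sum(resources.values())
    return {k: round(v / total, 3) for k, v in resources.items()}

print("no frontier:", simulate(fresh=0.0))   # shares stay at 0.5 / 0.5
print("frontier:   ", simulate(fresh=10.0))  # labile lineage's share rises above 0.5
```

In the saturated regime the labile lineage’s speed advantage buys it nothing in this model; it only pays off where there is something new to grab.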
“Qualitatively novel” boundaries provide an additional pressure away from value stability in that their novelty makes it difficult to rigorously verify the behavior of successors across them. A completely new class of mind architecture might promise great gains in capability while making alignment harder to prove. It may also be harder for successors to coordinate in totally uncharted and unknown territory.
The property of “being a novel boundary” is not binary. The physical and conceptual landscapes are fractal, containing nested sub-divisions with their own boundaries. Agents will differ in what they consider to be ‘uncharted territory’—territory that has only been lightly exploited by one class of agents might appear optimal for expansion to a more sophisticated class. It seems plausible that the future will contain enough novel boundaries in conceptual and physical space to incentivize non-maximal value stability for a long subjective time.
TBF I’m not sure if I’ve seen anyone make this exact argument, at least in such a simple-minded way; nevertheless, I think it’s an important background driver of intuitions, so I’m including it here.
You might object that, since ems share human values, they are in fact aligned with humanity, not just good successors. Here by aligned I mean “aligned with their human operators”, so a society of ems would not qualify if they decided to pursue their own interests rather than those of their operators.
This is not to say that either research agenda is only useful for creating good successor AI—the same insights could be useful for creating ‘traditional’ aligned AI as well.