AGI and Lock-In

Link post

The long-term future of intelligent life is currently unpredictable and undetermined. In the linked document, we argue that the invention of artificial general intelligence (AGI) could change this by making extreme types of lock-in technologically feasible. In particular, we argue that AGI would make it technologically feasible to (i) perfectly preserve nuanced specifications of a wide variety of values or goals far into the future, and (ii) develop AGI-based institutions that would (with high probability) competently pursue any such values for at least millions, and plausibly trillions, of years.

The rest of this post contains the summary (6 pages), with links to relevant sections of the main document (40 pages) for readers who want more details.

0.0 The claim

Life on Earth could survive for millions of years. Life in space could plausibly survive for trillions of years. What will happen to intelligent life during this time? Some possible claims are:

A. Humanity will almost certainly go extinct in the next million years.

B. Under Darwinian pressures, intelligent life will spread throughout the stars and rapidly evolve toward maximal reproductive fitness.

C. Through moral reflection, intelligent life will reliably be driven to pursue some specific “higher” (non-reproductive) goal, such as maximizing the happiness of all creatures.

D. The choices of intelligent life are deeply, fundamentally uncertain. It will at no point be predictable what intelligent beings will choose to do in the following 1000 years.

E. It is possible to stabilize many features of society for millions or trillions of years. But it is possible to stabilize them into many different shapes — so civilization’s long-term behavior is contingent on what happens early on.

Claims A-C assert that the future is basically determined today. Claim D asserts that the future is, and will remain, undetermined. In this document, we argue for claim E: Some of the most important features of the future of intelligent life are currently undetermined but could become determined relatively soon (relative to the trillions of years life could last).

In particular, our main claim is that artificial general intelligence (AGI) will make it technologically feasible to construct long-lived institutions pursuing a wide variety of possible goals. We can break this into three assertions, all conditional on the availability of AGI:

  1. It will be possible to preserve highly nuanced specifications of values and goals far into the future, without losing any information.

  2. With sufficient investments, it will be feasible to develop AGI-based institutions that (with high probability) competently and faithfully pursue any such values until an external source stops them, or until the values in question imply that they should stop.

  3. If a large majority of the world’s economic and military powers agreed to set up such an institution, and gave it the power to defend itself against external threats, that institution could pursue its agenda for at least millions of years (and perhaps for trillions).

Note that we’re mostly making claims about feasibility as opposed to likelihood. Whether people would want to do something like this is discussed only briefly, in section 2.2.

(Relatedly, even though the possibility of stability implies claim E in the list above, there could still be a strong tendency towards worlds described by one of the other claims A-D. In practice, we think D seems unlikely, but that you could make reasonable arguments that any of the end-points described by A, B, or C are probable.)

Why are we interested in this set of claims? There are a few different reasons:

  • The possibility of stable institutions could pose an existential risk, if they implemented poorly chosen and insufficiently flexible values.

  • On the other hand, if we want humane values or institutions such as liberal democracy to survive in the long-run, some types of stability may be crucial for preserving them.

  • The possibility of ultra-stable institutions pursuing any of a wide variety of values, and the seeming generality of the methods that underlie them, suggest that significant influence over the long-run future is possible. This should inspire careful reflection on how to make it as good as possible.

We will now go over assertions 1, 2, and 3 from above in more detail.

0.1 Preserving information

In the beginning of human civilization, the only way of preserving information was to pass it down from generation to generation, with inevitable corruption along the way. The invention of writing significantly boosted civilizational memory, but writing has relatively low bandwidth. By contrast, the invention of AGI would enable the preservation of entire minds. With whole-brain emulation (WBE), we could preserve entire human minds, and ask them what they would think about future choices. Even without WBE, we could preserve newly designed AGI minds that would give (mostly) unambiguous judgments of novel situations. (See section 4.1.)

Such systems could encode information about a wide variety of goals and values, for example:

  • Ensure that future civilizational decisions are made democratically.

  • Enforce a ban on certain weapons of mass destruction (WMD).

  • Make sure that reverence is paid to some particular religion.

  • Always do what some particular group of humans would have wanted.

Crucially, using digital error correction, it would be extremely unlikely that errors would be introduced even across millions or billions of years. (See section 4.2.) Furthermore, values could be stored redundantly across many different locations, so that no local accident could destroy them. Wiping them all out would require either (i) a worldwide catastrophe, or (ii) intentional action. (See section 4.3.)
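To illustrate the kind of mechanism this relies on, here is a minimal Python sketch of redundant storage with periodic repair. It is a toy under stated assumptions, not the scheme analyzed in the main document: the noise model (independent bit flips), the per-byte majority vote, the choice of seven copies, and all function names are made up for illustration. Real systems would likely use far more efficient error-correcting codes than plain repetition.

```python
import hashlib
import random
from collections import Counter


def corrupt(data: bytes, flip_prob: float, rng: random.Random) -> bytes:
    """Flip each bit independently with probability flip_prob (toy noise model)."""
    out = bytearray(data)
    for i in range(len(out)):
        for bit in range(8):
            if rng.random() < flip_prob:
                out[i] ^= 1 << bit
    return bytes(out)


def repair(copies: list[bytes]) -> bytes:
    """Rebuild the record by taking a per-byte majority vote across all copies."""
    return bytes(Counter(column).most_common(1)[0][0] for column in zip(*copies))


def scrub(copies: list[bytes], flip_prob: float, rng: random.Random) -> list[bytes]:
    """One maintenance cycle: every copy decays a little, then all copies
    are rewritten from the majority-vote reconstruction."""
    decayed = [corrupt(c, flip_prob, rng) for c in copies]
    restored = repair(decayed)
    return [restored] * len(copies)


rng = random.Random(0)  # fixed seed so the run is repeatable
values = b"Ensure that future civilizational decisions are made democratically."
digest = hashlib.sha256(values).hexdigest()  # reference checksum of the original

copies = [values] * 7  # hypothetical: seven storage sites
for _ in range(200):   # hypothetical: 200 maintenance cycles
    copies = scrub(copies, flip_prob=1e-4, rng=rng)

# Compare the surviving record against the original checksum.
intact = hashlib.sha256(copies[0]).hexdigest() == digest
print("record intact after 200 cycles:", intact)
```

The design point is that errors are repaired before they can accumulate: in this toy model, a byte is lost only if a majority of copies are corrupted at the same position within a single cycle, which becomes rapidly less likely as copies are added or as cycles are made more frequent.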

0.2 Executing intentions

So let’s say that we can store nuanced sets of values. Would it be possible to design an institution that stays motivated to act according to those values?

Today, tasks can only be delegated to humans, whose goals and desires often differ from the goals of the delegator. With AGI, all tasks necessary for an institution’s survival could instead be automated, performed by artificial minds instead of biological humans. We will discuss the following two questions:

  • Will it be possible to construct AGI systems that (with high probability) are aligned with the intended values?

  • Will such systems stay aligned even over long periods of time?

0.2.1 Aligning AGI

Currently, humanity knows less about how to predict and control the behavior of advanced AI systems than about predicting and controlling the behavior of humans. The problem of how to control the behaviors and intentions of AI is commonly known as the alignment problem, and we do not yet have a solution to it.

However, there are reasons why delegating tasks to AGI could eventually be far more robust than relying on (biological) humans:

  • With sufficient understanding of how to induce particular goals, AI systems could be designed to more single-mindedly optimize for the intended goal, whereas most humans will always have some other desires, e.g. survival, status, or sexuality.

  • AI behavior can be thoroughly tested in numerous simulated situations, including high-stakes situations designed to elicit problematic behavior.

  • AI systems could be designed for interpretability, perhaps allowing developers and supervisors to directly read their thoughts, and to directly understand how they would behave in a wide class of scenarios.

Thus, we suspect that an adequate solution to AI alignment could be achieved given sufficient time and effort. (Though whether that will actually happen is a different question, not addressed since our focus is on feasibility rather than likelihood.)

Note also that if we don’t make substantial progress on the alignment problem, but still keep building more AI systems that are more capable and more numerous, this could eventually lead to permanent human disempowerment. In other words, if this particular step of the argument doesn’t go through, the alternative is probably not a business-as-usual human world (without the possibility of stable institutions), but instead a future where misaligned AI systems are ruling the world.

(For more, see section 5.)

0.2.2 Preventing drift

As mentioned in section 0.1, digital error correction could be used to losslessly preserve the information content of values. But this doesn’t entirely remove the possibility of value-drift.

In order to pursue goals, AGI systems need to learn many facts about the world and update their heuristics of how to deal with new challenges and local contexts. Perhaps it will be possible to design AGI systems with goals that are cleanly separated from the rest of their cognition (e.g. as an explicit utility function), such that learning new facts and heuristics doesn’t change the systems’ values. But the one example of general intelligence we have — humans — instead seem to store their values as a distributed combination of many heuristics, intuitions, and patterns of thought. If the same is true for AGI, it is hard to be confident that new experiences would not occasionally cause their values to shift.

Thus, although it’s not clear how much of a concern this will be, we will discuss how an institution might prevent drift even if individual AI systems sometimes changed their goals. Possible options include:

  • Whenever there’s uncertainty about what to do in a novel situation, or a high-stakes decision needs to be made, the institution could boot up a completely reset version of an AI system (or a brain emulation) that acts according to the original values.

    • This system will have had no previous chance of value-drift, and so only needs to be informed about anything that is a prerequisite for judging the situation.

    • In order to reduce contingency from how these prerequisites are learned, the institution could bring back multiple copies and inform them in different ways, let some of the copies opine on how to inform the others, and then have them all discuss what the right option is.

  • AI systems designed to execute particular tasks could be motivated to do whatever the more thorough process would recommend. They could be extremely well-tested on the types of situations that most frequently come up while performing that task.

    • For any tasks that didn’t require high context over a long period of time, they could be frequently reset back to a well-tested state.

    • If the task did require a larger amount of context over a longer period of time, they could be supervised and frequently re-tested by other AI systems with less context. These may not be able to correctly identify the value of the supervisee’s every action, but they could prevent the supervisee from performing any catastrophic actions. (Especially with access to transparency tools that allow for effective mind-reading.)

  • Value drift that is effectively random could be eliminated by having a large number of AI systems with slightly different backgrounds each make an independent judgment about what the right decision is, and then taking the majority vote.
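The majority-vote option can be made concrete with a short calculation. The sketch below is a simplification under stated assumptions: each of n systems independently reaches the wrong judgment with the same probability p on a binary decision, and n is odd so that ties cannot occur. The function name and the example numbers are hypothetical. Note that this only helps against drift that is effectively random; if systems drift in a correlated way, the independence assumption fails and the calculation no longer applies.

```python
from math import comb


def majority_error(n: int, p: float) -> float:
    """Probability that more than half of n independent voters are wrong,
    when each voter is wrong with probability p (binomial tail sum)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Hypothetical example: each system has a 10% chance of a drifted judgment.
for n in (1, 3, 11, 101):
    print(n, majority_error(n, 0.1))
```

Under these assumptions, the chance that the majority is wrong falls off quickly as more independent systems are consulted, as long as each one is individually more likely to be right than wrong.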

Some of these options might reveal inputs where AI systems systematically behave badly, or where it’s not clear if they’re behaving well or badly. For example, they might:

  • endorse options that less-informed versions of themselves disagree strongly with,

  • have irresolvable disagreements with AI systems which have somewhat different previous experiences,

  • exhibit thought-patterns (detected with transparency tools) that show doubt about the institution’s original principles.

In most cases, the reason for the discrepancy could probably be identified, and the AI design could be modified to act as desired. But it’s worth noting that even in situations where it remains unclear what the desired behavior is, or in situations where it’s somehow difficult to design a system that responds in the desired way, a sufficiently conservative institution could simply opt to prevent AI systems from being exposed to inputs like that (picking some sub-optimal but non-catastrophic resolution to any dilemmas that can’t be properly considered without those inputs).

  • An extreme version of this would be to prevent all reasoning that could plausibly lead to value-drift, halting progress in philosophy.

    • It doesn’t seem impossible that all philosophically ambitious institutions would eventually converge on some very similar set of behaviors, from a very wide range of starting points. This might be the case if some form of moral realism holds, or perhaps if something like Evidential Cooperation in Large Worlds works. If this were the case, our claim that it’s feasible to stabilize many different value-systems would be false for philosophically ambitious institutions. It would only apply to institutions that refused to conduct some philosophical investigations. (Which we hope wouldn’t be very common.)

  • A further extreme would be for the institution to also halt technological progress and societal progress in general (insofar as it had the power to do that) to avoid any situation where the original values can’t give an unambiguous judgment.

    • This would largely eliminate the issue motivating this subsection — that continual learning could lead to value drift — since complete stagnation wouldn’t require much in the way of continual learning.

    • But depending on when technological progress was halted, this could limit the institution’s ability to survive in other ways, e.g. by preventing it from leaving Earth before its doom.

Given all these options, it seems more likely than not that an institution could practically eliminate any internal sources of drift that it wanted to. (For more, see section 6.)

0.3 Preventing disruption

So let’s say that it will remain mostly-unambiguous what an institution is supposed to do, in any given situation, and furthermore that the institution will keep being motivated to act that way.

Now, let’s consider a situation where this institution — at least temporarily — has uncontested military and economic dominance (let’s call this a “dominant institution”). Let’s also say that the institution’s goals include a consequentialist drive to maintain that dominance (at least instrumentally). Could the institution do this? On our best guess, the answer would be “yes” (with exceptions for encountering alien civilizations, and for the eventual end of usable resources).

Any resources, information, and agents necessary for the institution’s survival could be copied and stored redundantly across the Earth (and, eventually, other planets). Thus, in order to prevent the institution from rebuilding, an event would need to be global in scope.

As we argue in section 7, natural events of civilization-threatening magnitude are rare, and the main mechanism by which they could pose a global threat to human civilization is by throwing up enough dust to blot out the Sun for a few years. A well-prepared AI civilization could easily survive such events by having energy sources that don’t depend on the Sun. In a few billion years, the expansion of the Sun will prevent further life on Earth, but a technologically sophisticated stable institution could avoid destruction by spreading to space.

As we argue in section 8, a dominant institution could also prevent other intelligent actors from disrupting the institution. Uncontested economic dominance would allow the institution to manufacture and control loyal AGI systems that far outnumber any humans or non-loyal AI systems. Thus, insofar as any other actors could pose a threat, it would be economically cheap to surveil them as much as necessary to suppress that possibility. In practice, this could plausibly just involve enough surveillance to:

  • prevent others from building weapons of mass destruction,

  • prevent others from building a competitive institution of similar economic or military strength, and

  • prevent others from leaving the institution’s domain by colonizing uninhabited parts of space.

The main exception to this is alien civilizations, which could at first contact already be more powerful than the Earth-originating institution.

Ultimately, the main boundaries to a stable, dominant institution would be (i) alien civilizations, (ii) the eventual end of accessible resources predicted by the second law of thermodynamics, and (iii) any disruptive Universe-wide physical events (such as a Big Rip scenario), although to our knowledge no such events are predicted by standard cosmology.

0.4 Some things we don’t argue for

To be clear, here are two things that we don’t argue for:

First, we don’t think that the future is necessarily very contingent, from where we stand today. For example, it might be the case that almost no humans would make an ultra-stable institution that pursues a goal that those humans themselves couldn’t later change (if they changed their mind). And it might be the case that most humans would eventually end up with fairly similar ideas about what is good to do, after thinking about it for a sufficiently long time.

Second, we don’t think that extreme stability (of the sort that could make the future contingent on early events) would necessarily require a lot of dedicated effort. The options for increasing stability that we sketch in sections 0.2.2 and 6, and the assumption of a singleton-like entity in sections 0.3 and 8, are brought up to make the point that stability is feasible at least in those circumstances. It seems plausible that they wouldn’t be necessary in practice. Perhaps stability will only require a smaller amount of effort. Perhaps the world’s values would stabilize by default given the (not very unlikely) combination of:

  • technological maturity (preventing new technologies from shaking things up),

  • human immortality (reducing drift from generational changes),

  • the ability to cheaply and stably align AGI systems with any goal, and

  • such AI systems being equally good at pursuing instrumental goals regardless of what terminal goals they have. (Thereby mostly eliminating the tendency for some values to outcompete others, cf. decoupling deliberation from competition.)

0.5 Structure of the document

Readers should feel free to skip to whatever parts they’re interested in. (See also the table of contents.)


Lukas Finnveden was the lead author. Some parts of this document started as an unfinished report prepared by Jess Riedel while he was an employee at Open Philanthropy. Carl Shulman contributed many of the ideas, and both Jess and Carl provided multiple rounds of comments. Lukas did most of the work while he was part of the Research Scholars Programme at the Future of Humanity Institute (although at the time of publishing, he works for Open Philanthropy). All views are our own.