A Pivotal Act AI might not buy a lot of time

Summary

A previous post argued that work on Alignment Target Analysis (ATA) needs to start now in order to reduce the probability of a bad alignment target getting successfully implemented. The present post is focused on one specific scenario where starting ATA work now would reduce the probability of disaster. Different types of time crunch can lead to situations where there will not be much time to do ATA later. We will make some optimistic assumptions in order to focus on one specific type of time crunch that remains despite these assumptions. We will assume a human augmentation project that leads to smarter humans. Then we assume the creation of a Limited AI (LAI) that removes all external time pressure from competing AI projects by uploading the augmented humans. Then we describe a scenario where value drift and internal power struggles lead to internal time pressure. One faction takes a calculated risk and successfully hits an alignment target, despite the fact that this alignment target has never been properly analysed.

The scenario in brief: after creating an LAI that uploads them, the latest and most advanced generation of augmented humans realise that they have undergone value drift. This gives them an incentive to act fast. The earlier generations might discover the value drift at any time and disempower them. Their only chance of being in charge of a Sovereign AI project is thus to launch before the value drift is discovered. In other words: unless the latest generation act quickly, they permanently lose their ability to influence numerous decisions along the lines of (i): extrapolation definitions, (ii): how to resolve disagreements amongst individuals that disagree on how to resolve disagreements, (iii): how to structure add-ons along the lines of a last judge off switch, etc.

We will also argue that augmented humans with an increased ability to hit alignment targets will not necessarily be good at analysing alignment targets (these are two very different types of skills). This means that an alignment target might get successfully implemented without ever being properly analysed. That would be dangerous, because even serious flaws in well known alignment targets can go undetected for a long time. The most recently published version of CEV is Parliamentarian CEV (PCEV). It turns out that PCEV gives a large amount of extra influence to humans that intrinsically value hurting other humans (search the CEV Arbital page for ADDED 2023 for Yudkowsky’s description of the issue). An AI Sovereign dominated by such people would be very dangerous. This issue went unnoticed for a long time.

What happened with PCEV shows that (i): ATA is difficult and risks from bad alignment targets getting successfully implemented are serious, and (ii): reducing these risks is a tractable research project (because risks can be reduced without finding any good alignment target: simply describing this feature of PCEV presumably removed most of the risk from scenarios where PCEV is successfully implemented). However, there does not exist a single research project dedicated to ATA. The post concludes by arguing that this neglect is a serious mistake.


Thanks to Chi Nguyen for giving great feedback on an earlier version of this post.

The scenario

Some AI proposals are based on the idea of building an AI that buys time. These proposed AIs are not supposed to reorganise the world. They are instead supposed to perform some Pivotal Act that removes time pressure from the designers, so that they can take their time while designing another AI. Let’s write Limited AI (LAI) for an AI that is designed to do something that reduces external time pressure. This section describes a scenario designed to show that an LAI that removes all external time pressure might not buy a lot of time, due to internal time pressure. The present post elaborates on a scenario that was briefly mentioned in an earlier post, where it was one part of a comprehensive argument for Alignment Target Analysis (ATA) being urgent (the previous post also covered topics such as AI assistants helping with ATA).

Now let’s make some very optimistic assumptions, so that we can focus on an issue that remains despite these assumptions. Consider a scenario where a group of augmented humans keep improving some augmentation method. Each new version of the augmentation method is only used on a quarter of the augments (as a precaution against side effects that are not noticed right away). Eventually these augments succeed in launching an LAI that uploads them and gives them unlimited time to work (permanently and completely removing the threat from all competing AI projects, without needing to interfere with those projects; in other words, completely removing all external time pressure). Let’s also assume that it is somehow common knowledge that the LAI will ensure that everyone stays calm / rational / sane / etc. indefinitely (with all related definitional issues known to be fully solved). The LAI also removes all risks related to failure to hit the alignment target being aimed at. If a majority of the augments vote yes, then the LAI will allow them to launch a Sovereign AI. A majority vote can also disenfranchise augments (so that side effects of the augmentation method can be dealt with).

At upload time there exists a popular Sovereign AI proposal that has been around for a while. No one has found any serious problems with it. The original plan was to continue augmenting after uploading, and to take a lot of time to analyse this proposal in depth before deciding what to do next. When the uploaded augments have had some time to reflect, the latest generation realises that the latest version of the augmentation method has resulted in value drift. The others could discover this at any time and disenfranchise the latest generation. If the augmentation project were to proceed, then the next generation would probably experience some difficult-to-predict form of value drift.

Even though all external time pressure has been fully removed, there still exists an internal time pressure: an internal group dynamic that generates an incentive to act quickly. In other words: the most advanced minds in existence now have an incentive to convince the others to move forwards as fast as possible. The rest of this section will argue that a group of value drifted augments might have a strong reason to maintain control (if they don’t act fast, they lose control to people with different values), and that they might succeed at convincing the others to move quickly (they are the most advanced minds around). The next section will argue that this is dangerous, and that ATA work done now can reduce this danger.

The original plan was that they would continue to augment themselves until they felt ready to launch a Sovereign AI. But if this plan is followed, the latest generation would lose influence over the many definitions and other decisions involved in any Sovereign AI project. These value drifted augments would for example be unable to ensure the protection of things that only they care about (for example by designing the rules for something along the lines of a last judge off switch). Thus, the latest generation has an incentive to take a calculated risk, and launch a Sovereign AI as soon as possible. If they move forwards quickly, they know that they face the risk that the Sovereign AI proposal has a flaw that no one has noticed. But if they wait, they lose control to people with different values.

It is not certain that a group of augmented humans would take the risk of moving forwards in this situation. They could decide to come clean about what happened and hope that the outcome would still be acceptable to them. They would know that moving ahead quickly is risky. But if some specific alignment target is seen as reasonably likely to work as planned, then this risk might be viewed as smaller than the risks associated with losing control over implementation details. Especially if one effect of the value drift was to make them care intrinsically about some thing that no one else cares about.

To make this hypothetical more concrete, let’s say that they feel a strong aversion towards any scenario where anyone implements any technical plan, that has some specific set of features (for example features associated with a lack of robustness). When they reflect on their values, they realise that they have started to consider this to be intrinsically bad. Not just instrumentally bad. They want to prevent all scenarios where anyone follows any such plan, regardless of context (including in contexts where there is no harm in failing, and where trial and error would be fun).

(Readers that would like to make this situation more intuitive may try a perspective shift. First imagine some specific thing that is important to you. Now imagine that only you and a small group of other people care about this thing. You are part of a project that plans to build an AI that will get its goal from billions of less advanced minds that do not consider this specific thing to be relevant at all (it might for example work to imagine a project that will build an AI that does what billions of chimpanzees want the AI to do). Perhaps this perspective shift can help illustrate why the latest generation might be prepared to take a calculated risk, rather than giving up control to people with different values (from the perspective of a value drifted augment generation, ordinary humans might be the ones that are uncaring in a dangerous way).)

If they do decide to push for a quick launch, they might turn out to be very good at making this happen. They are the result of an augmentation method that selected for a set of abilities that are useful for getting things done. Even if it was not selected for on purpose, persuasion ability will presumably have been de facto selected for (for example as a side effect of selecting for the ability to come up with important novel ideas: it is not easy to distinguish between (i): the ability to explain why an important novel idea is in fact important, and (ii): general persuasion ability). So they might be very capable of getting their Sovereign AI project implemented. While they are not able to achieve arbitrary ends with fully general manipulation techniques, they might be in a good position to convince the others to move forwards with a pre-existing Sovereign AI proposal.

One obvious option is to base their persuasion campaign on the idea that waiting is dangerous. They might for example say that interactions with the LAI that runs their upload environment come with manipulation risks (which might very well be a genuine problem). They might even point out that every new generation comes with a serious danger: value drift might produce a powerful mind with alien values, a mind that might be able to hide this value drift (which in this thought experiment is definitely a real problem). More generally: when a novel type of powerful mind (that is the most advanced mind around) has a strong incentive to get something implemented fast, it is difficult to rule out scenarios where they succeed.

Why this would be dangerous

The danger is that the alignment target in question might have a hidden flaw. One might think that since these augmented humans were smart enough to successfully implement the LAI, then they must be smart enough to notice any flaw in the Sovereign AI proposal (at least if the flaw is humanly findable). In other words: one might think that if these augments are good at hitting alignment targets, then they must also be good at Alignment Target Analysis (ATA).

It is possible that improving the ability to hit an alignment target will dramatically improve the ability to notice flaws in proposed alignment targets. In other words: it is possible that these augmented humans will be so good at ATA, that they will have no use for ATA work done by non-augmented humans. But this is by no means guaranteed. Hitting an alignment target and analysing an alignment target are two very different types of skills. (Putting a man on the moon is one type of skill. Predicting what effects such a project would have on society is a very different type of skill). It is also possible that there are tradeoffs (so that selecting for one type of skill selects against the other type of skill).

One way of thinking might be very useful for designing, from scratch, a technical plan that will actually work. But that same way of thinking might be counterproductive when trying to find unexamined implicit assumptions in an existing alignment target proposal. One way of doing things is to build systems incrementally from scratch (steadily building towards a known target behaviour by incrementally adding well understood components). An alternative way of doing things is to sketch out lots of complete proposals and then check them for flaws. It could be that minds for whom the former strategy intuitively sounds like the way things should be done are well suited for hitting alignment targets, while minds for whom the latter strategy intuitively sounds like the way things should be done are well suited for noticing flaws in existing alignment target proposals. In this case, selecting for the ability to hit alignment targets selects against the ability to do ATA (because it selects for minds that favour the former way of doing things, and thus against minds that favour the latter).

More generally: some features of a mind might be good for one ability, but bad for the other ability. If that is the case, then selecting for an ability to hit an alignment target might select against an ability to do ATA.

As a separate issue: even if the augmentation method does turn out to increase the ability to do ATA, this might not be enough to make any of them better than the best baseline humans. The best out of a small population of augmented humans might still not be as good at ATA as the best out of billions of baseline humans. Finally, even if they end up better than any baseline human at ATA under ideal conditions, this does not automatically result in de facto better performance. If they find themselves under pressure, they might never actually perform as well as the best baseline humans would perform (if those baseline humans are focused on doing ATA).

More generally: there is no particular reason to think that the augments in the above scenario would be able to make significant ATA progress in the time that they have available. This means that the scenario might lead to an alignment target getting implemented despite having a flaw that could have been caught by non-augmented humans doing ATA (in other words: doing ATA now reduces the probability of disaster).

As shown in a previous post, successfully hitting a bad alignment target can be very dangerous. In brief: the most recently published version of CEV is Parliamentarian CEV (PCEV). It turns out that PCEV gives a very large advantage to individuals that intrinsically value hurting other individuals. Those that want to inflict a lot of harm get a bigger advantage than those that want to inflict less harm. The largest possible advantage is given to groups that want the AI to hurt everyone else as much as possible. The fact that PCEV would be dominated by this type of people means that a successfully implemented PCEV would be massively worse than extinction.

This issue went undetected for many years, despite PCEV being a fairly prominent proposal (PCEV is the version of CEV that is on the CEV Arbital page). So, even if the alignment target that the augments in the above scenario decide to aim at has been around for a while, it might still suffer from an undetected flaw: a flaw that could have been detected by baseline humans doing ATA. If that flaw is detected in time, the latest augment generation might accept the loss of control rather than rush things (at least if the flaw is serious enough). But if the flaw is not detected in time, they might instead take the calculated risk of moving ahead.

One might wonder why this post describes a specific scenario in such detail (given that every detail makes the scenario less likely). The main reason is that without the details, some readers might conclude that the described situation would not actually lead to an alignment target getting successfully implemented without being properly analysed. To conclude that it is safe to stay at the current level of ATA progress, one has to be confident that one has predicted and prevented every scenario like this (every scenario that leads to an alignment target getting successfully implemented without being properly analysed). Saying that the current level of ATA progress is safe is equivalent to saying that no scenario like the one above exists. Thus, outlining one such scenario is a refutation of this safety claim. It is however possible to come up with any number of specific scenarios. To conclude that our current level of ATA progress is safe, one would first have to describe all of these paths, and then reliably prevent all of them.

In other words: there exists a more general problem that this specific scenario is meant to illustrate. To conclude that it is safe to stay at the current level of ATA progress, one would need to deal with this more general problem. Basically: there exists a large number of hard-to-predict paths that end in an alignment target getting successfully implemented, even though it suffers from a realistically findable flaw. Combined with the fact that risk mitigation has been shown to be tractable, it seems like a mistake to act based on the assumption that these risks do not need to be mitigated.

Conclusion

A previous post outlined a comprehensive case for Alignment Target Analysis (ATA) being urgent. The present post elaborated on one specific scenario from that post: a Limited AI (LAI) removes external time pressure from competing AI projects, but still fails to buy a lot of time for ATA. The scenario illustrated a general problem: internal time pressure. Power struggles amongst those who end up in charge of an LAI might lead to an alignment target getting successfully implemented without ever being properly understood (because someone takes a calculated risk). The field of ATA is still at a very early stage, and there does not exist a single research project dedicated to ATA. So if an LAI leads to a situation with internal time pressure, then there is no reason to think that the field will have advanced much from its current state. While the post focused on one specific set of circumstances, internal time pressure is a general problem.

Let’s briefly look at another scenario. Consider an LAI that is instead under the control of a large population of ordinary humans. Let’s say that a two thirds majority in a referendum is needed for the LAI to permit the launch of a Sovereign AI. A specific alignment target currently has majority support. However, a minority of people with different values continues to grow every year (due to ordinary political dynamics). The large but shrinking majority might now decide to launch their favoured type of AI Sovereign before they lose the ability to do so. (In this scenario, the shrinking majority and the growing minority favour different alignment targets due to well known value differences. In other words: in this scenario, the time crunch arises for reasons unrelated to things such as hidden value changes and the wish to influence implementation details. But the basic dynamic is the same: there is an incentive to take a calculated risk and act decisively, before losing control to people with different values). See also section 3 of this comment.
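The timing pressure in this second scenario can be made concrete with a toy calculation. All numbers here, and the linear-decline model itself, are purely hypothetical assumptions chosen for illustration, not anything claimed by the scenario:

```python
# Toy model (hypothetical numbers): a majority's support for launching a
# Sovereign AI shrinks at a constant rate, and a referendum requires a
# two thirds supermajority. How long until the launch window closes?
THRESHOLD = 2 / 3

def years_until_window_closes(support: float, decline_per_year: float) -> int:
    """Count the full years during which support still exceeds the two
    thirds threshold, assuming a deliberately crude linear decline."""
    years = 0
    while support > THRESHOLD:
        years += 1
        support -= decline_per_year
    return years

# E.g. 70% support shrinking by one percentage point per year:
print(years_until_window_closes(0.70, 0.01))  # → 4
```

The point of the sketch is only that everyone involved can do this arithmetic: a majority that can see its own window closing has an incentive to act well before the crossing point, which is the internal time pressure the scenario describes.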

ATA as a risk mitigation tool is tractable, because ATA does not need to result in a good alignment target in order to mitigate risks. Besides noticing problems with specific classes of proposals, one potential risk mitigation tool is to identify features that are necessary. A necessary feature can reduce risks even if it is far from sufficient. Even if it is not always clear whether or not a given proposal can be reasonably described as having the feature in question, identifying it as necessary can still be useful. Because this makes it possible to rule out those proposals that are clearly not describable as having the feature. The role that such a feature can play was discussed in a previous post (in the context of Membrane formalisms).

It seems like there exists a wide range of reasons for why many people believe that it is safe to stay at our current level of ATA progress. Previous posts have discussed specific such reasons related to Corrigibility, the last judge idea, and other LAI proposals. If anyone has a reason for believing that staying at our current level of ATA progress is safe (that is not covered by the above posts), then it would be greatly appreciated if those reasons were to be described somewhere. Finally: I’m trying to understand people who act based on the assumption that it is safe to stay at our current level of ATA progress (where it is clearly not possible to reliably tell a good alignment target from an alignment target that implies an outcome massively worse than extinction). Please don’t hesitate to contact me if you have any theories, observations, or questions related to this.

(I am also posting this on LessWrong)