Clarifying two uses of “alignment”

Paul Christiano once clarified AI alignment as follows:

When I say an AI A is aligned with an operator H, I mean:

A is trying to do what H wants it to do.

This definition is clear enough for many purposes, but it becomes a source of confusion when one wants to distinguish between two different types of alignment:

  1. A is trying to do what H wants it to do because A is trading or cooperating with H toward an outcome that benefits both of them. For example, H could hire A to perform a task and offer a wage as compensation.

  2. A is trying to do what H wants it to do because A has the same values as H — i.e. its “utility function” overlaps with H’s utility function — and thus A intrinsically wants to pursue H’s goals.

These cases are important to distinguish because they have dramatically different consequences for the difficulty and scope of alignment.

To solve alignment in the sense of (1), A and H don’t necessarily need to share the same values in any strong sense. Instead, the essential prerequisite seems to be that A and H operate in an environment in which it’s mutually beneficial for them to enter into contracts, trade, or cooperate in some respect.

For example, one can imagine a human hiring a paperclip-maximizing AI to perform work in exchange for a wage, which the AI could then spend on more paperclips. In this example, the AI performs its duties satisfactorily, no major negative side effects result from the two parties’ differing values, and both parties are made better off as a result.
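To make the arithmetic of this kind of exchange concrete, here is a toy calculation. All the numbers (dollar figures, the AI’s utility per paperclip) are invented assumptions chosen only to show that both surpluses can be positive even when the two parties want entirely different things:

```python
# Toy gains-from-trade calculation with made-up numbers: a human hires a
# paperclip maximizer for a wage, and both come out ahead despite having
# completely disjoint goals.

value_of_task_to_human = 100     # how much the human values the completed task ($), assumed
wage_paid_to_ai = 60             # wage the human pays the AI ($), assumed
paperclip_price = 0.10           # market price per paperclip ($), assumed
ai_utility_per_paperclip = 1.0   # the AI's utility per paperclip, assumed

# Human's surplus: value received from the task minus the wage paid.
human_surplus = value_of_task_to_human - wage_paid_to_ai

# AI's surplus: utility from the paperclips its wage can buy
# (assuming it values nothing else and its labor cost is negligible here).
ai_surplus = (wage_paid_to_ai / paperclip_price) * ai_utility_per_paperclip

print(f"Human gains {human_surplus} units of value; AI gains {ai_surplus} utility.")
# Both surpluses are positive, so the trade is mutually beneficial even though
# the parties share no values.
```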

By contrast, alignment in the sense of (2) seems far more challenging to solve. In the most challenging case, this form of alignment would require solving extremal Goodhart, in the sense that A’s utility function would need to be almost perfectly matched with H’s utility function. Here, the idea is that even slight differences in values yield very large differences in outcomes when subject to extreme optimization pressure. Because it is presumably easy to make slight mistakes when engineering AI systems, these mistakes could translate into catastrophic losses of value.
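To illustrate the worry, here is a small simulation sketch under simplified, invented assumptions: the AI optimizes a proxy utility equal to the human’s true utility plus a small heavy-tailed error, and we vary how hard it optimizes. This is meant only to gesture at the Goodhart dynamic, not to model any real system:

```python
import numpy as np

# Toy model: A maximizes a proxy = H's true utility + a small heavy-tailed error.
# We compare the true value A obtains against the best achievable true value,
# as the number of options searched over (the "optimization pressure") grows.

rng = np.random.default_rng(0)

def compare(n_options: int, n_trials: int = 200) -> tuple[float, float]:
    """Return (avg true value of the proxy's favorite option, avg best achievable true value)."""
    proxy_pick, true_best = 0.0, 0.0
    for _ in range(n_trials):
        true_values = rng.normal(size=n_options)              # what H actually cares about
        error = 0.1 * rng.standard_cauchy(size=n_options)     # slight, heavy-tailed misspecification
        proxy_values = true_values + error                    # what A actually maximizes
        proxy_pick += true_values[np.argmax(proxy_values)]
        true_best += true_values.max()
    return proxy_pick / n_trials, true_best / n_trials

for n in (10, 1_000, 100_000):
    achieved, ideal = compare(n)
    print(f"{n:>7} options: proxy optimization achieves {achieved:5.2f} vs. {ideal:5.2f} achievable")
# As optimization pressure increases, the proxy's favorite option is increasingly
# the one with the largest error rather than the largest true value, so the gap
# between achieved and achievable true value widens.
```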

Effect on alignment difficulty

My impression is that people’s opinions about AI alignment difficulty often come down to differences in how much they think we need to solve the second problem relative to the first, in order to get AI systems that generate net-positive value for humans.

If you’re inclined towards thinking that trade and compromise are either impossible or inefficient between agents at greatly different levels of intelligence, then you might think that we need to solve the second problem, since “trading with the AIs” won’t be an option. My understanding is that this is Eliezer Yudkowsky’s view, and the view of most others who are relatively doomy about AI. In this frame, a common thought is that AIs would have no need to trade with humans, as humans would be like ants to them.

On the other hand, you could be inclined — as I am — towards thinking that agents at greatly different levels of intelligence can still find positive-sum compromises when they are socially integrated with each other, operating under a system of law, and capable of making mutual agreements. In this case, you might be a lot more optimistic about the prospects of alignment.

To sketch one plausible scenario here, if AIs can own property and earn income by selling their labor on an open market, then they can simply work a job and use their income to purchase whatever it is they want, without any need to violently “take over the world” to satisfy their goals. At the same time, humans could retain power in this system through capital ownership and other grandfathered legal privileges, such as government welfare. Since humans may start out with lots of capital, these legal rights would provide a comfortable retirement for us.

In this scenario, AIs would respect the legal rights of humans for both cultural and pragmatic reasons. Culturally, AIs would inherit our norms, legal traditions, and social conventions. It would be unpopular to expropriate human wealth just as it’s now unpopular to expropriate the wealth of old people in our current world, even though in both cases the relevant entities are physically defenseless. Pragmatically, AIs would also recognize that stealing wealth from humans undermines the rule of law, which is something many AIs (as well as humans) would not like.

A threat to the rule of law is something many agents would likely coordinate to avoid, as it would erode the predictable and stable environment they rely on to make long-term plans and keep the peace. Furthermore, AIs would “get old” too, in the sense of becoming obsolete in the face of newer generations of improved AIs. This gives them a reason not to collectively expropriate the wealth of vulnerable older agents: they too will be in that vulnerable position one day, and would prefer not to establish a norm of expropriating the type of agent they may one day become.

If an individual AI’s relative skill level is extremely high, then this could simply translate into higher wages, obviating any need for it to take part in a violent coup to achieve its objectives. In other words, there’s really no strong incentive for AIs — even if they’re super powerful — to try to kill or steal to get what they want, since peaceful strategies could be equally effective, or even more effective, at accomplishing their aims. Power-seeking AIs could simply accumulate wealth lawfully instead, with no loss of value from their own unaligned perspective.

Indeed, attempting to take over the world is generally quite risky: the plan could fail, leaving you dead or subject to legal penalties in the aftermath. Even designing such a takeover plan is risky, since it may be exposed prematurely, and this becomes more likely the more allies you need to recruit to execute it successfully. Moreover, war is generally economically inefficient compared to trade, suggesting that for rational, well-informed agents in society, compromise is usually a better option than attempting a violent takeover.

These facts suggest that even if taking over the world is theoretically possible for some set of agents, the expected value of pursuing such a plan could be lower than the expected value of simply compromising with other agents in society on mutually beneficial terms. This conclusion becomes even stronger if, in fact, there is no way for a small set of agents to take over the entire world and impose their will on everyone else.
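To see the structure of this expected-value comparison, consider a back-of-the-envelope calculation in which every number is an invented assumption chosen purely for illustration:

```python
# Back-of-the-envelope comparison between attempting a takeover and lawfully
# accumulating wealth. Every number below is an illustrative assumption, not an
# estimate of any real-world probability or payoff.

p_success = 0.10                 # assumed probability a takeover attempt succeeds
payoff_takeover_success = 1000   # assumed value to the AI if the takeover succeeds
payoff_takeover_failure = -500   # assumed value if it fails (death, legal penalties, lost assets)
payoff_peaceful = 200            # assumed value from trading, earning wages, and compromising

ev_takeover = p_success * payoff_takeover_success + (1 - p_success) * payoff_takeover_failure
ev_peaceful = payoff_peaceful

print(f"EV(takeover) = {ev_takeover}, EV(peaceful strategy) = {ev_peaceful}")
# With these numbers, EV(takeover) = -350 while the peaceful strategy yields 200,
# so even an agent that would prefer to rule the world is better off compromising.
```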

My sketched scenario supports an optimistic assessment of alignment by lowering the bar for what counts as “aligned enough”. Some degree of cultural assimilation, social integration, and psychological orientation towards seeking compromise and avoiding violence may still be necessary for AIs to treat humans reasonably well. But under these assumptions, it is unnecessary for AIs to share our exact goals.

Of course, in this scenario, it would still be nice if AIs cared about exactly what we care about; but even if they don’t, we aren’t necessarily made worse off as a result of building them. If they do share our preferences, that would simply be a nice bonus for us. The future could still be bright for humans even if the universe is eventually filled with entities whose preferences we do not ultimately share.