AI alignment shouldn’t be conflated with AI moral achievement
In this post I want to make a simple point that I think has big implications.
I sometimes hear EAs talk about how we need to align AIs to “human values”, or that we need to make sure AIs are benevolent. To be sure, ensuring AI development proceeds ethically is a valuable aim, but I claim this goal is not the same thing as “AI alignment”, in the sense of getting AIs to try to do what people want.
My central contention here is that if we succeed at figuring out how to make AIs pursue our intended goals, these AIs will likely be used to maximize the economic consumption of existing humans at the time of alignment. And most economic consumption is aimed at satisfying selfish desires, rather than what we’d normally consider our altruistic moral ideals.
It’s important to note that my thesis here is not merely a semantic dispute about what is meant by “AI alignment”. Instead, it is an empirical prediction about how people will actually try to use AIs in practice. I claim that people will likely try to use AIs mostly to maximize their own economic consumption, rather than to pursue ideal moral values.
Critically, only a small part of human economic consumption appears to be aimed at what impartial consequentialism would recommend, such as the goal of filling the universe with numerous happy beings who live amazing lives.
Let me explain.
Consider how people currently spend their income. Below I have reproduced a plot from the blog Engaging Data, which is based on 2019 data from the Bureau of Labor Statistics. It represents a snapshot of how the median American household spends their income.
Most of their money is spent on the type of mundane consumption categories you’d expect: housing, utilities, vehicles, and so on. It is very likely that the majority of this spending is meant to provide personal consumption for members of the household, or perhaps other family and friends, rather than strangers. Near the bottom of the chart, we find that only 3.1% of this spending goes to what we’d normally consider altruism: voluntary gifts and charity.
To be clear, this plot is not a comprehensive assessment of the altruism of the median American household. Nor is moral judgement my intention here. Instead, my intention is to emphasize the brute fact that when people are given wealth, they primarily spend it on themselves, their family, or their friends, rather than on pursuing benevolent moral ideals.
This fact is important because, to a first approximation, aligning AIs with humans will simply have the effect of greatly multiplying the wealth of existing humans — i.e. the total amount of resources that humans have available to spend on whatever they wish. And there is little reason to think that if humans become extraordinarily wealthy, they will follow idealized moral values. To see why, just look at how people behave today, despite being many times richer than their ancestors were centuries ago. All that extra wealth did not turn us into moral saints; instead, we still mostly care about ourselves, our family, and our friends.
Why does this fact make any difference? Consider the prescription of classical utilitarianism to maximize population size. If given the choice, humans would likely not spend their wealth to pursue this goal. That’s because humans care far more about their own per capita consumption than about global aggregate utility. When humans increase population size, it is usually a byproduct of their desire to have a family, rather than the result of some broader utilitarian moral calculation.
Here’s another example. When given the choice to colonize the universe, future humans will likely want a rate of return on their investment, rather than merely deriving satisfaction from the fact that humanity’s cosmic endowment is being used well. In other words, we will likely send out the von Neumann probes as part of a scheme to benefit ourselves, not out of some benevolent duty to fill the universe with happy beings.
Now, I’m not saying selfishness is automatically bad. Indeed, when channeled appropriately, selfishness serves the purpose of making people happy. After all, if everyone is rich and spends money on themselves, that’s not obviously worse than a situation in which everyone is rich and spends their money on each other.
But, importantly, humans are not the only moral patients who will exist.
Consider that the vast majority of humans are happy to eat meat, even if many of them privately confess that they don’t like causing animal suffering. To most people, the significant selfish costs of giving up meat simply outweigh the large non-selfish benefits of reducing animal suffering. And so, most people don’t give up meat.
The general pattern here is that, while most humans are not evil, whenever there’s a non-trivial conflict between selfish preferences and altruistic preferences, our selfish preferences usually trump the altruistic ones, even at the cost of great amounts of suffering. This pattern seems likely to persist into the future.
Of course, the fact that humans are primarily selfish doesn’t mean that humans will necessarily cause a ton of suffering in the future because — unlike with current meat consumption — it might one day become feasible to mitigate suffering without incurring substantial selfish costs.
At the same time, it’s critically important to avoid wishful thinking.
The mere possibility that in the future there might exist no tradeoff between suffering and economic consumption does not imply that this will be the case. It remains plausible that humans in the future, equipped with aligned AIs, will produce vast amounts of suffering in the service of individual human preferences, just as our current society produces lots of animal suffering to satisfy current human wants. If so, the moral value of AI alignment is uncertain, and potentially net-negative.
As just one example of how things could go badly even if we solve AI alignment, it may turn out that enabling AIs to suffer enhances their productivity or increases the efficiency of AI training. In this case there would be a direct non-trivial tradeoff between the satisfaction of individual human preferences and the achievement of broad utilitarian ideals. I consider this scenario at least somewhat likely.
Ultimately I don’t think we should talk about AIs being aligned with some abstract notion of “human values” or AIs being aligned with “humanity as a whole”. In reality, we will likely try to align AIs with various individual people, who have primarily selfish motives. Aligned AIs are best thought of as servants who follow our personal wishes, whatever those wishes may be, rather than idealized moral saints who act on behalf of humanity, or all sentient life.
This does not mean that aligned AIs won’t follow moral constraints or human moral norms. Aligned AIs may indeed respect various constraints, including obeying the law. But following moral norms is not the same thing as being a moral saint: selfish people already have strong incentives to obey the law purely out of fear of punishment.
Crucially, the moral norms that aligned AIs follow will be shaped by the preferences of actual humans, or of society in general, rather than by lofty altruistic ideals. If AIs obey our moral norms, that does not imply they will be benevolent, any more than current laws prevent people from eating meat.
Can’t we just build benevolent AIs instead of AI servants that fulfill our selfish desires? Well, we could do that. But people would not want to purchase such AIs. When someone hires a worker, they generally want the worker to do work for them, not for others. A worker who worked for humanity as a whole, or for all sentient life, would be much less likely to be hired than one who works directly for their employer and does what the employer wants. The same principle will likely apply to AIs.
To make my point clearer, we can distinguish two things that might be meant by “human values”. The concept can refer either to a broad moral ideal, or to the preferences of actual individual humans.
In the first case, there will likely be little economic incentive to align AIs to human values, and thus aligning AIs to human values does not appear to be a realistic end-goal. In the second case, human values refer to the preferences of primarily selfish individual people, and satisfying these preferences is not identical to the achievement of broad, impartial moral goals.
Of course, it might still be very good to solve AI alignment. Unaligned AIs might have preferences we’d find even worse than the preferences of currently-living individual humans, especially from our own, selfish perspective. Yet my point is merely that the achievement of AI alignment is not the same as the achievement of large-scale, altruistic moral objectives. The two concepts are logically and empirically separate, and there is no necessary connection between them.