Wei Dai comments on AI alignment shouldn’t be conflated with AI moral achievement

Wei Dai 30 Dec 2023 16:41 UTC
34 points
5 ∶ 1

To be sure, ensuring AI development proceeds ethically is a valuable aim, but I claim this goal is *not *the same thing as “AI alignment”, in the sense of getting AIs to try to do what people want.

There was at least one early definition of “AI alignment” to mean something much broader:

The “alignment problem for advanced agents” or “AI alignment” is the overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.

I’ve argued that we should keep using this broader definition, in part for historical reasons, and in part so that AI labs (and others, such as EAs) can more easily keep in mind that their ethical obligations/opportunities go beyond making sure that AI does what people want. But it seems that I’ve lost that argument so it’s good to periodically remind people to think more broadly about their obligations/opportunities. (You don’t say this explicitly, but I’m guessing it’s part of your aim in writing this post?)

(Recently I’ve been using “AI safety” and “AI x-safety” interchangeably when I want to refer to the “overarching” project of making the AI transition go well, but I’m open to being convinced that we should come up with another term for this.)

That said, I think I’m less worried than you about “selfishness” in particular and more worried about moral/philosophical/strategic errors in general. The way most people form their morality is scary to me, and personally I would push humanity to be more philosophically competent/inclined before pushing it to be less selfish.
What links here?
- Matthew_Barnett 30 Dec 2023 23:46 UTC
  15 points
  1 ∶ 0
  Parent
  There was at least one early definition of “AI alignment” to mean something much broader:
  I agree. I have two main things to say about this point:
  - My thesis is mainly empirical. I think, as a matter of verifiable fact, that if people solve the technical problems of AI alignment, they will use AIs to maximize their own economic consumption, rather than pursue broad utilitarian goals like “maximize the amount of pleasure in the universe”. My thesis is independent of whatever we choose to call “AI alignment”.
  - Separately, I think the war over the semantic battle seems to be trending against those on “your side”. The major AI labs seem to use the word “aligned” to mean something closer to “the AI does what users want (and also respects moral norms, and doesn’t output harmful content etc.)” rather than “the AI produces positive outcomes in the world morally, even if this isn’t what the user wants”. Personally, the word “alignment” also just seems to conjure an image of the AI trying to do what you want, rather than fighting you if you decide to do something bad or selfish.
  That said, I think I’m less worried than you about “selfishness” in particular and more worried about moral/philosophical/strategic errors in general.
  There is a lot I could say about this topic, but I’ll just say a few brief things here. In general I think the degree to which moral reasoning determines the course of human history is frequently exaggerated. I think mundane economic forces are simply much more impactful. Indeed, I’d argue that much of what we consider human morality is simply a byproduct of social coordination mechanisms that we use to get along with each other, rather than the result of deep philosophical reflection.
  At the very least, mundane economic forces seem to have been more impactful historically compared to philosophical reasoning. I probably expect the future of society to resemble the past more strongly than you do?
  - Wei Dai 31 Dec 2023 0:45 UTC
    16 points
    1 ∶ 0
    Parent
    
    I think, as a matter of verifiable fact, that if people solve the technical problems of AI alignment, they will use AIs to maximize their own economic consumption, rather than pursue broad utilitarian goals like “maximize the amount of pleasure in the universe”.
    
    If you extrapolate this out to after technological maturity, say 1 million years from now, what does selfish “economic consumption” look like? I tend to think that people’s selfish desires will be fairly easily satiated once everyone is much much richer and the more “scalable” “moral” values would dominate resource consumption at that point, but it might just be my imagination failing me.
    
    I think mundane economic forces are simply much more impactful.
    
    Why does “mundane economic forces” cause resources to be consumed towards selfish ends? I think economic forces select for agents who want to and are good at accumulating resources, but will probably leave quite a bit of freedom in how those resources are ultimately used once the current cosmic/technological gold rush is over. It’s also possible that our future civilization uses up much of the cosmic endowment through wasteful competition, leaving little or nothing to consume in the end. Is that’s your main concern?
    
    (By “wasteful competition” I mean things like military conflict, costly signaling, races of various kinds that accumulate a lot of unnecessary risks/costs. It seems possible that you categorize these under “selfishness” whereas I see them more as “strategic errors”.)
    What links here?
    Long-term risks from ideological fanaticism by David_Althaus (12 Feb 2026 23:25 UTC; 197 points)
    Long-term risks from ideological fanaticism by David Althaus (LessWrong; 12 Feb 2026 23:26 UTC; 99 points)
    - Matthew_Barnett 31 Dec 2023 1:40 UTC
      8 points
      0 ∶ 1
      Parent
      Why does “mundane economic forces” cause resources to be consumed towards selfish ends?
      Because most economic agents are essentially selfish. I think this is currently true, as a matter of empirical fact. People spend the vast majority of their income on themselves, their family, and friends, rather than using their resources to pursue utilitarian/altruistic ideals.
      I think the behavioral preferences of actual economic consumers, who are not mostly interested in changing their preferences via philosophical reflection, will more strongly shape the future than other types of preferences. Right now that means human consumers determine what is produced in our economy. In the future, AIs themselves could become economic consumers, but in this post I’m mainly talking about humans as consumers.
      I tend to think that people’s selfish desires will be fairly easily satiated once everyone is much much richer and the more “scalable” “moral” values would dominate resource consumption at that point, but it might just be my imagination failing me.
      I think it’s currently very unclear whether selfish preferences can be meaningfully “satiated”. Current humans are much richer than their ancestors, and yet I don’t think it’s obvious that we are more altruistic than our ancestors, at least when measured by things like the fraction of our income spent on charity. (But this is a complicated debate, and I don’t mean to say that it’s settled.)
      It’s also possible that our future civilization uses up much of the cosmic endowment through wasteful competition, leaving little or nothing to consume in the end. Is that’s your main concern?
      This seems unlikely to me, but it’s possible. I don’t think it’s my main concern. My guess is that we still likely fundamentally disagree on something like “how much will the future resemble the past?”.
      On this particular question, I’d point out that historically, competition hasn’t resulted in the destruction of nearly all resources, leaving little to nothing to consume in the end. In fact, insofar as it’s reasonable to talk about “competition” as a single thing, competition in the past may have increased total consumption on net, rather than decreased it, by spurring innovation to create more efficient ways of creating economic value.
- Steven Byrnes 31 Dec 2023 1:01 UTC
  7 points
  0 ∶ 0
  Parent
  (Recently I’ve been using “AI safety” and “AI x-safety” interchangeably when I want to refer to the “overarching” project of making the AI transition go well, but I’m open to being convinced that we should come up with another term for this.)
  I’ve been using the term “Safe And Beneficial AGI” (or more casually, “awesome post-AGI utopia”) as the overarching “go well” project, and “AGI safety” as the part where we try to make AGIs that don’t accidentally [i.e. accidentally from the human supervisors’ / programmers’ perspective] kill everyone, and (following common usage according to OP) “Alignment” for “The AGI is trying to do things that the AGI designer had intended for it to be trying to do”.
  (I didn’t make up the term “Safe and Beneficial AGI”. I think I got it from Future of Life Institute. Maybe they in turn got it from somewhere else, I dunno.)
  (See also: my post Safety ≠ alignment (but they’re close!))
  See also a thing I wrote here:
  Some researchers think that the “correct” design intentions (for an AGI’s motivation) are obvious, and define the word “alignment” accordingly. Three common examples are (1) “I am designing the AGI so that, at any given point in time, it’s trying to do what its human supervisor wants it to be trying to do”—this AGI would be “aligned” to the supervisor’s intentions. (2) “I am designing the AGI so that it shares the values of its human supervisor”—this AGI would be “aligned” to the supervisor. (3) “I am designing the AGI so that it shares the collective values of humanity”—this AGI would be “aligned” to humanity.
  I’m avoiding this approach because I think that the “correct” intended AGI motivation is still an open question. For example, maybe it will be possible to build an AGI that really just wants to do a specific, predetermined, narrow task (e.g. design a better solar cell), in a way that doesn’t involve taking over the world etc. Such an AGI would not be “aligned” to anything in particular, except for the original design intention. But I still want to use the term “aligned” when talking about such an AGI.
  Of course, sometimes I want to talk about (1,2,3) above, but I would use different terms for that purpose, e.g. (1) “the Paul Christiano version of corrigibility”, (2) “ambitious value learning”, and (3) “CEV”.