My terminology would be that (2) is “ambitious value learning” and (1) is “misaligned AI that cooperates with humans because it views cooperating-with-humans to be in its own strategic / selfish best interest”.
I strongly vote against calling (1) “aligned”. If you think we can have a good future by ensuring that it is always in the strategic / selfish best interest of AIs to be nice to humans, then I happen to disagree but it’s a perfectly reasonable position to be arguing, and if you used the word “misaligned” for those AIs (e.g. if you say “alignment is unnecessary”), I think it would be viewed as a helpful and clarifying way to describe your position, and not as a reductio or concession.
For my part, I define “alignment” as “the AI is trying to do things that the AGI designer had intended for it to be trying to do, as an end in itself and not just as a means-to-an-end towards some different goal that it really cares about.” (And if the AI is not the kind of thing for which the words “trying” and “cares about” are applicable in the first place, then the AI is neither aligned nor misaligned, and also I’d claim it’s not an x-risk in any case.) More caveats in a thing I wrote here:
Some researchers think that the “correct” design intentions (for an AGI’s motivation) are obvious, and define the word “alignment” accordingly. Three common examples are (1) “I am designing the AGI so that, at any given point in time, it’s trying to do what its human supervisor wants it to be trying to do”—this AGI would be “aligned” to the supervisor’s intentions. (2) “I am designing the AGI so that it shares the values of its human supervisor”—this AGI would be “aligned” to the supervisor. (3) “I am designing the AGI so that it shares the collective values of humanity”—this AGI would be “aligned” to humanity.
I’m avoiding this approach because I think that the “correct” intended AGI motivation is still an open question. For example, maybe it will be possible to build an AGI that really just wants to do a specific, predetermined, narrow task (e.g. design a better solar cell), in a way that doesn’t involve taking over the world etc. Such an AGI would not be “aligned” to anything in particular, except for the original design intention. But I still want to use the term “aligned” when talking about such an AGI.
For my part, I define “alignment” as “the AI is trying to do things that the AGI designer had intended for it to be trying to do, as an end in itself and not just as a means-to-an-end towards some different goal that it really cares about.”
This is a reasonable definition, but it’s important to note that under this definition of alignment, humans are routinely misaligned with each other. In almost any interaction I have with strangers—for example, when buying a meal at a restaurant—we are performing acts for each other because of mutually beneficial trade rather than because we share each other’s values.
That is, humans are largely misaligned with each other. And yet the world does not devolve into a state of violence and war as a result (at least most of the time), even in the presence of large differences in power between people. This has epistemic implications for whether a world filled with AIs would similarly be peaceful, even if those AIs are misaligned by this definition.
Humans are less than maximally aligned with each other (e.g. we care less about the welfare of a random stranger than about our own welfare), and humans are also less than maximally misaligned with each other (e.g. most people don’t feel a sadistic desire for random strangers to suffer). I hope that everyone can agree about both those obvious things.
That still leaves the question of where we are on the vast spectrum in between those two extremes. But I think your claim “humans are largely misaligned with each other” is not meaningful enough to argue about. What percentage is “largely”, and how do we even measure that?
Anyway, I am concerned that future AIs will be more misaligned with random humans than random humans are with each other, and that this difference will have important bad consequences, and I also think there are other disanalogies / reasons-for-concern as well. But this is supposed to be a post about terminology so maybe we shouldn’t get into that kind of stuff here.
The difference is that a superintelligence or even an AGI is not human, and they will likely need very different environments from us to truly thrive. Ask factory-farmed animals, or basically any other kind of nonhuman animal, whether our world is in a state of violence or war… As soon as strong power differentials and diverging needs show up, the value co-creation narrative starts to lose its magic. It works great for humans, but it doesn’t really work with other species that are not already very close to us and aligned with us. Dogs and cats have arguably fared quite well, but only at the price of becoming strongly adapted to OUR needs and desires.
In the end, if you don’t have anything valuable to offer, there is not much more you can do besides hoping for, or ideally ensuring, value alignment in the strict sense. Your scenario may work well for some time, but it’s not a long-term solution.
Animals are not socially integrated into our society, and we do not share a common legal system or culture with them. We did not inherit legal traditions from them. Nor can we agree to mutual contracts with them, or coordinate with them in a meaningful way. These differences seem sufficient to explain why we treat them very differently, as you described.
If this difference in treatment were solely due to differences in power, you’d need to explain why vulnerable humans, such as old retired folks, or small nations, are not regularly expropriated.
I have never said that how we treat nonhuman animals is “solely” due to differences in power. The point I have made is that AIs are not humans, and I have tried to illustrate that differences between species tend to matter in culture and social systems.
But we don’t even have to go to species differences: ethnic differences are already enough to create quite a bit of friction in our societies (e.g., racism, caste systems, etc.). Why don’t we all engage in mutually beneficial trade and cooperate to live happily ever after?
Because while we have mostly converging needs in a biological sense, we have different values and beliefs. It still roughly works out in the grand scheme of things because cultural checks and balances have evolved in environments where we had strongly overlapping values and interests. So most humans have comparable degrees of power, or are kept in check by those checks and balances. That was basically our societal process of getting to value alignment, but as you can probably tell by looking at the news, this process has not yet reached a satisfactory quality. We have come far, but it’s still a shit show out there. The powerful take what they can get, and often only give a sh*t to the degree that they actually feel consequences from it.
So, my point is that your “loose” definition of value alignment is an illusion if you are talking about super-powerful actors that have divergent needs and don’t share your values. They will play along as long as it suits them, but will stop doing so as soon as an alternative that is better aligned with their needs and values becomes more convenient. And the key point here is that AIs are not humans, and they have very different needs from us. If they become much more powerful than us, only their values can keep them in check in the long run.