AI alignment with humans… but with which humans?
Caveat: This post probably raises a naive question; I assume there’s at least a 70% chance it’s been considered (if not answered) exhaustively elsewhere already; please provide links if so. I’ve studied evolutionary psych & human nature for 30 years, but am a relative newbie to AI safety research. Anyway....
When AI alignment researchers talk about ‘alignment’, they often seem to have a mental model where either (1) there’s a single relevant human user whose latent preferences the AI system should become aligned with (e.g. a self-driving car with a single passenger), or (2) the AI system should be aligned with all 7.8 billion humans, so that it doesn’t impose global catastrophic risks. In those relatively simple cases, I could imagine various current alignment strategies, such as cooperative inverse reinforcement learning (CIRL), being useful, or at least pointing in a useful direction.
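To make the single-user case concrete, here is a deliberately toy sketch of the reward-inference idea at the core of approaches like CIRL. This is purely illustrative: the smart-home actions, reward hypotheses, and rationality parameter are invented, and real CIRL is an interactive two-player game rather than a one-shot inference step. The point is just that with one user, the system can maintain a posterior over what that user wants and update it from observed choices.

```python
# Toy sketch (illustrative only): Bayesian inference over which reward function
# a single user has, updated from the user's observed choices. The action names,
# reward hypotheses, and rationality parameter below are all made up.

import numpy as np

ACTIONS = ["warm_house", "cool_house"]

# Two candidate reward functions the single user might have.
REWARD_HYPOTHESES = {
    "prefers_warm": {"warm_house": 1.0, "cool_house": 0.0},
    "prefers_cool": {"warm_house": 0.0, "cool_house": 1.0},
}

def choice_likelihood(action, reward, beta=3.0):
    """Probability that a noisily-rational (Boltzmann) user picks `action`."""
    utilities = np.array([reward[a] for a in ACTIONS])
    probs = np.exp(beta * utilities) / np.exp(beta * utilities).sum()
    return probs[ACTIONS.index(action)]

def update_posterior(prior, observed_action):
    """One Bayesian update of P(reward hypothesis | observed choice)."""
    posterior = {h: prior[h] * choice_likelihood(observed_action, r)
                 for h, r in REWARD_HYPOTHESES.items()}
    total = sum(posterior.values())
    return {h: p / total for h, p in posterior.items()}

posterior = {"prefers_warm": 0.5, "prefers_cool": 0.5}
for observed in ["warm_house", "warm_house", "cool_house"]:
    posterior = update_posterior(posterior, observed)
print(posterior)  # posterior shifts toward 'prefers_warm'
```

With one user, the update can simply track that user’s behavior; the trouble starts when several people’s choices feed the same update.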
However, there are large numbers of intermediate-level cases where an AI system that serves multiple humans would need to become aligned with diverse groups of users or subsets of humanity. And within each such group, the humans will have partly-overlapping but partly-conflicting interests.
Example 1: a smart home/domestic robot AI might be serving a family consisting of a mom, a dad, an impulsive teenage kid, a curious toddler, and an elder grandparent with Alzheimer’s. Among these five humans, whose preferences should the AI try to align with? It can’t please all of them all the time. They may have genuinely diverging interests and incommensurate preferences. So it may find itself in much the same position as a traditional human domestic servant (maid, nanny, butler) trying to navigate the household’s minefield of conflicting interests, hidden agendas, family dramas, seething resentments, and so on. Such challenges, of course, provide much of the entertainment value and psychological complexity of TV series such as ‘Downton Abbey’, or the P.G. Wodehouse ‘Jeeves’ novels.
Example 2: a tactical advice AI might be serving a US military platoon deployed near hostile forces, providing information-aggregation and battlefield-simulation services. The platoon includes a lieutenant commanding 3-4 squads, each with a sergeant commanding 6-10 soldiers. The battlefield also includes a few hundred enemy soldiers, and a few thousand civilians. Which humans should this AI be aligned with? The Pentagon procurement office might have intended for the AI to maximize the likelihood of ‘victory’ while minimizing ‘avoidable casualties’. But the Pentagon isn’t there to do the cooperative inverse reinforcement learning (or whatever preference-alignment tech the AI uses) with the platoon. The battlefield AI may be doing its CIRL in interaction with the commanding lieutenant and their sergeants—who may be somewhat aligned with each other in their interests (achieve victory, avoid death), but who may be quite misaligned with each other in their specific military career agendas, family situations, and risk preferences. The ordinary soldiers have their own agendas. And they are all constrained, in principle, by various rules of engagement and international treaties regarding enemy combatants and civilians—whose interests may or may not be represented in the AI’s alignment strategy.
Examples 3 through N could include AIs serving various roles in traffic management, corporate public relations, political speech-writing, forensic tax accounting, factory farm inspections, crypto exchanges, news aggregation, or any other situation where groups of humans affected by the AI’s behavior have highly divergent interests and constituencies.
The behavioral and social sciences focus on these ubiquitous conflicts of interest and diverse preferences and agendas that characterize human life. This is the central stuff of political science, economics, sociology, psychology, anthropology, and media/propaganda studies. I think that to most behavioral scientists, the idea that an AI system could become aligned simultaneously with multiple diverse users, in complex nested hierarchies of power, status, wealth, and influence, would seem highly dubious.
Likewise, in evolutionary biology, and its allied disciplines such as evolutionary psychology, evolutionary anthropology, Darwinian medicine, etc., we use ‘mid-level theories’ such as kin selection theory, sexual selection theory, multi-level selection theory, etc., to describe the partly-overlapping, partly-divergent interests of different genes, individuals, groups, and species. Given these conflicts of interest, aligning AI with ‘humans in general’ would seem impossible.
In both the behavioral sciences and the evolutionary sciences, the best insights into animal and human behavior, motivations, preferences, and values often involve some game-theoretic modeling of conflicting interests. And ever since von Neumann and Morgenstern (1944), it’s been clear that when strategic games include lots of agents with different agendas, payoffs, risk profiles, and choice sets, and they can self-assemble into different groups, factions, tribes, and parties with shifting allegiances, the game-theoretic modeling gets very complicated very quickly. Probably too complicated for a CIRL system, however cleverly constructed, to handle.
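To give a rough sense of the combinatorics, here is a back-of-the-envelope illustration (my own, not a result from that literature): even just counting the ways n agents could partition themselves into coalitions, before assigning any payoffs or modeling shifting allegiances, the numbers explode.

```python
# Small illustration of scale: the number of ways n agents can partition into
# disjoint coalitions is the Bell number B(n), computed here via the standard
# recurrence B(n) = sum_{k<n} C(n-1, k) * B(k).

from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def bell(n):
    """Count of partitions of a set of n agents into coalitions."""
    if n == 0:
        return 1
    return sum(comb(n - 1, k) * bell(k) for k in range(n))

for n in [5, 10, 20, 30]:
    print(n, bell(n))
# 5 -> 52; 10 -> 115,975; 20 -> ~5.2e13; 30 -> ~8.5e23 possible coalition
# structures, before any payoffs, risk profiles, or shifting allegiances
# are even modeled.
```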
So, I’m left wondering what AI safety researchers are really talking about when they talk about ‘alignment’. Alignment with whoever bought the AI? Whoever uses it most often? Whoever might be most positively or negatively affected by its behavior? Whoever the AI’s company’s legal team says would impose the highest litigation risk?
I don’t have any answers to these questions, but I’d value your thoughts, and links to any previous work that addresses this issue.
Aligned with whom? by Anton Korinek and Avital Balwit (2022) has a possible answer. They write that an aligned AI system should have
direct alignment with its operator, and
social alignment with society at large.
Some examples of failures in direct and social alignment are provided in Why we need a new agency to regulate advanced artificial intelligence: Lessons on AI control from the Facebook Files (Korinek, 2021).
We could expand the moral circle further by aligning AI with the interests of both human and non-human animals. Direct, social and sentient alignment?
As you mentioned, these alignments present conflicting interests that need mediation and resolution.
The closest thing that comes to mind is Critch’s work on multi-user alignment, e.g. What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs).
Here are a couple of other links that come to mind:
https://arxiv.org/abs/2008.02275
https://www.brookings.edu/research/aligned-with-whom-direct-and-social-goals-for-ai-systems/
I agree with Miller’s response to mic (from 6 months ago). Is it even possible for us to stop avoiding the “hard problem” of human nature?
Also, any given agent’s or interest group’s priorities and agendas will always be dynamic, compounding the problem of maintaining multiple mutually satisfactory alignments. Natural selection has designed us to exhibit complex, contingent responsiveness to both subtle and dramatic environmental circumstances.
In addition, the humans providing the feedback, even if THEY can find sustainable alignment amongst themselves (remember they are all reproductive competitors, and deeply programmed by natural selection to act accordingly, intentionally or not), will change over time, possibly a very short time, by being exposed to such power. They will be corrupted by that power, and in due time corrupted absolutely.
Finally, an important Darwinian sub-theory has to do with the problem of nonconscious self-deception. I have to wonder whether even our discussing the possibility of properly managing substantive AI systems is just a way of obscuring from ourselves the utter impossibility, given human nature, of managing them properly, morally, wisely? Are all these conversations a way (a kind of competition) to convince ourselves and others that we or our (perceived) allies deserve the power to program and maintain these systems?
Linyphia—totally agree (unsurprisingly!).
You raise good additional points about the dynamism and unpredictability of human values and preferences. Some of that unpredictability may reflect adaptive unpredictability (what biologists call ‘protean behavior’) that makes it harder for evolutionary enemies and rivals to predict what one’s going to do next. I discuss this issue extensively in this 1997 chapter and this 1996 simulation study. Insofar as human values are somewhat adaptively unpredictable by design, for good functional reasons, it will be very hard for reinforcement learning systems to get a good ‘fix’ on our preferences.
The other issues of adaptive self-deception (e.g. virtue signaling, as discussed in my 2019 book on the topic) about our values, and about AI power corrupting humans, also deserve much more attention in AI alignment work, IMHO.
I think AI alignment isn’t really about designing AI to maximize for the preference satisfaction of a certain set of humans. I think an aligned AI would look more like an AI which:
is not trying to cause an existential catastrophe or take control of humanity
has had undesirable behavior trained out or adversarially filtered
learns from human feedback about what behavior is more or less preferable
In this case, we would hope the AI would be aligned to the people who are allowed to provide feedback
has goals which are corrigible
is honest, non-deceptive, and non-power-seeking
Hi mic,
I understand that’s how ‘alignment’ is normally defined in AI safety research.
But it seems like such a narrow notion of alignment that it glosses over almost all of the really hard problems in real AI safety—which concern the very real conflicts between the humans who will be using AI.
For example, if the AI is aligned ‘to the people who are allowed to provide feedback’ (e.g. the feedback to a CIRL system), that raises the question of who is actually going to be allowed to provide feedback. For most real-world applications, deciding that issue is tantamount to deciding which humans will be in control of that real-world domain, and it may leave the AI looking very ‘unaligned’ to all the other humans involved.
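To make that concrete, here is a toy sketch in which the same crude preference-learning step yields different ‘aligned’ behavior depending on which humans are admitted to the feedback pool. The people, options, and comparison counts are all invented for illustration.

```python
# Toy sketch (invented data): the same preference-learning step, run on feedback
# from different subsets of people, yields different learned preferences.

from collections import Counter

# Pairwise feedback: (preferred_option, rejected_option)
feedback_by_person = {
    "lieutenant": [("aggressive_advance", "hold_position")] * 3,
    "sergeant":   [("hold_position", "aggressive_advance")] * 2,
    "civilians":  [("hold_position", "aggressive_advance")] * 10,
}

def learned_preference(feedback_pool):
    """Crude 'reward model': score options by how often they win comparisons."""
    wins = Counter()
    for person in feedback_pool:
        for preferred, rejected in feedback_by_person[person]:
            wins[preferred] += 1
            wins[rejected] += 0  # ensure the losing option appears in the counter
    return wins.most_common()

print(learned_preference(["lieutenant"]))                           # advance wins
print(learned_preference(["lieutenant", "sergeant"]))               # advance still wins
print(learned_preference(["lieutenant", "sergeant", "civilians"]))  # hold wins
```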
I very much agree that these political questions matter, and that alignment to multiple humans is conceptually pretty shaky; thanks for bringing up these issues. Still, I think some important context is that many AI safety researchers think that it’s a hard, unsolved problem to just keep future powerful AI systems from causing many deaths (or doing other unambiguously terrible things). They’re often worried that CIRL and every other approach that’s been proposed will completely fail. From that perspective, it no longer looks like almost all of the really hard problems are about conflicts between humans.
(On CIRL, here’s a thread and a longer writeup on why some think that “it almost entirely fails to address the core problems” of AI safety.)
I agree, that seems concerning. Ultimately, since the AI developers are designing the AIs, I would guess that they would try to align the AI to be helpful to the users/consumers or to the concerns of the company/government, if they succeed at aligning the AI at all. As for your suggestions “Alignment with whoever bought the AI? Whoever uses it most often? Whoever might be most positively or negatively affected by its behavior? Whoever the AI’s company’s legal team says would impose the highest litigation risk?” – these all seem plausible to me.
On the separate question of handling conflicting interests: there’s some work on this (e.g., “Aligning with Heterogeneous Preferences for Kidney Exchange” and “Aligning AI with Human Norms through Multi-Objective Reinforced Active Learning”), though perhaps not as much as we would like.
This is a great post.
Law is the best solution I can think to address the issues you raise.
Here https://forum.effectivealtruism.org/posts/9YLbtehKLT4ByLvos/agi-misalignment-x-risk-may-be-lower-due-to-an-overlooked I argue that law-informed AI is likely the best path forward for societal alignment
Here https://forum.effectivealtruism.org/posts/4ykDJA57wstYWq9HK/intent-alignment-should-not-be-the-goal-for-agi-x-risk I explore the difference between intent alignment and societal alignment.
Thanks; I appreciate the feedback, and thanks for sharing these links.
I agree that AI alignment with actual humans & groups needs to take law much more seriously as a legacy system for trying to manage ‘misalignments’ amongst actual humans and groups. New legal concepts may need to be invented—but AI alignment shouldn’t find itself in the hubristic position of trying to reinvent law from scratch.
I think AI alignment can draw from existing law to a large degree. New legal concepts may be needed, but I think there is a lot of legal reasoning, legal concepts, legal methods, etc. that are directly applicable now (discussed in more detail here https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4218031).
Also, I think we should keep the involvement of AI in law-making (broadly defined) as limited as we can. And we should train AI to understand when there is sufficient legal uncertainty that a human is needed to resolve the correct action to be taken.
It does seem like alignment researchers often focus on the case of aligning AI to a single human. Here are some views that might explain this. I think these views are at least somewhat common among alignment researchers.
Aligning with a single human contains most of the difficulty of the problem of aligning with groups of humans. Once we figure out how to align AI with a single human, figuring out how to align it with groups of humans will be relatively easy. We should focus on the hard part first, which is aligning AI with a single human. (edit: I am not saying that aligning with a single human is harder than aligning with groups of humans. See also my comment below.)
If AI is aligned with a single random human, this is still much better than unaligned AI. Therefore this kind of research is very valuable.
If the AI acts according to the CEV of a single random human, then the results will probably be good for humanity as a whole.
harfe—I’m not convinced that aligning with a single human is much harder than aligning with a group of humans that have diverse and partly conflicting interests. Single human preferences can already be learned pretty well by product recommendation engines, but group preferences are much more complicated.
We already know from game theory that there is no way, even in principle, for a single agent (such as an AI) to represent and enact the collective interests of a group that doesn’t actually have internally aligned collective interests. The only exceptions to this are edge cases like pure coordination games (e.g. which side of the road to drive on, left or right).
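One classic version of this point is a Condorcet-style preference cycle. Here is a toy illustration (an invented example, loosely based on the smart-home family from the original post): with three people and three options, majority preferences can cycle, so there is no single ranking the AI could adopt that matches ‘what the group prefers’.

```python
# Toy example (invented): majority preferences over three options can cycle,
# so no single ranking represents 'the group's preference'.

from itertools import combinations

# Each person's ranking, best to worst.
rankings = {
    "mom":      ["quiet_evening", "family_movie", "teen_party"],
    "dad":      ["family_movie", "teen_party", "quiet_evening"],
    "teenager": ["teen_party", "quiet_evening", "family_movie"],
}

def majority_prefers(a, b):
    """True if a strict majority ranks option a above option b."""
    votes = sum(r.index(a) < r.index(b) for r in rankings.values())
    return votes > len(rankings) / 2

options = rankings["mom"]
for a, b in combinations(options, 2):
    winner, loser = (a, b) if majority_prefers(a, b) else (b, a)
    print(f"majority prefers {winner} over {loser}")
# Output shows a cycle: quiet_evening > family_movie > teen_party > quiet_evening,
# so no option is a stable 'group favourite' for the AI to optimize for.
```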
My concern is that if we think we’ve solved the single-human alignment problem, we’ll be tempted to scale AI systems up to try to reflect the general preferences of human groups (or humanity in general), but this will simply not be possible, given that groups, and even the human species itself, do not actually have collectively aligned interests. (Even the principle of ‘don’t let AI drive us extinct’ won’t seem aligned with the agendas of the Voluntary Human Extinction movement, the anti-natalists, the Earth First eco-activists, the religious extremists expecting imminent messiahs or Raptures, or the depressed nihilists.)
And, if group alignment isn’t possible, we’ll end up in a situation where whichever subgroup has the most direct control over AI design, training, and feedback will end up being basically in control of everybody else.
I think the original CEV paper from 2003 addresses (or at least discusses) a lot of these concerns. Basically, the thing that a group attempting to build an aligned AI should try to align it with is the collective CEV of humanity, not any individual humans.
On anti-natalism, religious extremism, voluntary extinction, etc. - if those values end up being stable under reflection, faster and more coherent thinking, and don’t end up dominated by other values of the people who hold them, then the Future may indeed include things which satisfy or maximize those values.
(Though those values, and the people that hold them don’t necessarily get more say than people who believe the opposite. If some interests and values are truly irreconcilable, a compromise might look like dividing up chunks of the lightcone.)
Of course, the first group who attempts to build a super-intelligence might try to align it with something else—their own personal CEV (which may or may not have a component for the collective CEV of humanity), or some kind of equal or unequal split between the individual CEVs of every human, or every sentient, etc. or something else entirely.
This would be inadvisable for various reasons discussed in the paper, and I agree it is a real danger / problem. (Mostly though, I think anyone who tries to build any kind of CEV sovereign right now just fails, and we end up with tiny molecular squiggles.)
Max—thanks for the reply. I’m familiar with the CEV concept. But I don’t see how it helps solve any of the hard problems in alignment that involve any conflicts of interest between human individuals, groups, corporations, or nation-states. It just sweeps all of those conflicts of interest under the rug.
In reality, corporations and nation-states won’t be building AIs to embody the collective CEV of humanity. They will build AIs to embody the profit-making or geopolitical interests of the builders.
We can preach at them that their AIs should embody humanity’s collective CEV. But they would get no comparative advantage from doing so. It wouldn’t help promote their group profit or power. It would be a purely altruistic act. So, given the current state of for-profit corporate governance, and for-power nation-state governance, that seems very unlikely.
Yep. I think in my ideal world, there would be exactly one operationally adequate organization permitted to build AGI. Membership in that organization would require a credible pledge to altruism and a test of oath-keeping ability.
Monopoly power of this organization to build AGI would be enforced by a global majority of nation states, with monitoring and deterrence against defection.
I think a stable equilibrium of that kind is possible in principle, though obviously we’re pretty far away from it being anywhere near the Overton Window. (For good reason—it’s a scary idea, and probably ends up looking pretty dystopian when implemented by existing Earth governments. Alas! Sometimes draconian measures really are necessary; reality is not always nice.)
In the absence of such a radically different global political order we might have to take our chances on the hope that the decision-makers at OpenAI, Deepmind, Anthropic, etc. will all be reasonably nice and altruistic, and not power / profit-seeking. Not great!
There might be worlds in between the most radical one sketched above and our current trajectory, but I worry that any “half measures” end up being ineffective and costly and worse than nothing, mirroring many countries’ approach to COVID lockdowns.
I did not claim that aligning with a single human is harder than aligning with a group of humans (nor have I claimed that others believe that). I have probably expressed myself poorly, if that was the impression after reading my comment. In fact, I believe the opposite!
Let me make another attempt at explaining.
A: Figuring out how to align an AGI with a single human.
B: Figuring out how to align an AGI with a group of humans.
C: Doing B after you have completed A.
Then, for the difficulties of these, I currently believe
all three of A, B, C are hard
B is harder than A
B is harder than C
A is much harder than C (this was what I was trying to state in the comment above)
A reasonable strategy for doing B would be to do A, and then do C (I am not super confident here, and things might be much more complex)
If you do both A and C, it is better to first focus on A (and put more resources into it), because A is harder than C.
I would be curious what other people think. My current guess is that at least some alignment researchers believe these points (or some of them) too. I do not recall hearing opposing viewpoints.
I do not believe that, for example, the author of the PreDCA alignment proposal wishes that the values of a random human are imposed (via AGI) on the rest of humanity, even though PreDCA (currently) is a protocol that aligns AGI with a single human (called “user”).
Hi harfe, thanks for this helpful clarification.
I’d agree that A, B, and C seem hard; that B is harder than A, and that B is harder than C.
Where we disagree is that I suspect that C is harder than A, for basic game-theoretic reasons I mentioned in the original post.
I’m also not confident that C is a whole lot easier than B—I’m not sure that alignment with individual humans will actually give us all that much help in doing alignment with complicated groups of humans.
But, I need to think further about this, and do some more readings!
You claimed that your starting question was naive, so allow me to respond with similar naivete:
If AIs become smart enough to perform behaviors that we consider potentially threatening or beyond our control, aren’t they really artificial life?
As such, and imbued with consciousness equal to or greater than our own, don’t we consider them to have rights or legal protections or freedoms?
If they did, they would also be subject to legal restrictions on their behavior, much as human beings are. However, given such additional freedoms, those legal restrictions would be inadequate to punish their behavior. As a consequence, we face an ethical challenge in how we integrate such life into our society. We don’t want to be unfair, but we prefer them as servants rather than as equals or superiors.
The continued development of AI could reflect techno-utopianism or technological determinism (marketing and memes), leading everyone into a condition in which the actual motives of the people paying for it all are poorly thought out and short-term, while the larger vision looks more attractive than it really is.