harfe—I’m not convinced that aligning with a single human is much harder than aligning with a group of humans that have diverse and partly conflicting interests. Single human preferences can already be learned pretty well by product recommendation engines, but group preferences are much more complicated.
We already know from game theory that there is no way, even in principle, for a single agent (such as an AI) to represent and enact the collective interests of a group that doesn’t actually have internally aligned collective interests. The only exceptions to this are edge cases like pure coordination games (e.g. which side of the road to drive on, left or right).
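One classic formal illustration of this point (my own toy example, not something from the original discussion) is Condorcet’s paradox: with just three voters holding cyclic preferences, pairwise majority vote produces an intransitive “group preference” that no single ranking, and hence no single agent acting on the group’s behalf, can coherently represent. A minimal Python sketch:

```python
from itertools import permutations

# Three voters with cyclic preferences over options A, B, C
# (each tuple lists one voter's ranking from most to least preferred).
voters = [
    ("A", "B", "C"),
    ("B", "C", "A"),
    ("C", "A", "B"),
]

def majority_prefers(x, y):
    """Return True if a strict majority of voters rank x above y."""
    wins = sum(1 for ranking in voters if ranking.index(x) < ranking.index(y))
    return wins > len(voters) / 2

# Pairwise majority vote yields a cycle: A beats B, B beats C, C beats A.
for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(f"majority prefers {x} over {y}: {majority_prefers(x, y)}")

# Consequently, no single ranking of the options agrees with every pairwise
# majority verdict, so there is no coherent "group preference" for an agent
# to represent and optimize.
consistent = [
    order for order in permutations("ABC")
    if all(majority_prefers(order[i], order[j])
           for i in range(3) for j in range(i + 1, 3))
]
print("rankings consistent with all majority verdicts:", consistent)  # prints []
```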
My concern is that if we think we’ve solved the single-human alignment problem, we’ll be tempted to scale AI systems up to try to reflect the general preferences of human groups (or humanity in general), but this will simply not be possible, given that groups, and even the human species itself, do not actually have collectively aligned interests. (Even the principle of ‘don’t let AI drive us extinct’ won’t seem aligned with the agendas of the Voluntary Human Extinction movement, the anti-natalists, the Earth First eco-activists, the religious extremists expecting imminent messiahs or Raptures, or the depressed nihilists.)
And, if group alignment isn’t possible, we’ll end up in a situation where whichever subgroup has the most direct control over AI design, training, and feedback is basically in control of everybody else.
I think the original CEV paper from 2004 addresses (or at least discusses) a lot of these concerns. Basically, the thing that a group attempting to build an aligned AI should try to align it with is the collective CEV of humanity, not the CEV of any individual human.
On anti-natalism, religious extremism, voluntary extinction, etc.: if those values end up being stable under reflection and under faster, more coherent thinking, and don’t end up dominated by other values held by the same people, then the Future may indeed include things which satisfy or maximize those values.
(Though those values, and the people who hold them, don’t necessarily get more say than people who believe the opposite. If some interests and values are truly irreconcilable, a compromise might look like dividing up chunks of the lightcone.)
Of course, the first group that attempts to build a superintelligence might try to align it with something else: their own personal CEV (which may or may not include a component for the collective CEV of humanity), some kind of equal or unequal split between the individual CEVs of every human, or every sentient being, etc., or something else entirely.
This would be inadvisable for various reasons discussed in the paper, and I agree it is a real danger / problem. (Mostly though, I think anyone who tries to build any kind of CEV sovereign right now just fails, and we end up with tiny molecular squiggles.)
Max—thanks for the reply. I’m familiar with the CEV concept. But I don’t see how it helps solve any of the hard problems in alignment that involve conflicts of interest between human individuals, groups, corporations, or nation-states. It just sweeps all of those conflicts of interest under the rug.
In reality, corporations and nation-states won’t be building AIs to embody the collective CEV of humanity. They will build AIs to embody the profit-making or geopolitical interests of the builders.
We can preach at them that their AIs should embody humanity’s collective CEV. But they would get no comparative advantage from doing so. It wouldn’t help promote their group profit or power. It would be a purely altruistic act. So, given the current state of for-profit corporate governance, and for-power nation-state governance, that seems very unlikely.
Yep. I think in my ideal world, there would be exactly one operationally adequate organization permitted to build AGI. Membership in that organization would require a credible pledge to altruism and a test of oath-keeping ability.
Monopoly power of this organization to build AGI would be enforced by a global majority of nation-states, with monitoring and deterrence against defection.
I think a stable equilibrium of that kind is possible in principle, though obviously we’re pretty far away from it being anywhere near the Overton Window. (For good reason—it’s a scary idea, and probably ends up looking pretty dystopian when implemented by existing Earth governments. Alas! Sometimes draconian measures really are necessary; reality is not always nice.)
In the absence of such a radically different global political order, we might have to take our chances on the hope that the decision-makers at OpenAI, DeepMind, Anthropic, etc. will all be reasonably nice and altruistic, and not power / profit-seeking. Not great!
There might be worlds in between the most radical one sketched above and our current trajectory, but I worry that any “half measures” end up being ineffective and costly and worse than nothing, mirroring many countries’ approach to COVID lockdowns.
I’m not convinced that aligning with a single human is much harder than aligning with a group of humans that have diverse and partly conflicting interests.
I did not claim that aligning with a single human is harder than aligning with a group of humans (nor have I claimed that others believe that). I have probably expressed myself poorly, if that was the impression after reading my comment. In fact, I believe the opposite!
Let me make another attempt at explaining.
A: Figuring out how to align an AGI with a single human.
B: Figuring out how to align an AGI with a group of humans.
C: Doing B after you have completed A.
Then, for the difficulties of these, I currently believe:
all three of A, B, C are hard
B is harder than A
B is harder than C
A is much harder than C (this was what I was trying to state in the comment above)
A reasonable strategy for doing B would be to do A, and then do C (I am not super confident here, and things might be much more complex)
If you do both A and C, it is better to first focus on A (and put more resources into it), because A is harder than C.
I would be curious what other people think. My current guess would be that at least some alignment researchers believe these points (or some of them) too. I do not recall hearing opposing viewpoints.
I do not believe that, for example, the author of the PreDCA alignment proposal wishes that the values of a random human are imposed (via AGI) on the rest of humanity, even though PreDCA (currently) is a protocol that aligns AGI with a single human (called “user”).
Hi harfe, thanks for this helpful clarification.

I’d agree that A, B, and C seem hard; that B is harder than A, and that B is harder than C.
Where we disagree is that I suspect that C is harder than A, for basic game-theoretic reasons I mentioned in the original post.
I’m also not confident that C is a whole lot easier than B—I’m not sure that alignment with individual humans will actually give us all that much help in doing alignment with complicated groups of humans.
But I need to think further about this, and do some more reading!