I’m not convinced that aligning with a single human is much harder than aligning with a group of humans that have diverse and partly conflicting interests.
I did not claim that aligning with a single human is harder than aligning with a group of humans (nor have I claimed that others believe that). If my comment gave that impression, I probably expressed myself poorly. In fact, I believe the opposite!
Let me make another attempt at explaining.
A: Figuring out how to align an AGI with a single human.
B: Figuring out how to align an AGI with a group of humans.
C: Doing B after you have completed A.
Then, regarding the difficulties of these, I currently believe:

- All three of A, B, and C are hard.
- B is harder than A.
- B is harder than C.
- A is much harder than C (this is what I was trying to state in the comment above).
- A reasonable strategy for doing B would be to do A, and then do C (I am not super confident here, and things might be much more complex).
- If you pursue both A and C, it is better to focus on A first (and put more resources into it), because A is harder than C.
I would be curious what other people think. My current guess is that at least some alignment researchers believe some or all of these points too. I do not recall hearing opposing viewpoints.
I do not believe that, for example, the author of the PreDCA alignment proposal wishes for the values of a single, arbitrarily chosen human to be imposed (via AGI) on the rest of humanity, even though PreDCA is (currently) a protocol that aligns an AGI with a single human (called the “user”).
Hi harfe, thanks for this helpful clarification. I’d agree that A, B, and C all seem hard, that B is harder than A, and that B is harder than C.
Where we disagree is that I suspect that C is harder than A, for basic game-theoretic reasons I mentioned in the original post.
I’m also not confident that C is a whole lot easier than B: I’m not sure that alignment with individual humans will actually give us all that much help in doing alignment with complicated groups of humans.
But I need to think further about this, and do some more reading!