As far as I know, most current alignment work is going towards aligning ai with human values. If that’s successful then yay for us, but if we worked towards aligning ai with sentientist values (along the lines of “evidence, reason, and compassion for all sentient beings”), then we would also be in the group of valued beings. If people think that would go well for us, then I think it would make sense to think about ways to redirect more research towards aligning ai with all sentient beings, rather than just human values.
For example, humans. We are somewhat aligned with ourselves, but not with other animals, and that’s been catastrophic for animals (see factory farms, industrial fishing). If we encountered aliens that were more powerful than us, that had alignment like ours, they would not care about wiping us out (maybe a few of them would, but most wouldn’t). But if those aliens were aligned with all sentient beings, they would care. Say for some reason that very powerful aliens were somehow convinced by elephants to be aligned with elephants, we would still be on the chopping block along with every other species. So it’s in everyones interest to align them with all sentient beings, and in the process, we get alignment with us as well.
I would be interested in hearing why people might think ai that went well for animals would not go well for humans, I can imagine scenarios like that but they seem extremely unlikely to me.
I’m having a hard time putting what I mean into words, something like “alignment with all sentient beings gets alignment with humans for free whereas alignment with humans does not get alignment with other sentient beings for free” plus “alignment with all sentient beings is simpler than alignment with humans in particular”. I think the question I posed in my original comment would help determine whether someone agrees with the first part of this paragraph.
OK I think I get what you’re saying now. I think the statement “most current alignment work is going towards aligning ai with human values” is not true. Alignment work is primarily about how to point ASI at any goal at all without there being catastrophic unintended consequences.
It sounds to me like you’re saying the structure of the problem is like this:
problem 1: how to align AI to humans
problem 2: how to align AI to all sentient beings
and these are two totally separate problems, and alignment researchers are working on #1 to the exclusion of #2. Whereas I think really the structure is more like this:
problem 1: how to align AI to any goal whatsoever
problem 2: choosing what to align AI to
I think problem 2 is hard from a social perspective (if problem 1 is solved, how do you ensure that AI is aligned to care about animal welfare, even though many people won’t want that?), but easy from a technical perspective: once you figure out how to align AI to human values, the technical problems are pretty much solved. And alignment researchers are almost all working on problem 1, not problem 2.
In other words, almost all work on how to align AI to human values translates directly to the problem of how to align AI to sentientist values. It’s not like a totally separate problem.
Hmm, that opens up a lot of interesting conversation threads. I actually think some goals will be easier to align ai towards than others, for example we’ve aligned some ai to winning at chess and now they’re better than any human. Obviously that kind of goal is much simpler than any values framework that would be worth aligning agi too, but I think sentientist values would be easier to instill than “human values” (although not in the case of LLMs, I think they’re already basically “aligned” with human values and we now need to shift them towards caring more about all sentient beings). And on top of that, I think sentientist values will care enough about us and our values that a sentientist agi would “go well” for us.
But I’m not even close to an expert, so that’s all very tentative speculation.
for example we’ve aligned some ai to winning at chess and now they’re better than any human
Chess bots are narrow AI, not general AI, which makes the situation very different. We don’t know how to align an ASI to the goal of winning at chess. The most likely outcome would be some sort of severe misalignment—for example, maybe we think we trained the ASI to win at chess, but what actually maximizes its reward signal is the checkmate position, so it builds a fleet of robots to cut down every tree in the world to build trillions of chess sets and arranges every chess board into a checkmate position. See A simple case for extreme inner misalignment for more on why this sort of thing would happen.
Chess bots don’t do that because they have no concept of any world existing outside of the game they’re playing, which would not be the case for ASI.
Sorry, can you explain in more detail? I don’t understand what’s important about that question
As far as I know, most current alignment work is going towards aligning ai with human values. If that’s successful then yay for us, but if we worked towards aligning ai with sentientist values (along the lines of “evidence, reason, and compassion for all sentient beings”), then we would also be in the group of valued beings. If people think that would go well for us, then I think it would make sense to think about ways to redirect more research towards aligning ai with all sentient beings, rather than just human values.
For example, humans. We are somewhat aligned with ourselves, but not with other animals, and that’s been catastrophic for animals (see factory farms, industrial fishing). If we encountered aliens that were more powerful than us, that had alignment like ours, they would not care about wiping us out (maybe a few of them would, but most wouldn’t). But if those aliens were aligned with all sentient beings, they would care. Say for some reason that very powerful aliens were somehow convinced by elephants to be aligned with elephants, we would still be on the chopping block along with every other species. So it’s in everyones interest to align them with all sentient beings, and in the process, we get alignment with us as well.
I would be interested in hearing why people might think ai that went well for animals would not go well for humans, I can imagine scenarios like that but they seem extremely unlikely to me.
I’m having a hard time putting what I mean into words, something like “alignment with all sentient beings gets alignment with humans for free whereas alignment with humans does not get alignment with other sentient beings for free” plus “alignment with all sentient beings is simpler than alignment with humans in particular”. I think the question I posed in my original comment would help determine whether someone agrees with the first part of this paragraph.
OK I think I get what you’re saying now. I think the statement “most current alignment work is going towards aligning ai with human values” is not true. Alignment work is primarily about how to point ASI at any goal at all without there being catastrophic unintended consequences.
It sounds to me like you’re saying the structure of the problem is like this:
problem 1: how to align AI to humans
problem 2: how to align AI to all sentient beings
and these are two totally separate problems, and alignment researchers are working on #1 to the exclusion of #2. Whereas I think really the structure is more like this:
problem 1: how to align AI to any goal whatsoever
problem 2: choosing what to align AI to
I think problem 2 is hard from a social perspective (if problem 1 is solved, how do you ensure that AI is aligned to care about animal welfare, even though many people won’t want that?), but easy from a technical perspective: once you figure out how to align AI to human values, the technical problems are pretty much solved. And alignment researchers are almost all working on problem 1, not problem 2.
In other words, almost all work on how to align AI to human values translates directly to the problem of how to align AI to sentientist values. It’s not like a totally separate problem.
Hmm, that opens up a lot of interesting conversation threads. I actually think some goals will be easier to align ai towards than others, for example we’ve aligned some ai to winning at chess and now they’re better than any human. Obviously that kind of goal is much simpler than any values framework that would be worth aligning agi too, but I think sentientist values would be easier to instill than “human values” (although not in the case of LLMs, I think they’re already basically “aligned” with human values and we now need to shift them towards caring more about all sentient beings). And on top of that, I think sentientist values will care enough about us and our values that a sentientist agi would “go well” for us.
But I’m not even close to an expert, so that’s all very tentative speculation.
Chess bots are narrow AI, not general AI, which makes the situation very different. We don’t know how to align an ASI to the goal of winning at chess. The most likely outcome would be some sort of severe misalignment—for example, maybe we think we trained the ASI to win at chess, but what actually maximizes its reward signal is the checkmate position, so it builds a fleet of robots to cut down every tree in the world to build trillions of chess sets and arranges every chess board into a checkmate position. See A simple case for extreme inner misalignment for more on why this sort of thing would happen.
Chess bots don’t do that because they have no concept of any world existing outside of the game they’re playing, which would not be the case for ASI.