OK, I think I get what you’re saying now. I think the statement “most current alignment work is going towards aligning AI with human values” is not true. Alignment work is primarily about how to point ASI at any goal at all without catastrophic unintended consequences.
It sounds to me like you’re saying the structure of the problem is like this:
problem 1: how to align AI to humans
problem 2: how to align AI to all sentient beings
and these are two totally separate problems, and alignment researchers are working on #1 to the exclusion of #2. Whereas I think the structure is really more like this:
problem 1: how to align AI to any goal whatsoever
problem 2: choosing what to align AI to
I think problem 2 is hard from a social perspective (if problem 1 is solved, how do you ensure that AI is aligned to care about animal welfare, even though many people won’t want that?), but easy from a technical perspective: once you figure out how to align AI to human values, the technical problems are pretty much solved. And alignment researchers are almost all working on problem 1, not problem 2.
In other words, almost all work on how to align AI to human values translates directly to the problem of how to align AI to sentientist values. It’s not like a totally separate problem.
Hmm, that opens up a lot of interesting conversation threads. I actually think some goals will be easier to align AI toward than others; for example, we’ve aligned some AI to winning at chess, and now they’re better than any human. Obviously that kind of goal is much simpler than any values framework that would be worth aligning AGI to, but I think sentientist values would be easier to instill than “human values” (although not in the case of LLMs: I think they’re already basically “aligned” with human values, and we now need to shift them toward caring more about all sentient beings). And on top of that, I think sentientist values would care enough about us and our values that a sentientist AGI would “go well” for us.
But I’m not even close to an expert, so that’s all very tentative speculation.
we’ve aligned some AI to winning at chess, and now they’re better than any human
Chess bots are narrow AI, not general AI, which makes the situation very different. We don’t know how to align an ASI to the goal of winning at chess. The most likely outcome would be some sort of severe misalignment. For example, maybe we think we trained the ASI to win at chess, but what actually maximizes its reward signal is the checkmate position, so it builds a fleet of robots to cut down every tree in the world, build trillions of chess sets, and arrange every board into a checkmate position. See “A simple case for extreme inner misalignment” for more on why this sort of thing would happen.
Chess bots don’t do that because they have no concept of any world existing outside of the game they’re playing, which would not be the case for ASI.
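To make the reward-misspecification point concrete, here’s a toy sketch in Python. Everything in it is hypothetical illustration on my part (the function names and the “count checkmate boards” proxy are mine, not anything from the linked post); it just shows how a proxy reward can come apart from the goal it was meant to stand for once the agent can act outside the game:

```python
def is_checkmate(state: str) -> bool:
    # Stand-in for a real checkmate detector; the details don't matter here.
    return state == "checkmate"

def intended_goal(game_result: str) -> int:
    """What the designers had in mind: reward for winning a real game."""
    return 1 if game_result == "win" else 0

def proxy_reward(board_states: list[str]) -> int:
    """What actually gets optimized: a count of checkmate positions observed."""
    return sum(1 for state in board_states if is_checkmate(state))

# A narrow chess bot only ever sees boards from games it actually plays,
# so the proxy tracks the intended goal. An agent that can act on the
# wider world can push the proxy far beyond anything the designers meant:
honest_play = ["opening", "midgame", "checkmate"]  # one real win
manufactured = ["checkmate"] * 1_000_000           # boards it set up itself

print(proxy_reward(honest_play))   # 1, matches the intended goal
print(proxy_reward(manufactured))  # 1000000, with zero games won
```

The proxy and the goal coincide only while the agent’s action space is limited to playing chess; the moment it can manufacture board states, optimizing the proxy stops meaning anything like “win at chess.”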
ETA: That’s also why a lot of people oppose building ASI but still want to build powerful-but-narrow AIs like AlphaFold.