For example, I think a crux might be the tractability of animal-specific alignment work: e.g., can we align AI to specific values, or (just) make it corrigible to our preferences and commands? I don't know, but the answer would massively affect my estimate of tractability here.
This is definitely a hard debate to disentangle, because I would personally reject the question of alignment as a crux. For now, I strongly believe that the total welfare of animals has been entirely uncorrelated with our moral intentions toward animals. Total welfare has mostly changed because of land use, due to human interests.
I agree that in AGI-transformed futures that go well for humans, human desires may start playing a larger role. However, I expect that whether we mean well for animals (or don't care much about them) will not be cleanly correlated with outcomes for them.
There are worlds where we mean well toward a large share of animals, stop intentionally killing them, and help certain wild animals. But that world could very well end up with a large population of animals living bad lives.
On the other hand, out of apathy or even negative feelings toward wild animals, we may decide to limit their spread and use resources in a way that optimizes for human flourishing over animal abundance. That world could end up being much better for animal welfare.
Maybe some extreme scenarios tip the scales, for example if we bred incredibly happy genetically modified animals out of positive feelings toward them. But I'm not confident putting any weight on such utilitarian-leaning scenarios when assessing post-AGI futures, because part of the reason human moral intentions are not correlated with total animal welfare is that humans are not scope-sensitive utilitarians.
What kinds of values will humans have post-AGI, if AGI goes well for us? We don't need to be scope-sensitive utilitarians to want to adopt even radical preferences like ending animal exploitation and solving wild animal suffering (WAS), no? (Most humans don't like factory farming or the idea of cute animals being eaten alive.)
Solving WAS intuitively seems too niche for people to deliberately change their minds about, but I could be wrong. After all, the Bible says the wolf will dwell with the lamb and the lion will eat straw like the ox, so it could be that human preferences tend to come back to the idea that animal suffering can be bad even when it doesn't depend on human actions.
I guess the causal mechanism I'm thinking of here is:
1. Most humans feel at least a little sad when they see a baby gazelle being eaten alive by hyenas.
2. AGI is so powerful that humans can order it to do things like "stop baby gazelles being eaten alive whilst retaining the beauty of nature and the complexity of ecosystems", and it'll just go away and do it somehow.
Maybe this is foolish and naive on my part! And maybe I'm wrong to think our moral preferences/intuitions will be so robust to the disruption of AGI, even if AGI goes well for us.
PS: looks like Michael Dickens just posted on this.
Toby, would you be more optimistic for animals if we can align AGI to specific values rather than just making it corrigible to humans' preferences and commands?
My impression is that pro-animal views are (dramatically?) overrepresented at Anthropic relative to the rest of society. If Anthropic gets to AGI first and instils/locks in pro-animal values in/to that AGI, that seems better for animals than if whoever gets to AGI first just makes it purely corrigible, because most humans who operate the purely corrigible AGI won't be as pro-animal.
I think in the long run I'd be more confident that corrigible AI would lead to good futures than AI that is aligned to specific values (besides perhaps some side-constraints). This is mainly because I'm pretty clueless and think our current values are likely to be wrong, and I'd rather we had more time to improve them.
I haven't thought enough about the relationship between power concentration and corrigibility, though, and I expect that could change my mind.
Oh yes, but I made the above comment mainly to represent a view I've seen in some AI x Animals work: that we should be working on aligning AGI to pro-animal values through things like AnimalHarmBench, etc.
This makes sense. I would worry about the purely corrigible AGI being used by actors in such a way that we never get to instil the correct/good/post-long-reflection values in AGI/ASI down the line.
Yep, fair; that's what I mean by "power concentration and corrigibility". AGI being constrained by some values makes it at least minimally democratic (values are shaped by everyone who makes up a language community, especially for LLMs).