Toby, would you be more optimistic for animals if we can align AGI to specific values rather than just making it corrigible to humans’ preferences and commands?
My impression is that pro-animal views are (dramatically?) overrepresented at Anthropic relative to the rest of society. If Anthropic gets to AGI first and instils and locks in pro-animal values in that AGI, that seems better for animals than if whoever gets to AGI first just makes it purely corrigible, because most humans who operate the purely corrigible AGI won’t be as pro-animal.
I think in the long run I’d be more confident that corrigible AI would lead to good futures than AI that is aligned to specific values (besides perhaps some side-constraints). This is mainly because I’m pretty clueless and think our current values are likely to be wrong, and I’d rather we had more time to improve them.
I haven’t thought enough about the relationship between power concentration and corrigibility though—I expect that could change my mind.
Oh yes, but I made the above comment more to represent a view I’ve seen in some AI x Animals work: that we should be working on aligning AGI to pro-animal values, through things like AnimalHarmBench etc.
This makes sense. I would worry about the purely corrigible AGI being used by actors in such a way that we never get to instil the correct/good/post-long-reflection values in AGI/ASI down the line.
Yep, fair—that’s what I mean by “power concentration and corrigibility”. AGI being constrained by some values makes it at least minimally democratic (the values are shaped by everyone who contributes to a language, especially in the case of LLMs).