Empirical disagreements between groups are often more consequential than normative ones, so just pointing out what groups with a given normative stance focus on may not be a strong argument.
I found Steven Byrnes’ arguments in the other thread somewhat convincing (in particular, the part where maybe no one wants to do ambitious value learning). That said, besides ambitious value learning, there are intermediates like value-agnostic expansion, where an AI learns the short-term preferences of a human overseer, like “keep the overseer comfortable and in control.” And those short-term preferences can themselves backfire, because the humans will stick around in protected bubbles, and they can be attacked. So, at the very least, AI alignment research that takes s-risks extremely seriously would spend >10% of its time thinking through failure modes of various alignment schemes, particularly focused on “Where might hyperexistential separation fail?”
“Maybe no one” is actually an overstatement, sorry, here are some exceptions: 1,2,3. (I have corrected my previous comment.)
I guess I think of current value learning work as being principally along the lines of “What does value learning even mean? How do we operationalize that?” And if we’re still confused about that question, it makes it a bit hard to productively think about failure modes.
It seems pretty clear to me that “unprincipled, making-it-up-as-you-go-along” alignment schemes would be bad for s-risks, for the reasons you mentioned. So trying to gain clarity about the lay of the land seems good.
Regarding susceptibility to s-risk:
If you keep humans around, they can decide on how to respond to threats and gradually improve their policies as they figure out more (or their AIs figure out more).
If you build incorrigible AIs who will override human preferences (so that a threatened human has no ability to change the behavior of their AI), while themselves being resistant to threats, then you may indeed reduce the likelihood of threats being carried out.
But in practice all the value is coming from you solving “how do we deal with threats?” at the same time as you solve the alignment problem.
I don’t think there’s any real argument that solving CEV or ambitious value learning per se helps with these difficulties, except insofar as your AI was able to answer these questions. But in that case a corrigible AI could also answer those questions.
Humans may ultimately build incorrigible AI for decision-theoretic reasons, but I think the decision to do so should probably be separated from solving alignment.
I think the deepest coupling comes from the fact that the construction of incorrigible AI is itself an existential risk, and so it may be extremely harmful to build technology that enables that prior to having norms and culture that are able to use it responsibly.
Overall, I’m much less sure than you that “making-it-up-as-you-go-along” alignment is bad for s-risk.
Hi, regarding this part:
“And those short-term preferences can themselves backfire, because the humans will stick around in protected bubbles, and they can be attacked.”
I’m not 100% sure I understand; could you elaborate a little? Is the idea that the human overseer’s values could value punishing some out-group or something else?