I wonder to what extent people take the alignment problem to be the problem of
(i) creating an AI system that reliably does or tries to do what its operators want it to do, as opposed to
(ii) creating an AI system that does or tries to do what is best, i.e. what is “aligned with human values” (whatever that precisely means).
I see both definitions being used and they feel importantly different to me: if we solve the problem of aligning an AI with some operator, then this still seems far away from safe AI. In fact, when I try to imagine how an AI might cause a catastrophe, the clearest causal path for me would be one where the AI is extremely competent at pursuing the operator’s stated goal, but that stated goal implies or requires catastrophe (e.g. the superintelligent AI receives some input like “Make me the emperor of the world”). On the other hand, if the AI system is aligned with humanity as a whole (in some sense), this scenario seems less likely.
I mostly back-chain from a goal that I’d call “make the future go well”. This usually maps to value-aligning AI with broad human values, so that the future is full of human goodness and not tainted by my own personal fingerprints. Actually, ideally we would first build an AI that we have enough control over that its operators can make it do something less drastic than determining the entire future of humanity, e.g. slowing AI progress to a halt until humanity pulls itself together and figures out safer alignment techniques. That usually means making it corrigible or tool-like, rather than letting it maximize its aligned values.
So I guess I ultimately want (ii) but really hope we can get a form of (i) as an intermediate step.
When I talk about the “alignment problem” I usually refer to the problem that we by default get neither (i) nor (ii).
Does that seem right to you?