Question 2:
Suppose tomorrow MIRI creates a friendly AGI that can learn a value system, make it consistent with minimal alteration, and extrapolate it in an agreeable way. Whose values would it be taught?
I’ve heard the idea of averaging all humans’ values together and working from there. Given that ISIS is human and that many other humans believe that the existence of extreme physical and emotional suffering is good, I find that idea pretty repellent. Are there alternatives that have been considered?
Right now, we’re trying to ensure that it will be technologically possible at all for people down the road to build AGI systems aligned with their operators’ interests. We expect that early systems should punt on those moral hazards and diffuse them as much as possible, rather than trying to lock in answers to tough philosophical questions on the first go.
That said, we’ve thought about this some. One proposal by Eliezer years ago was coherent extrapolated volition (CEV), which (roughly) deals with this problem by basing decisions on what we’d do “if counterfactually we knew everything the AI knew; we could think as fast as the AI and consider all the arguments; [and] we knew ourselves perfectly and had better self-control or self-modification ability.” We aren’t shooting for a CEV-based system right now, but that sounds like a plausible guess about what we’d want researchers to eventually develop, when our institutions and technical knowledge are much more mature.
It’s clear that we want to take the interests and preferences of religious extremists into account in making decisions, since they’re people too and their welfare matters. (The welfare of non-human sentient beings should be taken into account too.)
You might argue that their welfare matters, but they aren’t good sources of moral insight: “it’s bad to torture people on a whim, even religious militants” is a moral insight you can already get without consulting a religious militant, and perhaps adding the religious militant’s insights is harmful (or just unhelpful).
The idea behind CEV might help here, if we can find some reasonable way to aggregate extrapolated preferences. Rather than relying on what people want in today’s world, you simulate what people would want if they knew more, were more reflectively consistent, etc. One nice feature of this idea is that ISIS-ish problems might go away, to the extent that knowing more tends to make people less religious. A second nice feature is that many religious extremists’ representatives at some hypothetical AI consortium could potentially endorse the same design principles as ZachWeems’ representatives: both sides may think that they’ll “win the argument” and convince the other side if they accumulate and share enough evidence, so this framework can reduce incentives to get into zero-sum conflicts.
Of course, ISIS won’t literally be at the bargaining table. But ‘take the CEV of all sentient beings’ is a nice Schelling point, which reduces the risk that we’ll just replace first-order conflicts about what AI systems should do with second-order conflicts about what people do or don’t get to inform the process. If some people get kicked out, then AI designers may be tempted to kick other people out who are less obviously lacking in important moral insights—or people may just worry more about the possibility that designers will do that, leading to conflict. So there are practical reasons to be egalitarian here even if you don’t literally need to simulate every human (or cow, pig, etc.) in order to get good global outcomes.
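For concreteness, here is a toy sketch of the ordering the previous paragraphs gesture at: extrapolate preferences first, then aggregate, rather than averaging whatever people happen to want today. This is not anything MIRI has proposed; the preference numbers, the `evidence_shift` vector, and the `extrapolate` function are made-up stand-ins for the genuinely hard, unsolved parts of the problem.

```python
# Toy illustration only: real preference extrapolation and aggregation are
# unsolved problems. All numbers and the extrapolate() rule are made up.
import numpy as np

# Each row is one person's (hypothetical) preference weights over three outcomes.
current_prefs = np.array([
    [0.9, 0.1, 0.0],   # person who currently endorses outcome A
    [0.1, 0.8, 0.1],   # person who currently endorses outcome B
    [0.0, 0.1, 0.9],   # person who currently endorses outcome C
])

def extrapolate(prefs, evidence_shift):
    """Stand-in for 'what this person would want if they knew more':
    nudge each preference vector by a shared evidence-driven shift, then
    renormalize. Real CEV-style extrapolation is nothing this simple."""
    shifted = np.clip(prefs + evidence_shift, 0.0, None)
    return shifted / shifted.sum(axis=1, keepdims=True)

# Naive aggregation of today's preferences: just average the rows.
naive_average = current_prefs.mean(axis=0)

# Extrapolate first, then average.
evidence_shift = np.array([0.2, 0.2, -0.2])  # made-up 'effect of knowing more'
extrapolated_average = extrapolate(current_prefs, evidence_shift).mean(axis=0)

print("average of current preferences:     ", naive_average)
print("average of extrapolated preferences:", extrapolated_average)
```

The only point of the sketch is the order of operations; everything interesting is hidden inside the placeholder `extrapolate` step.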