I doubt that a utilitarian ethic is useful for maximizing human preferences, since utilitarianism is impartial in the sense that it takes everyone’s wellbeing into account, human or otherwise.
The view I would advocate is that something like utilitarianism (i.e., some form of impartial, species-indifferent welfare maximization) is a core part of human values. What I mean by ‘human values’ here isn’t on your list; it’s closer to an idealized version of our preferences: what we would prefer if we were smarter, more knowledgeable, and had greater self-control.
Russell’s assumption that “The machine’s only objective is to maximize the realization of human preferences” seems to rest on some controversial and (in my judgement) highly implausible moral views. In particular, it is speciesist: why should only human preferences be maximized? Why not animal or machine preferences?
The language of “human-compatible” is very speciesist, since ethically we should want AGI to be “compatible” with all moral patients, human or not.
On the other hand, the idea of using human brains as a “starting point” for identifying what’s moral makes sense. “Which ethical system is correct?” isn’t written in the stars or in Plato’s heaven; it seems like if the answer is encoded anywhere in the universe, it must be encoded in our brains (or in logical constructs out of brains).
The same is true for identifying the right notion of “impartial”, “fair”, “compassionate”, “taking other species’ welfare into account”, etc.; to figure out the correct moral account of those important values, you would primarily need to learn facts about human brains. You’d then need to learn facts about non-humans’ brains in order to implement the resultant impartiality procedure (because the relevant criterion, “impartiality”, says that whether you have human DNA is utterly irrelevant to moral conduct).
The need to bootstrap from values encoded in our brains doesn’t and shouldn’t mean that humans are the only moral patients (or even that we’re particularly important moral patients; insects could turn out to be utility monsters, for all we know today). Hence “human-compatible” is an unfortunate phrase here.
But it does mean that if, e.g., it turns out that cats’ ultimate true preferences are to torture all species forever, we shouldn’t give that particular preference equal decision weight. Speaking very loosely, the goal is more like ‘ensuring all beings get to have a good life’, not like ‘ensuring all species (however benevolent or sadistic they turn out to be) get an equal say in what kind of life all beings get to live’.
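To make that contrast concrete, here’s a toy sketch (purely illustrative: the species, welfare numbers, and preference functions are made up, and nothing here is a proposal for how an AGI should actually aggregate anything). It compares an objective that counts every being’s wellbeing against a rule that gives every being’s preference over outcomes an equal vote:

```python
# Toy model only: the species, welfare numbers, and preferences are invented.
from dataclasses import dataclass

OUTCOMES = ["everyone_flourishes", "torture_everyone"]

@dataclass
class Being:
    species: str
    welfare: dict           # outcome -> how good that outcome is for this being
    preferred_outcome: str  # the outcome this being would vote for

beings = [
    Being("human", {"everyone_flourishes": 10, "torture_everyone": -100}, "everyone_flourishes"),
    Being("cat",   {"everyone_flourishes": 10, "torture_everyone": -100}, "torture_everyone"),
    Being("cat",   {"everyone_flourishes": 10, "torture_everyone": -100}, "torture_everyone"),
]

def maximize_welfare(beings, outcomes):
    """'All beings get to have a good life': pick the outcome with the highest
    total wellbeing. Every being's welfare counts, whatever that being prefers."""
    return max(outcomes, key=lambda o: sum(b.welfare[o] for b in beings))

def equal_say(beings, outcomes):
    """'All species get an equal say': pick the outcome with the most votes,
    however benevolent or sadistic the underlying preferences are."""
    votes = {o: sum(1 for b in beings if b.preferred_outcome == o) for o in outcomes}
    return max(outcomes, key=votes.get)

print(maximize_welfare(beings, OUTCOMES))  # everyone_flourishes
print(equal_say(beings, OUTCOMES))         # torture_everyone (the cats outvote the human)
```

The only point of the contrast: under the first rule the cats’ wellbeing still counts fully, even though their sadistic preference about everyone else’s lives is not honored.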
If there’s a more benevolent species than humans, I’d hope that sufficiently advanced science could identify that species, and pass the buck to them. (In an odd sense, we’re already building an alien species to defer to if we’re constructing ‘an idealized version of human preferences’, since I would expect sufficiently idealized preferences to turn out to be pretty alien compared to the views human beings espouse today.)
I think it’s reasonable to worry that given humans’ flaws, humans might not in fact build AGI that ‘ensures all beings get to have a good life’. But I do think that something like the latter is the goal; and when you ask me what physical facts in the world make that ‘the goal’, and what we would need to investigate in order to work out all the wrinkles and implementation details, I’m forced to initially point to facts about humans (if only to identify the right notions of ‘what a moral patient is’ and ‘how one ought to impartially take into account all moral patients’ welfare’).
If “human-compatible” means anything non-speciesist, then I agree that it is an unfortunate phrase, since it is misleading. I also think it is misleading to call idealized preferences “human values,” since humans don’t actually hold those preferences, as you correctly point out.
You write that
“Which ethical system is correct?” isn’t written in the stars or in Plato’s heaven; it seems like if the answer is encoded anywhere in the universe, it must be encoded in our brains (or in logical constructs out of brains).
Let X be the claim that you deny in this quote. If X is taken literally, then it is a straw man, since no one believes it. If X is metaphorical, then it is very unclear what it’s supposed to mean, or whether it means anything at all. The claim that “ethics is encoded somewhere in the universe” is also unclear. My best attempt to ascribe meaning to it is this: “there is some entity in the universe which constitutes all of ethics.” But that claim seems false. The most basic ethical principles are, I believe, in some ways like logical principles. The validity of the argument “p and q, therefore p” is not constituted by any feature of the universe. To see this, imagine an alternative universe which differs from the real one in basically any way you like: it’s governed by different laws of nature, contains different lifeforms (or perhaps no life at all), has a different cosmological history, etc. If this universe had been real, then “p and q, therefore p” would still be valid. Basic ethical principles, like the claim that suffering is bad, seem just like this. If human preferences (or other features of the universe) were different, then suffering would still be bad.
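As a minimal illustration of the logical half of this analogy (a sketch only; nothing in the argument depends on the formalization, and the theorem name is mine), conjunction elimination can be proved from the rules of logic alone, without citing any contingent feature of any universe; here it is in Lean 4:

```lean
-- “p and q, therefore p”: provable using only the logical rules themselves.
-- No law of nature, lifeform, or cosmological fact appears in the proof.
theorem and_therefore_left (p q : Prop) (h : p ∧ q) : p :=
  h.left
```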
I agree that suffering is bad in all universes, for the reasons described in https://www.lesswrong.com/posts/zqwWicCLNBSA5Ssmn/by-which-it-may-be-judged. I’d say that “ethics… is not constituted by any feature of the universe” in the sense you note, but I’d point to our human brains if we were asking any question like:
What explains why “suffering is bad” is true in all universes? How could an agent realistically discover this truth—how do we filter out the false moral claims and zero in on the true ones, and how could an alien do the same?