Russell’s assumption that “The machine’s only objective is to maximize the realization of human preferences” seems to presuppose some controversial and (in my judgement) highly implausible moral views. In particular, it is speciesist: why should only human preferences be maximized? Why not animal or machine preferences?
One might respond that Russell is giving advice to humans, and humans should maximize human preferences, since we should all maximize our own preferences. On this reading, he isn’t assuming that there is anything morally special about humans, and his position is therefore not speciesist. I respond that maximizing my own preferences and maximizing human preferences are very different objectives, since there are many humans other than myself. This defence therefore rests on a mischaracterization of Russell’s assumption (at least as you outlined it). Furthermore, the assumption that we should maximize our own preferences seems arbitrary and unsupported anyway.
You write that “There are some mechanics that can be deployed to achieve [an AI following the guidelines]. These include game theory, utilitarian ethics, and an understanding of human psychology.”
I doubt that a utilitarian ethic is useful for maximizing human preferences, since utilitarianism is impartial in the sense that it takes everyone’s wellbeing into account, human or otherwise. I also doubt that it supports the maximization of the agent’s own preferences, where “the agent” is assumed to be an individual human, since human preferences have non-utilitarian features. The precise nature of these features depends on what exactly you mean by “preference,” so let me illustrate the point with some sensible-sounding definitions of “preference”.
(A) An agent is said to prefer x over y iff he would choose the certain outcome x over the certain outcome y when given the option.
This makes it tautological that agents maximize their preferences when the necessary factual information is available. However, people often behave in non-utilitarian ways even when they possess all the relevant factual information. They may, e.g., spend their money on luxuries instead of donations, or they may support factory farming by buying its products.
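To make the gap concrete, here is a minimal code sketch (purely my own illustration; the agent and the welfare numbers are invented for the example) of how preference-maximization in sense (A) and utilitarian welfare-maximization can come apart:

```python
# Definition (A): "prefer x over y" just means "would choose x over y
# when given the option". Illustrative only; the figures are made up.

# Hypothetical aggregate-welfare impact of each option.
AGGREGATE_WELFARE = {"luxury_purchase": 1, "donation": 100}

def prefers(choice_function, x, y):
    """Definition (A): the agent prefers x over y iff it would choose x over y."""
    return choice_function(x, y) == x

def typical_human(x, y):
    """A stylized, fully informed agent who still picks the luxury good."""
    return "luxury_purchase" if "luxury_purchase" in (x, y) else x

# Under (A), the agent trivially acts on its preferences...
assert prefers(typical_human, "luxury_purchase", "donation")

# ...but the preferred option is not the one utilitarianism recommends.
assert AGGREGATE_WELFARE["donation"] > AGGREGATE_WELFARE["luxury_purchase"]
```

Under (A), “the agent maximizes his preferences” holds by construction; nothing in the definition ties those preferences to aggregate welfare.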
(B) An agent is said to prefer x over y iff he has an urge/craving towards doing x instead of doing y. In other words, the agent would have to muster some strength of will if he is to avoid doing x instead of y.
People’s cravings/urges can often lead them in non-utilitarian directions (think, e.g., of a drug addict who would be better off if he could muster the will to quit the drug).
(C) An agent is said to prefer x over y iff the feelings/emotions/passions that motivate him towards x are more intense than those which motivate him towards y. The intensity is here assumed to be some consciously felt feature of the feelings.
Warm-glow giving is, by definition, motivated by our feelings/emotions. However, it usually has fairly little impact on aggregate happiness, so utilitarianism doesn’t recommend it.
(D) An agent is said to prefer x over y iff he values x more than y.
This definition prompts the question of what “valuing” refers to. One possible answer is to define “valuing” as in (C), but (C) has already been dealt with. Another option is the following.
(E) An agent values x more than y iff he believes x to be more valuable than y.
This would make preference-maximization compatible with utilitarianism, insofar as the agent believes in utilitarianism and lacks beliefs that contradict it. However, it would also be compatible with any other moral theory whatsoever, so long as we make the analogous assumptions on behalf of that theory.
It seems worth adding two more comments about (E). First, unlike (A), (B), and (C), it introduces a rationale for maximizing one’s preferences: we cannot act on an unknown truth, but only on what we believe to be true. Thus, we must act on our moral beliefs rather than on some unknown moral truth.
Second, (E) seems like a bad analysis of “preference,” for although moral views have some preference-like features (specifically, they can motivate behavior), they also have features that are more belief-like than preference-like. They can, e.g., serve as premises or conclusions in arguments, one can have credences in them, and they can be the subject matter of questions.
I doubt that a utilitarian ethic is useful for maximizing human preferences, since utilitarianism is impartial in the sense that it takes everyone’s wellbeing into account, human or otherwise.
The view I would advocate is that something like utilitarianism (i.e., some form of impartial, species-indifferent welfare maximization) is a core part of human values. What I mean by ‘human values’ here isn’t on your list; it’s closer to an idealized version of our preferences: what we would prefer if we were smarter, more knowledgeable, and had greater self-control.
Russell’s assumption that “The machine’s only objective is to maximize the realization of human preferences” seems to presuppose some controversial and (in my judgement) highly implausible moral views. In particular, it is speciesist: why should only human preferences be maximized? Why not animal or machine preferences?
The language of “human-compatible” is very speciesist, since ethically we should want AGI to be “compatible” with all moral patients, human or not.
On the other hand, the idea of using human brains as a “starting point” for identifying what’s moral makes sense. “Which ethical system is correct?” isn’t written in the stars or in Plato’s heaven; it seems like if the answer is encoded anywhere in the universe, it must be encoded in our brains (or in logical constructs out of brains).
The same is true for identifying the right notion of “impartial”, “fair”, “compassionate”, “taking other species’ welfare into account”, etc.; to figure out the correct moral account of those important values, you would primarily need to learn facts about human brains. You’d then need to learn facts about non-humans’ brains in order to implement the resultant impartiality procedure (because the relevant criterion, “impartiality”, says that whether you have human DNA is utterly irrelevant to moral conduct).
The need to bootstrap from values encoded in our brains doesn’t and shouldn’t mean that humans are the only moral patients (or even that we’re particularly important moral patients; insects could turn out to be utility monsters, for all we know today). Hence “human-compatible” is an unfortunate phrase here.
But it does mean that if, e.g., it turns out that cats’ ultimate true preferences are to torture all species forever, we shouldn’t give that particular preference equal decision weight. Speaking very loosely, the goal is more like ‘ensuring all beings get to have a good life’, not like ‘ensuring all species (however benevolent or sadistic they turn out to be) get an equal say in what kind of life all beings get to live’.
If there’s a more benevolent species than humans, I’d hope that sufficiently advanced science could identify that species, and pass the buck to them. (In an odd sense, we’re already building an alien species to defer to if we’re constructing ‘an idealized version of human preferences’, since I would expect sufficiently idealized preferences to turn out to be pretty alien compared to the views human beings espouse today.)
I think it’s reasonable to worry that given humans’ flaws, humans might not in fact build AGI that ‘ensures all beings get to have a good life’. But I do think that something like the latter is the goal; and when you ask me what physical facts in the world make that ‘the goal’, and what we would need to investigate in order to work out all the wrinkles and implementation details, I’m forced to initially point to facts about humans (if only to identify the right notions of ‘what a moral patient is’ and ‘how one ought to impartially take into account all moral patients’ welfare’).
If “human-compatible” means anything non-speciesist, then I agree that it is an unfortunate phrase, since it is misleading. I also think it is misleading to call idealized preferences “human values,” since humans don’t actually hold those preferences, as you correctly point out.
You write that
“Which ethical system is correct?” isn’t written in the stars or in Plato’s heaven; it seems like if the answer is encoded anywhere in the universe, it must be encoded in our brains (or in logical constructs out of brains).
Let X be the claim which you deny in this quote. If X is taken literally, then it is a straw man, since no one believes it. If X is metaphorical, then it is very unclear what it is supposed to mean, or whether it means anything at all. The claim that “ethics is encoded somewhere in the universe” is also unclear. My best attempt to ascribe a meaning to it is this: “there is some entity in the universe which constitutes all of ethics.” But that claim seems false.
The most basic ethical principles are, I believe, in some ways like logical principles. The validity of the argument “p and q, therefore p” is not constituted by any feature of the universe. To see this, imagine an alternative universe which differs from the real one in basically any way you like: it is governed by different laws of nature, contains different lifeforms (or perhaps no life at all), has a different cosmological history, etc. If that universe had been real, then “p and q, therefore p” would still be valid. Basic ethical principles, like the claim that suffering is bad, seem just like this. If human preferences (or other features of the universe) were different, suffering would still be bad.
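As a purely formal aside (my illustration, with a theorem name of my own choosing), the validity of this inference can even be checked mechanically, without appealing to any fact about which universe we happen to inhabit:

```lean
-- Illustrative only: "p and q, therefore p" holds for arbitrary propositions
-- p and q, with no premise about laws of nature, lifeforms, or cosmology.
theorem and_elim_left (p q : Prop) (h : p ∧ q) : p :=
  h.left
```

Nothing in the proof refers to any contingent feature of our universe, which is the sense in which the argument’s validity is not constituted by such features.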
I agree that suffering is bad in all universes, for the reasons described in https://www.lesswrong.com/posts/zqwWicCLNBSA5Ssmn/by-which-it-may-be-judged. I’d say that “ethics… is not constituted by any feature of the universe” in the sense you note, but I’d point to our human brains if we were asking any question like:
What explains why “suffering is bad” is true in all universes? How could an agent realistically discover this truth—how do we filter out the false moral claims and zero in on the true ones, and how could an alien do the same?