Russell’s assumption that “The machine’s only objective is to maximize the realization of human preferences” seems to presuppose some controversial and (in my judgement) highly implausible moral views. In particular, it is speciesist, for why should only human preferences be maximized? Why not animal or machine preferences?
One might respond that Russell is giving advice to humans, and humans should maximize human preferences, since we should all maximize our own preferences. On this reading he isn’t assuming that there is anything morally special about humans, and his position is therefore not speciesist. I respond that maximizing my own preferences and maximizing human preferences are very different objectives, since there are many humans other than myself. This defence therefore rests on a mischaracterization of Russell’s assumption (at least as you outlined it). Furthermore, the assumption that we should maximize our own preferences seems arbitrary and unsupported in any case.
You write that “There are some mechanics that can be deployed to achieve [an AI following the guidelines]. These include game theory, utilitarian ethics, and an understanding of human psychology.”
I doubt that a utilitarian ethic is useful for maximizing human preferences, since utilitarianism is impartial in the sense that it takes everyone’s wellbeing into account, human or otherwise. I also doubt that it supports the maximization of the agent’s own preferences, where “the agent” is assumed to be an individual human, since human preferences have non-utilitarian features. The precise nature of these features depends on what exactly you mean by “preference,” so let me illustrate the point with some sensible-sounding definitions of “preference”.
(A) An agent is said to prefer x over y iff he would choose the certain outcome x over the certain outcome y, when given the option.
This makes it tautological that agents maximize their preferences when the necessary factual information is available. However, people often behave in non-utilitarian ways even if they possess all the relevant factual information. They may, for example, spend their money on luxuries instead of donations, or they may support factory farming by buying its products.
(B) An agent is said to prefer x over y iff he has an urge/craving towards doing x instead of doing y. In other words, the agent would have to muster some strength of will if he is to avoid doing x instead of y.
People’s cravings/urges can often lead them in non-utilitarian directions (think, e.g., of a drug addict who would be better off if he could muster the will to quit the drugs).
(C) An agent is said to prefer x over y iff the feelings/emotions/passions that motivate him towards x are more intense than those which motivate him towards y. The intensity is here assumed to be a consciously felt feature of the feelings.
Warm-glow giving is, by definition, motivated by our feelings/emotions. However, it usually has fairly little impact on aggregate happiness, so utilitarianism doesn’t recommend it.
(D) An agent is said to prefer x over y iff he values x more than y.
This definition prompts the question of what “valuing” refers to. One possible answer is to define “valuing” along the lines of (C), but (C) has already been dealt with. Another option is the following.
(E) An agent values x more than y iff he believes x to be more valuable than y.
This would make preference-maximization compatible with utilitarianism, insofar as the agent believes in utilitarianism and lacks beliefs that contradict it. However, it would also be compatible with any other moral theory whatsoever, so long as we make the analogous assumptions on behalf of that theory.
It seems worth adding two more comments about (E). First, unlike (A), (B), and (C), it introduces a rationale for maximizing one’s preferences. We cannot act on an unknown truth, but only on what we believe to be true. Thus, we must act on our moral beliefs rather than on some unknown moral truth.
Second, (E) seems like a bad analysis of “preference,” for although moral views have some preference-like features (specifically, they can motivate behavior), they also have features that are more belief-like than preference-like. They can, for example, serve as premises or conclusions in arguments, one can have credences in them, and they can be the subject matter of questions.
If “human-compatible” is meant to mean anything non-speciesist, then I agree that it is an unfortunate phrase, since it is misleading. I also think it is misleading to call idealized preferences “human values,” since humans don’t actually hold those preferences, as you correctly point out.
You write that
Let X be the claim which you deny in this quote. If X is taken literally, then it is a straw man, since no one believes it. If X is metaphorical, then it is very unclear what it’s supposed to mean, or whether it means anything at all. The claim that “ethics is encoded somewhere in the universe” is also unclear. My best attempt to ascribe meaning to it is this: “there is some entity in the universe which constitutes all of ethics.” But that claim seems false.

The most basic ethical principles are, I believe, in some ways like logical principles. The validity of the argument “p and q, therefore p” is not constituted by any feature of the universe. To see this, imagine an alternative universe which differs from the real one in basically any way you like: it is governed by different laws of nature, contains different lifeforms (or perhaps no life at all), has a different cosmological history, and so on. If this universe had been real, then “p and q, therefore p” would still be valid. Basic ethical principles, like the claim that suffering is bad, seem just like this. If human preferences (or other features of the universe) were different, then suffering would still be bad.
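(As a purely illustrative aside: the point that the validity of “p and q, therefore p” does not rest on any feature of the universe can be made vivid in a proof assistant. The sketch below, in Lean, proves the inference for arbitrary propositions from the logical rules alone; no empirical premise about the world appears anywhere in it.)

```lean
-- "p and q, therefore p" holds for any propositions p and q whatsoever.
-- The proof uses only the elimination rule for conjunction; nothing about
-- laws of nature, lifeforms, or cosmological history is assumed.
example (p q : Prop) (h : p ∧ q) : p := h.left
```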