Why would a reasonable candidate for a ‘partially-utilitarian AI’ lead to an outcome that’s ~worthless by utilitarian lights? I disagree with that premise—that sounds like a ~non-utilitarian AI to me, not a (nontrivially) partly utilitarian AI.
(Maybe I could have put more emphasis on what kind of AI I have in mind. As my original comment mentioned, I’m talking about “a sufficiently strong version of ‘partly-utilitarian.’” So an AI that’s just slightly utilitarian wouldn’t count. More concretely, I have in mind something like: an agent that operates via a moral parliament in which utilitarianism has > 10% of representation.)
[Added] See also my reply to Zach, in which I write:
What about the following counterexample? Suppose a powerful agent optimizes for a mixed objective, which leads to it optimizing ~half of the accessible universe for utilitarianism, the other ~half for some other scope-sensitive value, and a few planets for modal scope-insensitive human values. Then, even at high levels of capability, this universe will be ~half as good by utilitarian lights as a universe that’s fully optimized for utility, even though the optimizer wasn’t just optimizing for utility. (If you doubt whether there exist utility functions that would lead to roughly these outcomes, I’m happy to make arguments for that assumption.)
Yep, I didn’t initially understand you. That’s a great point!
This means the framework I presented in this post is wrong. I now agree with your statement:
the EV of partly utilitarian AI is higher than that of fully utilitarian AI.
I think the framework in this post can be modified to incorporate this, and the conclusions are similar. The quantity that dominates the utility calculation is now the expected representation of utilitarianism in the AGI’s values.
The two handles become:
(1) The probability of misalignment.
(2) The expected representation of utilitarianism in the moral parliament, conditional on alignment.
The conclusion of the post, then, should be something like “interventions that increase (2) might be underrated” instead of “interventions that increase the probability of fully utilitarian AGI are underrated.”
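To make that concrete, here’s a minimal toy calculation of the modified framework (a sketch only; the numbers, and the assumption that utilitarian value scales linearly with representation, are made up for illustration):

```python
# Toy sketch of the modified framework (illustrative numbers only).
# Assumes, for now, that utilitarian value scales linearly with representation:
#   EV ~ P(aligned) * E[utilitarian representation | aligned] * U_MAX

U_MAX = 1.0  # value of a future fully optimized for utility (normalized)

def expected_value(p_aligned: float, expected_representation: float) -> float:
    """Expected utilitarian value from the two 'handles':
    (1) the probability of alignment, and
    (2) the expected utilitarian representation conditional on alignment."""
    return p_aligned * expected_representation * U_MAX

# Baseline: 50% chance of alignment; utilitarianism gets 10% of the parliament if aligned.
baseline = expected_value(p_aligned=0.5, expected_representation=0.10)

# Intervention A: raise the probability of alignment by 10 percentage points.
better_alignment = expected_value(p_aligned=0.6, expected_representation=0.10)

# Intervention B: raise the expected representation by 5 percentage points.
better_representation = expected_value(p_aligned=0.5, expected_representation=0.15)

print(f"{baseline:.3f} {better_alignment:.3f} {better_representation:.3f}")
# 0.050 0.060 0.075 -- with these particular made-up numbers, handle (2) helps more.
```

(The linear-in-representation assumption is exactly what the wrinkle below calls into question.)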
On second thought, there’s another potential wrinkle, re: the representation of utilitarianism in the AI’s values. Here are two ways that could be defined:
(1) In some sort of moral parliament, what % of representatives are utilitarian?
(2) How good are outcomes relative to what would be optimal by utilitarian lights?
Arguably the latter definition is the more morally relevant one. The former is related to it, but maybe not linearly. (E.g., if the non-utilitarians in the parliament are all scope-insensitive, maybe utilitarianism needs only 5% representation to get > 50% of what it wants. If so, it may make sense to be risk-averse with respect to expected representation, e.g., to maximize the chances that some sort of compromise happens at all.)
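Here’s a minimal sketch of that non-linearity (the particular curve is invented just to illustrate the “5% representation gets > 50% of the optimum” scenario, not a claim about how a real parliament would behave):

```python
# Purely hypothetical mapping from utilitarianism's parliamentary share to the
# fraction of the utilitarian optimum actually achieved, when the rest of the
# parliament is scope-insensitive. The shape of the curve is an assumption.

def utilitarian_fraction(representation: float) -> float:
    """Toy concave mapping from representation (0..1) to fraction of the optimum."""
    if representation < 0.05:
        # Below a small threshold, assume value scales roughly linearly with share.
        return representation * 10
    # Past ~5% representation, utilitarianism already gets >50% of what it wants
    # (the scope-insensitive rest of the parliament doesn't compete for most
    # resources); further gains are gradual.
    return min(1.0, 0.5 + 0.5 * representation)

for r in (0.0, 0.03, 0.05, 0.10, 0.50, 1.0):
    print(f"{r:.0%} representation -> {utilitarian_fraction(r):.0%} of the utilitarian optimum")

# Because the curve is so concave, a 50/50 gamble between 0% and 20% representation
# (expected fraction ~0.30) is worse than a guaranteed 10% (~0.55), which is the
# sense in which risk aversion over representation can make sense here.
```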
Thanks! From the other comment thread, now I’m less confident in the moral parliament per se being a great framework, but I’d guess something along those lines should work out.