Thanks for this! I might tweak claim 1 to the following: The probability that this AI has partly utilitarian values dominates EV calculations. (In a soft sense of “dominates”: it’s the largest single factor, but not approximately the only factor.)
Argument for this version of the claim over the original one:
From a utilitarian view, partly-utilitarian AI would be “just” a few times less valuable than fully utilitarian AI (for a sufficiently strong version of “partly-utilitarian”).
There’s lots of room for moral trade / win-win compromises between different value systems. For example, common scope-insensitive values and utilitarian values can both get most of what they want. So partly-utilitarian AI could easily be ~similarly valuable (say, half as valuable) as fully utilitarian AI.
And partly-utilitarian AI is more than a few times more likely than fully utilitarian AI to come about.
Most AI developers would be much more likely to make their AI partly utilitarian than fully utilitarian, since this pluralism may better reflect their values and better accommodate (internal and external) political pressures.
Efforts to make AI pluralistic also mitigate “race to the bottom” dynamics, by making “losing” much less bad for actors who don’t develop advanced AI first. So pluralistic efforts are significantly more likely to succeed at making aligned AI at all.
Since it’s at worst a factor of a few less valuable and it’s more than a few times more likely, the EV of partly utilitarian AI is higher than that of fully utilitarian AI. (A toy version of this comparison is sketched below.)
[Edit: following the below discussion, I’m now less confident in the second premise above, so now I’m fairly unsure which of P(AI is fully utilitarian) and P(AI is partly utilitarian) is more important, and I suspect neither is > 10x more important than the other.]
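(A minimal sketch of the EV comparison above, with placeholder numbers made up purely for illustration; nothing here is an estimate from the thread:)

```python
# Toy BOTEC for the claim above. All numbers are illustrative placeholders,
# not estimates anyone in this thread has endorsed.

# Value of outcomes, normalized so a fully utilitarian AI = 1.
value_fully_utilitarian = 1.0
value_partly_utilitarian = 0.5   # "say, half as valuable," per the argument above

# Probabilities of each outcome coming about (hypothetical).
p_fully_utilitarian = 0.01
p_partly_utilitarian = 0.10      # "more than a few times more likely"

ev_fully = p_fully_utilitarian * value_fully_utilitarian
ev_partly = p_partly_utilitarian * value_partly_utilitarian

print(f"EV from fully utilitarian AI:  {ev_fully:.3f}")   # 0.010
print(f"EV from partly utilitarian AI: {ev_partly:.3f}")  # 0.050
# With these placeholder numbers, the partly-utilitarian term dominates,
# which is the structure of the argument above.
```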
I disagree. I think “partly-utilitarian AI,” in the standard sense of the phrase, would produce orders of magnitude less utility than a system optimizing for utility, just because optimization for [anything else] likely comes apart from optimization for utility at high levels of capability. If we stipulate that “partly-utilitarian AI” makes a decent fraction of the utility of a utilitarian AI, I think such a system is extremely unlikely to exist.
Thanks for pushing back!
I think “partly-utilitarian AI,” in the standard sense of the phrase, would produce orders of magnitude less utility than a system optimizing for utility, just because optimization for [anything else] likely comes apart from optimization for utility at high levels of capability.
What about the following counterexample? Suppose a powerful agent optimizes for a mixed objective, which leads to it optimizing ~half of the accessible universe for utilitarianism, the other ~half for some other scope-sensitive value, and a few planets for modal scope-insensitive human values. Then, even at high levels of capability, this universe will be ~half as good by utilitarian lights as a universe that’s fully optimized for utility, even though the optimizer wasn’t just optimizing for utility. (If you doubt whether there exist utility functions that would lead to roughly these outcomes, I’m happy to make arguments for that assumption.)
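(One concrete, if stylized, way such a utility function could exist: a Nash-bargaining-style product of two component values, each assumed linear in the resources devoted to it. This construction is my illustration, not one given in the thread:)

```python
# A single objective whose optimum splits resources across two values.
# Each component value is assumed linear in its resource share; the
# functional form is invented for illustration.

def mixed_objective(r_utilitarian: float) -> float:
    """r_utilitarian = fraction of resources optimized for utilitarian value;
    the remainder goes to a second scope-sensitive value."""
    u_utilitarian = r_utilitarian
    u_other = 1.0 - r_utilitarian
    return (u_utilitarian * u_other) ** 0.5  # geometric mean / Nash product

# Grid-search the optimum of the mixed objective.
best_r = max((i / 1000 for i in range(1001)), key=mixed_objective)
print(best_r)  # 0.5: the optimizer of this one objective splits resources ~50-50,
# yielding ~half the utilitarian value of full optimization (given linearity).
```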
(I also don’t yet find your linked argument convincing. It argues that “If a future does not involve optimizing for the good, value is almost certainly near-zero.” I agree, but imo it’s quite a leap from that to [If a future does not only involve optimizing for the good, value is almost certainly near-zero].)
If we stipulate that “partly-utilitarian AI” makes a decent fraction of the utility of a utilitarian AI, I think such a system is extremely unlikely to exist.
This seems like a possible crux, but I don’t fully understand what you’re saying here. Could you rephrase?
[Added] Pasting this from my reply to Josh:
(Maybe I could have put more emphasis on what kind of AI I have in mind. As my original comment mentioned, I’m talking about “a sufficiently strong version of ‘partly-utilitarian.’” So an AI that’s just slightly utilitarian wouldn’t count. More concretely, I have in mind something like: an agent that operates via a moral parliament in which utilitarianism has > 10% of representation.)
Sure, this would presumably be ~half as utility-producing as a utilitarian AI (unless something weird happened like value being nonlinear in resources, but maybe in that case the AI could flip a coin and do 100-0 or 0-100 instead of 50-50). And maybe this could come about as a result of trade/coordination. But it feels unlikely to me. In particular, while “moral parliament” isn’t fully specified, in my internal version of moral parliament, I would not expect a superintelligence to have much moral uncertainty, and certainly would not expect it to translate that moral uncertainty into using significant resources to optimize for different things. (Note that the original moral parliament post thought you should end up acting as if you’re certain in whatever policy wins a majority of delegates—not use 10% of resources on 10% of delegates.) And if we do anything except optimize for utility (or disutility) with some resources, I think those resources produce about zero utility.
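(To unpack the coin-flip parenthetical above, here’s a toy illustration with a made-up convex value function; the functional form is my assumption, not something from the comment:)

```python
# Illustration of the coin-flip point above, with a made-up convex value
# function v(r) = r**2 (value as a function of the fraction r of resources).
# Not a claim about what real value functions look like.

def v(resource_share: float) -> float:
    return resource_share ** 2  # convex: doubling resources more than doubles value

# Deterministic 50-50 split: each party gets v(0.5).
value_split = v(0.5)                          # 0.25

# Fair coin flip over 100-0 vs 0-100: each party gets v(1) half the time.
value_coinflip = 0.5 * v(1.0) + 0.5 * v(0.0)  # 0.5

print(value_split, value_coinflip)
# When value is convex in resources, both parties' expected value is higher
# under the coin flip than under the guaranteed 50-50 split, which is why
# the 100-0 / 0-100 randomization comes up.
```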
Ah good point, I was thinking of a moral parliament where representation is based on value pluralism rather than moral uncertainty, but I think you’re still right that a moral parliament approach (as originally conceived) wouldn’t produce the outcome I had in mind.
Still, is it that hard for some approach to produce a compromise? (Is the worry that [creating a powerful optimizer that uses significant resources to optimize for different things] is technically hard even if alignment has been solved? Edited to add: My intuition is this isn’t hard conditional on alignment being solved, since e.g. then you could just align the AI to an adequately pluralistic human or set of humans, or maybe directly reward this sort of pluralism in training, but I haven’t thought about it much.)
(A lot of my optimism comes from my assumption that ~all (popularity-weighted) value systems which compete with utilitarianism are at least somewhat scope-insensitive, which makes them easy to mostly satisfy with a small fraction of available resources. Are there any prominent value systems other than utilitarianism that are fully scope-sensitive?)
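(A toy sketch of the scope-insensitivity point, under assumed functional forms and constants that are purely illustrative:)

```python
import math

# Toy model of the scope-insensitivity point above. The functional forms and
# constants are invented for illustration, not drawn from the thread.

def scope_sensitive_value(r: float) -> float:
    """Utilitarian-style value: linear in the fraction r of resources used."""
    return r

def scope_insensitive_value(r: float, k: float = 60.0) -> float:
    """Saturating value: most of the attainable value comes from a small r."""
    return 1.0 - math.exp(-k * r)

# Give the scope-insensitive value system 5% of resources, utilitarianism 95%.
r_insensitive = 0.05
print(scope_insensitive_value(r_insensitive))      # ~0.95 of its maximum
print(scope_sensitive_value(1.0 - r_insensitive))  # 0.95 of the utilitarian optimum
# Under these assumptions both systems get ~95% of what they want, which is
# the sense in which a compromise could still be highly utilitarian.
```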
I agree that utilitarianism’s scope sensitivity means a compromise with less scope-sensitive systems could be highly utilitarian. And this may be very important. (But this seems far from certain to me: if Alice and Bob have equal power and decide to merge systems, and Alice is highly scope-sensitive and Bob isn’t, it seems likely Bob will still demand half of resources/etc., under certain reasonable assumptions. On the other hand, such agents may be able to make more sophisticated trades that provide extra security to the scope-insensitive and extra expected resources to the scope-sensitive.)
Regardless, I think scenarios where (1) a single agent controls ~all influence without needing to compromise or (2) multiple agents converge to the same final goals and so merge without compromise are more likely than scenarios where (3) agents with significant influence and diverse preferences compromise.
(So my answer to your second paragraph is “no, but”)
Good points re: negotiations potentially going poorly for Alice (added: and the potential for good compromise), and also about how I may be underestimating the probability of human values converging.
I still think scenario (1) is not so likely, because:
Any advanced AI will initially be created by a team, in which there will be pressures for at least intra-team compromise (and very possibly also external pressures).
More speculatively: maybe acausal trade will enable & incentivize compromise even if each world is unipolar (assuming there isn’t much convergence across worlds).
Sure. And I would buy that we should be generally uncertain. But note:
I don’t expect a team that designs advanced AI to also choose what it optimizes for (and I think this is more clear if we replace “what it optimizes for” with “how it’s deployed,” which seems reasonable pre-superintelligence). And regardless, that AI’s successors might have less diverse goals.
Setting aside potential compromise outcomes of acausal trade, what’s decision-relevant now is what future systems that might engage in acausal trade would value, and I instinctively doubt “partly-utilitarian” systems provide much of the expected value from acausal trade. But I’m of course extremely uncertain and not sure exactly how this matters.
Also I’m currently exhausted and tend to adopt soldier mindset when exhausted so what you’re saying is probably more convincing than I’m currently appreciating...
[noticing my excessive soldier mindset at least somewhat, I added a sentence at the end of the first paragraph of my previous comment]
No worries, I was probably doing something similar.
I don’t expect a team that designs advanced AI to also choose what it optimizes for (and I think this is more clear if we replace “what it optimizes for” with “how it’s deployed,” which seems reasonable pre-superintelligence)
Could you say a bit more about where you’re coming from here? (My initial intuition would be: assuming alignment ends up being based on some sort of (amplified) human feedback, doesn’t the AI developer get a lot of choice, through its control over who gives the human feedback and how feedback is aggregated (if there are multiple feedback-givers)?)
I instinctively doubt “partly-utilitarian” systems provide much of the expected value from acausal trade
Ah sorry, to clarify, what I had in mind was mostly that (fully) non-utilitarian systems, by trading with (fully) utilitarian systems, would provide much utilitarian value. (Although on second thought, that doesn’t clearly raise the value of partly utilitarian systems more than it raises the value of fully utilitarian systems. Maybe that’s what you were suggesting?)
I should learn more, and an employees-have-power view is shared by the one person in industry I’ve spoken about this with. But I think it’s less the “team” and more either leadership or whoever deploys the system that gets to choose what values the system’s deployment promotes. I also don’t expect alignment-with-human-values to look at all like amplification-of-asking-humans-about-their-values. Maybe you’re thinking of other kinds of human feedback, but then I don’t think it’s relevant to the AI’s values.
Acausal trade: I need to think about this sometime when I can do so carefully. In particular, I think we need to be careful about ‘providing value’ relative to the baseline of an empty universe vs [a non-utilitarian AI that trades with utilitarian AIs]. (It also might be the case that less scope-sensitive systems won’t be as excited about acausal trade?) For now, I don’t have a position and I’m confused about the decision-relevant upshot.
I’d be happy to discuss this on a call sometime.
I’m thinking of ~IDA with a non-adversarial (e.g. truthful) model, but could easily be mistaken. Curious what you’re expecting?
Fair, I’m also confused.
Sure! I’ll follow up.
I agree with Zach Stein-Perlman. I did some BOTECs to justify this (see ‘evaluating outcome 3’). If a reasonable candidate for a ‘partially-utilitarian AI’ leads to an outcome where there are 10 billion happy humans on average per star, then an AI that is using every last Joule of energy to produce positive experiences would produce at least ~ 10^15 times more utility.
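(Making the arithmetic behind the ~10^15 figure explicit; the 10^10 humans-per-star number is from the comment above, and the implied human-equivalents-per-star is simply back-solved from the claimed ratio rather than independently derived:)

```python
# Back-of-the-envelope arithmetic for the comparison above.
# 1e10 "happy humans per star" is the figure from the comment; the 1e15
# ratio is the comment's claim, so the implied number of human-equivalent
# experiences per star under full optimization is just their product.

humans_per_star_partial = 1e10  # partially-utilitarian outcome (from the comment)
claimed_ratio = 1e15            # "at least ~10^15 times more utility"

implied_equivalents_per_star_full = humans_per_star_partial * claimed_ratio
print(f"{implied_equivalents_per_star_full:.0e}")  # 1e+25 human-equivalents per star
# Whether ~1e25 per star is achievable is what the linked BOTECs argue;
# the point here is just that the 1e15 factor stands or falls with that estimate.
```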
Why would a reasonable candidate for a ‘partially-utilitarian AI’ lead to an outcome that’s ~worthless by utilitarian lights? I disagree with that premise—that sounds like a ~non-utilitarian AI to me, not a (nontrivially) partly utilitarian AI.
(Maybe I could have put more emphasis on what kind of AI I have in mind. As my original comment mentioned, I’m talking about “a sufficiently strong version of ‘partly-utilitarian.’” So an AI that’s just slightly utilitarian wouldn’t count. More concretely, I have in mind something like: an agent that operates via a moral parliament in which utilitarianism has > 10% of representation.)
[Added] See also my reply to Zach, in which I write:
What about the following counterexample? Suppose a powerful agent optimizes for a mixed objective, which leads to it optimizing ~half of the accessible universe for utilitarianism, the other ~half for some other scope-sensitive value, and a few planets for modal scope-insensitive human values. Then, even at high levels of capability, this universe will be ~half as good by utilitarian lights as a universe that’s fully optimized for utility, even though the optimizer wasn’t just optimizing for utility. (If you doubt whether there exist utility functions that would lead to roughly these outcomes, I’m happy to make arguments for that assumption.)
Yep, I didn’t initially understand you. That’s a great point!
This means the framework I presented in this post is wrong. I agree now with your statement:
the EV of partly utilitarian AI is higher than that of fully utilitarian AI.
I think the framework in this post can be modified to incorporate this and the conclusions are similar. The quantity that dominates the utility calculation is now the expected representation of utilitarianism in the AGI’s values.
The two handles become:
(1) The probability of misalignment.
(2) The expected representation of utilitarianism in the moral parliament conditional on alignment.
The conclusion of the post, then, should be something like “interventions that increase (2) might be underrated” instead of “interventions that increase the probability of fully utilitarian AGI are underrated.”
On second thought, another potential wrinkle, re: the representation of utilitarianism in the AI’s values. Here are two ways that could be defined:
In some sort of moral parliament, what % of representatives are utilitarian?
How good are outcomes relative to what would be optimal by utilitarian lights?
Arguably the latter definition is the more morally relevant one. The former is related but maybe not linearly. (E.g., if the non-utilitarians in a parliament are all scope-insensitive, maybe utilitarianism just needs 5% representation to get > 50% of what it wants. If that’s the case, then it may make sense to be risk-averse with respect to expected representation, e.g., maximize the chances that some sort of compromise happens at all.)
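(A toy model of how definition 1 could map onto definition 2 under the assumptions above; the faction counts, demanded shares, and threshold behavior are all invented for illustration:)

```python
# Toy illustration of the two definitions above: parliament representation
# (definition 1) vs the fraction of the utilitarian optimum achieved
# (definition 2). All specific numbers are invented assumptions.

def utilitarian_value_fraction(util_representation: float,
                               n_other_factions: int = 10,
                               share_demanded_per_faction: float = 0.03) -> float:
    """Fraction of the utilitarian optimum achieved, assuming scope-insensitive
    factions each settle for a small fixed resource share (since extra resources
    buy them little) whenever utilitarianism has any seats at all, and
    utilitarian optimization gets the remainder."""
    if util_representation <= 0.0:
        return 0.0  # no seats, no compromise on utilitarianism's behalf
    resources_to_others = n_other_factions * share_demanded_per_faction
    return max(0.0, 1.0 - resources_to_others)

print(utilitarian_value_fraction(0.05))  # ~0.7: 5% representation -> ~70% of the optimum
print(utilitarian_value_fraction(0.00))  # 0.0: no representation -> ~nothing
# The mapping is far from linear in representation, which is the motivation for
# being risk-averse about representation: what matters most is whether
# utilitarianism is in the compromise at all.
```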
Thanks! From the other comment thread, now I’m less confident in the moral parliament per se being a great framework, but I’d guess something along those lines should work out.