It seems like some discussion of s-risks is called for, as they seem to be assumed away here, though many longtermists are concerned about them.
It would be totally reasonable for the author to discuss s-risks. But only some s-risks are very concerning to utilitarians—for example, utilitarians don’t worry much about the s-risk of 10^30 suffering people in a universe with 10^40 flourishing people. And it’s not clear that utilitarian catastrophes are anywhere near as likely as the possible outcomes the author discusses. This post is written for utilitarians, and I’m not aware of arguments that it’s reasonably likely that the future is bad on a scale comparable to the goodness of “utilitarian AGI” (from a utilitarian perspective).
Utilitarianism =/= classical utilitarianism. I’m a utilitarian who would think that outcome is extremely awful. It depends on the axiology.
I was a bit surprised to see your ‘mediocre’ outcome defined thus:
The superintelligence is aligned to non-utilitarian values (probably normal human values) … [H]umans will populate all reachable galaxies and there will be 10 billion happy humans per star.
Having a superintelligence aligned to normal human values seems like a big win to me!
Given this somewhat unconventional definition of mediocre, it seems like this article is basically advocating for defecting in a prisoner’s dilemma. Yes, it is better for utilitarians if everyone else collaborates on avoiding extinction while the utilitarians free-ride and instead focus on promoting their own values, even though this is much worse for everyone else. But adopting this strategy seems quite hostile to the rest of humanity, and if everyone adopted it (e.g. Muslims focusing on trying to promote AGI-sharia rather than reducing extinction) we might all end up worse off (extinct). ‘Normal human values’, which include utility, seem like a natural Schelling point for collaboration and cooperation.
I agree that there might be reasons of moral cooperation and trade to compromise with other value systems. But when deciding how to cooperate, we should at least be explicitly guided by optimising for our own values, subject to constraints. I think it is far from obvious that aligning with the intent of the programmer is the best way to optimise for utilitarian values. Perhaps we should aim for utilitarian alignment first.
Not super sure what you mean by it being a big win, but the ‘normal human values’ outcome as I’ve defined it hardly contributes to EV calculations at all compared to the utopia outcome. If you disagree with this, please look at the math and let me know if I made a mistake.
Sure. The math is clearly very handwavy, but I think there are basically two issues.
Firstly, the mediocre outcome supposedly involves a superintelligence optimising for normal human values, potentially including simulating people. Yet it only involves 10 billion humans per star, fewer than we are currently forecast to support on a single un-optimised planet with no simulations, no AGI help and relatively primitive technology. At the very least I would expect massive terraforming and efficient food production to support much higher populations, if not full Dyson spheres and simulations. It’s not going to be as many people as the other scenario, but it’ll hopefully be more than Earth2100.
Secondly, I think the utilitarian outcome is over-valued on anything but purely utilitarian criteria. A world of soma-brains, without love, friendship, meaningful challenges etc. would strike many people as quite undesirable.
It seems like it would be relatively easy to make this world significantly better by conventional lights at relatively low utilitarian cost. For example, giving the simulated humans the ability to turn themselves off might incur a positive but small overhead (as presumably very few happy people would take this option), but be a significant improvement by the standards of conventional ethics that value consent.
Setting aside what this post said, here’s an attitude I think we should be sympathetic to:
There are possible futures that are great by prosaic standards, where all humans are flourishing and so forth. But some of these futures may not be great by the standards that everyone would adopt if we were smarter, wiser, better-informed, and so forth (which the author happens to believe is utilitarianism). Insofar as the latter is much more choice-worthy in expectation than the former, we should have great concern for not just ensuring survival, but also that good values are realized in the future. This may require some events happening, or some events happening before others, or some specific coordination, to achieve. Phrased more provocatively, superintelligence aligned with normal human values is a prima facie existential catastrophe, since normal human values probably aren’t really good, or aren’t what we would be promoting if we were wiser/etc. I’m not sure the Schelling point note is relevant—it depends on which agents are coordinating on AI—but if it is, a better Schelling point may be some kind of extrapolation of human values.
Edit: ok, I agree we should be cautious about acting certain in utilitarianism or whatever we may happen to value when those-with-whom-we-should-cooperate disagree.
Yes, I agree with that. I think aiming for some sort of CEV-like system to find such values in the future, via some robustly-not-value-degrading process, seems like a good idea. Hopefully such a process could gain widespread assent. It’s the jumping straight to the (perceived) conclusion I am objecting to.
Thanks for this! I might tweak claim 1 to the following: The probability that this AI has partly utilitarian values dominates EV calculations. (In a soft sense of “dominates”—i.e., it’s the largest single factor, but not the approximately only factor.)
Argument for this version of the claim over the original one:
From a utilitarian view, partly-utilitarian AI would be “just” a few times less valuable than fully utilitarian AI (for a sufficiently strong version of “partly-utilitarian”).
There’s lots of room for moral trade / win-win compromises between different value systems. For example, common scope-insensitive values and utilitarian values can both get most of what they want. So partly-utilitarian AI could easily be ~similarly valuable (say, half as valuable) as fully utilitarian AI.
And partly-utilitarian AI is more than a few times more likely than fully utilitarian AI to come about.
Most AI developers would be much more likely to make their AI partly utilitarian than fully utilitarian, since this pluralism may better reflect their values and better accommodate (internal and external) political pressures.
Efforts to make AI pluralistic mitigate “race to the bottom” dynamics by making “losing” much less bad for actors who don’t develop advanced AI first. So pluralistic efforts are significantly more likely to succeed at making aligned AI at all.
Since it’s at worst a factor of a few less valuable and it’s more than a few times more likely, the EV of partly utilitarian AI is higher than that of fully utilitarian AI. (A numeric sketch of this comparison follows after this comment.)
[Edit: following the below discussion, I’m now less confident in the second premise above, so now I’m fairly unsure which of P(AI is fully utilitarian) and P(AI is partly utilitarian) is more important, and I suspect neither is > 10x more important than the other.]
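To make the arithmetic of that comparison explicit, here is a minimal sketch; the specific numbers are placeholders chosen to match the qualitative claims (“at worst a factor of a few less valuable”, “more than a few times more likely”), not figures from the comment.

```python
# Placeholder numbers, chosen only to illustrate the structure of the argument above.
value_ratio = 1 / 3     # assume partly-utilitarian AI is ~1/3 as valuable as fully utilitarian AI
probability_ratio = 10  # assume it is ~10x as likely to come about

ev_ratio = value_ratio * probability_ratio
print(ev_ratio)  # ~3.3 > 1, so under these assumptions the partly-utilitarian term dominates the EV
```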
I disagree. I think “partly-utilitarian AI,” in the standard sense of the phrase, would produce orders of magnitude less utility than a system optimizing for utility, just because optimization for [anything else] probably comes apart from optimization for utility at high levels of capability. If we stipulate that “partly-utilitarian AI” makes a decent fraction of the utility of a utilitarian AI, I think such a system is extremely unlikely to exist.
Thanks for pushing back!
What about the following counterexample? Suppose a powerful agent optimizes for a mixed objective, which leads to it optimizing ~half of the accessible universe for utilitarianism, the other ~half for some other scope-sensitive value, and a few planets for modal scope-insensitive human values. Then, even at high levels of capability, this universe will be ~half as good by utilitarian lights as a universe that’s fully optimized for utility, even though the optimizer wasn’t just optimizing for utility. (If you doubt whether there exist utility functions that would lead to roughly these outcomes, I’m happy to make arguments for that assumption.)
(I also don’t yet find your linked argument convincing. It argues that “If a future does not involve optimizing for the good, value is almost certainly near-zero.” I agree, but imo it’s quite a leap to conclude from that that [If a future does not only involve optimizing for the good, value is almost certainly near-zero.])
Your second point (that if we stipulate “partly-utilitarian AI” makes a decent fraction of the utility of a utilitarian AI, such a system is extremely unlikely to exist) seems like a possible crux, but I don’t fully understand what you’re saying here. Could you rephrase?
[Added] Pasting this from my reply to Josh:
(Maybe I could have put more emphasis on what kind of AI I have in mind. As my original comment mentioned, I’m talking about “a sufficiently strong version of ‘partly-utilitarian.’” So an AI that’s just slightly utilitarian wouldn’t count. More concretely, I have in mind something like: an agent that operates via a moral parliament in which utilitarianism has > 10% of representation.)
Sure, this would presumably be ~half as utility-producing as a utilitarian AI (unless something weird happened like value being nonlinear in resources, but maybe in that case the AI could flip a coin and do 100-0 or 0-100 instead of 50-50). And maybe this could come about as a result of trade/coordination. But it feels unlikely to me. In particular, while “moral parliament” isn’t fully specified, in my internal version of moral parliament, I would not expect a superintelligence to have much moral uncertainty, and certainly would not expect it to translate that moral uncertainty into using significant resources to optimize for different things. (Note that the original moral parliament post thought you should end up acting as if you’re certain in whatever policy wins a majority of delegates—not use 10% of resources on 10% of delegates.) And if we do anything except optimize for utility (or disutility) with some resources, I think those resources produce about zero utility.
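To make the contrast concrete, here is a minimal toy sketch of the two readings of “10% utilitarian representation” discussed above: a winner-take-all parliament versus a proportional split of resources. The rules and numbers are my own illustration, not anything specified in the moral parliament post.

```python
# Toy contrast between two ways "utilitarianism has 10% of representation" could cash out.
# Both rules are illustrative assumptions, not descriptions of the original proposal's details.

def majority_rule_share(utilitarian_representation: float) -> float:
    # Whatever policy a majority of delegates backs wins every vote,
    # so a 10% utilitarian bloc directs ~no resources toward utility.
    return 1.0 if utilitarian_representation > 0.5 else 0.0

def proportional_share(utilitarian_representation: float) -> float:
    # Resources are split in proportion to representation.
    return utilitarian_representation

for rule in (majority_rule_share, proportional_share):
    print(rule.__name__, rule(0.10))
# majority_rule_share 0.0 -> close to the "act as if certain in the majority policy" reading
# proportional_share 0.1  -> close to the "significant resources for each value system" reading
```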
Ah good point, I was thinking of a moral parliament where representation is based on value pluralism rather than moral uncertainty, but I think you’re still right that a moral parliament approach (as originally conceived) wouldn’t produce the outcome I had in mind.
Still, is it that hard for some approach to produce a compromise? (Is the worry that [creating a powerful optimizer that uses significant resources to optimize for different things] is technically hard even if alignment has been solved? Edited to add: My intuition is this isn’t hard conditional on alignment being solved, since e.g. then you could just align the AI to an adequately pluralistic human or set of humans, or maybe directly reward this sort of pluralism in training, but I haven’t thought about it much.)
(A lot of my optimism comes from my assumption that ~all (popularity-weighted) value systems which compete with utilitarianism are at least somewhat scope-insensitive, which makes them easy to mostly satisfy with a small fraction of available resources. Are there any prominent value systems other than utilitarianism that are fully scope-sensitive?)
I agree that utilitarianism’s scope sensitivity means a compromise with less scope-sensitive systems could be highly utilitarian. And this may be very important. (But this seems far from certain to me: if Alice and Bob have equal power and decide to merge systems, and Alice is highly scope-sensitive and Bob isn’t, it seems likely Bob will still demand half of resources/etc., under certain reasonable assumptions. On the other hand, such agents may be able to make more sophisticated trades that provide extra security to the scope-insensitive and extra expected resources to the scope-sensitive.)
Regardless, I think scenarios where (1) a single agent controls ~all influence without needing to compromise or (2) multiple agents converge to the same final goals and so merge without compromise are more likely than scenarios where (3) agents with significant influence and diverse preferences compromise.
(So my answer to your second paragraph is “no, but”)
Good points re: negotiations potentially going poorly for Alice (added: and the potential for good compromise), and also about how I may be underestimating the probability of human values converging.
I still think scenario (1) is not so likely, because:
Any advanced AI will initially be created by a team, in which there will be pressures for at least intra-team compromise (and very possibly also external pressures).
More speculatively: maybe acausal trade will enable & incentivize compromise even if each world is unipolar (assuming there isn’t much convergence across worlds).
Sure. And I would buy that we should be generally uncertain. But note:
I don’t expect a team that designs advanced AI to also choose what it optimizes for (and I think this is more clear if we replace “what it optimizes for” with “how it’s deployed,” which seems reasonable pre-superintelligence). And regardless, that AI’s successors might have less diverse goals.
Setting aside potential compromise outcomes of acausal trade, what’s decision-relevant now is what future systems that might engage in acausal trade would value, and I instinctively doubt “partly-utilitarian” systems provide much of the expected value from acausal trade. But I’m of course extremely uncertain and not sure exactly how this matters.
Also I’m currently exhausted and tend to adopt soldier mindset when exhausted so what you’re saying is probably more convincing than I’m currently appreciating...
[noticing my excessive soldier mindset at least somewhat, I added a sentence at the end of the first paragraph of my previous comment]
No worries, I was probably doing something similar.
Could you say a bit more about where you’re coming from on not expecting the team that designs advanced AI to choose what it optimizes for? (My initial intuition would be: assuming alignment ends up being based on some sort of (amplified) human feedback, doesn’t the AI developer get a lot of choice, through its control over who gives the human feedback and how feedback is aggregated (if there are multiple feedback-givers)?)
Ah sorry, to clarify, regarding the expected value from acausal trade: what I had in mind was mostly that (fully) non-utilitarian systems, by trading with (fully) utilitarian systems, would provide much utilitarian value. (Although on second thought, that doesn’t clearly raise the value of partly utilitarian systems more than it raises the value of fully utilitarian systems. Maybe that’s what you were suggesting?)
I should learn more, and an employees-have-power view is shared by the one person in industry I’ve spoken about this with. But I think it’s less the “team” and more either leadership or whoever deploys the system that gets to choose what values the system’s deployment promotes. I also don’t expect alignment-with-human-values to look at all like amplification-of-asking-humans-about-their-values. Maybe you’re thinking of other kinds of human feedback, but then I don’t think it’s relevant to the AI’s values.
Acausal trade: I need to think about this sometime when I can do so carefully. In particular, I think we need to be careful about ‘providing value’ relative to the baseline of an empty universe vs [a non-utilitarian AI that trades with utilitarian AIs]. (It also might be the case that less scope-sensitive systems won’t be as excited about acausal trade?) For now, I don’t have a position and I’m confused about the decision-relevant upshot.
I’d be happy to discuss this on a call sometime.
I’m thinking of ~IDA with a non-adversarial (e.g. truthful) model, but could easily be mistaken. Curious what you’re expecting?
Fair, I’m also confused.
Sure! I’ll follow up.
I agree with Zach Stein-Perlman. I did some BOTECs to justify this (see ‘evaluating outcome 3’). If a reasonable candidate for a ‘partially-utilitarian AI’ leads to an outcome where there are 10 billion happy humans on average per star, then an AI that is using every last Joule of energy to produce positive experiences would produce at least ~ 10^15 times more utility.
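For readers who want to see how a factor of that order could arise, here is a rough back-of-the-envelope sketch. The inputs (a Sun-like star’s luminosity of ~4×10^26 W and ~20 W per simulated mind, roughly a human brain’s power budget) are my own illustrative assumptions, not numbers taken from the post’s BOTECs.

```python
# Hedged back-of-the-envelope: how large might the per-star gap be?
# All inputs are illustrative assumptions, not figures from the post.

solar_luminosity_w = 3.8e26  # approximate power output of a Sun-like star (watts)
watts_per_mind = 20.0        # assume a simulated happy mind needs roughly a human brain's power budget

minds_if_fully_optimized = solar_luminosity_w / watts_per_mind  # ~2e25 minds per star
happy_humans_per_star = 1e10  # the 10 billion per star from the 'normal human values' outcome

print(f"{minds_if_fully_optimized / happy_humans_per_star:.1e}")  # ~1.9e15, i.e. roughly the ~10^15 factor
```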
Why would a reasonable candidate for a ‘partially-utilitarian AI’ lead to an outcome that’s ~worthless by utilitarian lights? I disagree with that premise—that sounds like a ~non-utilitarian AI to me, not a (nontrivially) partly utilitarian AI.
(Maybe I could have put more emphasis on what kind of AI I have in mind. As my original comment mentioned, I’m talking about “a sufficiently strong version of ‘partly-utilitarian.’” So an AI that’s just slightly utilitarian wouldn’t count. More concretely, I have in mind something like: an agent that operates via a moral parliament in which utilitarianism has > 10% of representation.)
[Added] See also the counterexample in my reply to Zach above (the agent optimizing a mixed objective that devotes ~half of the accessible universe to utilitarianism).
Yep, I didn’t initially understand you. That’s a great point!
This means the framework I presented in this post is wrong. I agree now with your statement that the EV of partly utilitarian AI is higher than that of fully utilitarian AI.
I think the framework in this post can be modified to incorporate this and the conclusions are similar. The quantity that dominates the utility calculation is now the expected representation of utilitarianism in the AGI’s values.
The two handles become:
(1) The probability of misalignment.
(2) The expected representation of utilitarianism in the moral parliament conditional on alignment.
The conclusion of the post, then, should be something like “interventions that increase (2) might be underrated” instead of “interventions that increase the probability of fully utilitarian AGI are underrated.”
On second thought, another potential wrinkle, re: the representation of utilitarianism in the AI’s values. Here are two ways that could be defined:
In some sort of moral parliament, what % of representatives are utilitarian?
How good are outcomes relative to what would be optimal by utilitarian lights?
Arguably the latter definition is the more morally relevant one. The former is related but maybe not linearly. (E.g., if the non-utilitarians in a parliament are all scope-insensitive, maybe utilitarianism just needs 5% representation to get > 50% of what it wants. If that’s the case, then it may make sense to be risk-averse with respect to expected representation, e.g., maximize the chances that some sort of compromise happens at all.)
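As a toy illustration of that parenthetical, here is a sketch of why a compromise could give utilitarianism most of what it wants even with a small share of representation, provided the other values saturate quickly in resources. The functional form and the saturation scale are my own assumptions.

```python
import math

# Toy model: a scope-insensitive value system whose utility saturates quickly in
# resources, versus a utilitarian one whose utility is linear in resources.
# The exponential form and the 1% saturation scale are illustrative assumptions.

def scope_insensitive_utility(resource_fraction: float, scale: float = 0.01) -> float:
    # Reaches ~63% of its maximum with 1% of resources and ~99% with ~5%.
    return 1.0 - math.exp(-resource_fraction / scale)

def utilitarian_utility(resource_fraction: float) -> float:
    # Normalized so that controlling all resources gives utility 1.
    return resource_fraction

# A compromise that gives the scope-insensitive side only 5% of resources:
print(scope_insensitive_utility(0.05))  # ~0.99 of that side's maximum
print(utilitarian_utility(0.95))        # utilitarianism still gets 95% of its maximum
```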
Thanks! From the other comment thread, now I’m less confident in the moral parliament per se being a great framework, but I’d guess something along those lines should work out.
This analysis seems to neglect all “net negative outcomes”, including scenarios in which s-risks are realized (as Mjeard noted), the badness of which can go all the way to the opposite extreme (see e.g. “Astronomical suffering from slightly misaligned artificial intelligence”).
Including that consideration may support a more general focus on ensuring a better quality of the future, which may also be supported by considerations related to grabby aliens.
Thanks for this post. I upvoted this and think the point you make is important and under-discussed.
That said, I also disagree with this post in some ways. In particular, I think the ideal version of this post would pay more attention to:
moral uncertainty*
“doom” outcomes other than extinction (especially unrecoverable dystopia)
the benefits of cooperativeness across worldviews and the harms of adversarial approaches
See e.g. https://longtermrisk.org/gains-from-trade-through-compromise/ and https://centerforreducingsuffering.org/research/why-altruists-should-be-cooperative/
I see these more as important oversights than as areas where the post makes explicit false claims. E.g., the post does acknowledge at the top that it assumes total utilitarianism, but it still seems to assume perfect confidence in total utilitarianism and to frame that as reasonable rather than just “for the sake of argument”, and I think that makes the post both less valuable and perhaps more misleading than it could be.
But also, tbc, it’s fair enough to write a post that moves conversations forward in some ways but isn’t perfect!
(I haven’t read the comments, so maybe much of this is covered already.)
Thanks, I have not read this in detail but I suspect something in this general direction is true (see earlier notes here).
I don’t actually think outcome 3 is achievable or particularly desirable. You’re basically asking for an AI that relentlessly cuts any non-optimal resource expenditure in favor of more and more strongly optimizing for the “good”. I think the default result of such a process is that it finds some configuration of matter which best fits its conception of “happy” / “meaningful” / “good”, and sacrifices everything that’s not part of such a conception.
I also don’t think our values are shaped like that. I think a single human’s values derive from a multi-agent negotiation among a continuous distribution over possible internal sub-agents. This means they’re inherently dynamic, constantly changing in response to your changing cognitive environment. It also means that we essentially have a limitless variety of internal values, whose external expression is limited by our finite resources/capabilities. Restricting the future’s values to a single, limited snapshot of that process just seems… not good.
Thanks for this, I think it is an important and under-discussed point. In their AI alignment work, EAs seem to be aiming for intent-alignment rather than social welfare production, which I think is plausibly a very large mistake, or at least one that hasn’t received very much scrutiny.
Incidentally, I also don’t know what it means to say that we have aligned AIs with ‘our values’. Since there is disagreement, ‘our’ has no referent here.
I think this needs more justification. What if P(outcome 3) < 10^-25?
Moreover, we need to account for tractability. If one outcome is much more tractable, then it will be much more cost-effective.
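A minimal sketch of the kind of comparison this points at: expected impact per unit of effort depends on both how much an intervention shifts the probability of an outcome (tractability) and how valuable that outcome is. Every number below is a placeholder made up purely to show the structure of the calculation.

```python
# Placeholder cost-effectiveness comparison; all numbers are made up for illustration.
interventions = {
    # name: (change in outcome probability per dollar, value of outcome in arbitrary units)
    "increase P(outcome 3)": (1e-20, 1e25),
    "reduce P(extinction)":  (1e-12, 1e18),
}

for name, (delta_p_per_dollar, outcome_value) in interventions.items():
    print(name, delta_p_per_dollar * outcome_value)  # expected value per dollar, arbitrary units
# With these made-up numbers the less valuable but more tractable intervention wins,
# which is the point: tractability can dominate the comparison.
```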
The author or readers might also find the following interesting:
Flourishing futures (a list of resources on that topic)
Holden Karnofsky’s call for people to think about “How should we value various possible long-run outcomes relative to each other?” and his notes on why and how to do so[1]
That said, fwiw, since I’m recommending Holden’s doc, I should also flag that I think the breakdown of possible outcomes that Holden sketches there isn’t a good one, because:
He defines utopia, dystopia, and “middling worlds” solely by how good they are, whereas “paperclipping” is awkwardly squeezed in with a definition based on how it comes about (namely, that it’s a world run by misaligned AI). This leads to two issues in my view:
I think the classic paperclipping scenario would itself be a “middling” world, yet Holden frames “paperclipping” as a distinct concept from “middling” worlds.
Misaligned AI actually need not lead to something approx. as good/bad as paperclipping; it could instead lead to dystopia, or could maybe lead to utopia, depending on how we define “alignment” and depending on metaethics.
There’s no explicit mention of extinction.
I think Holden is seeing “paperclipping” as synonymous with extinction?
But misaligned AI need not lead to extinction.
And extinction is “middling” relative to utopia and dystopia.
And extinction is also very different from some other “middling” worlds according to many ethical theories (though probably not total utilitarianism).
I would mostly like to protest your notion of utopia. A universe where every gram of matter is used for making brains sounds terrible. A “good” life involves interaction with other brains as well as a living environment.
Yeah, I would be in favor of interaction in simulated environments—others might disagree, but I don’t think this influences the general argument very much, as I don’t think leaving some matter for computers will reduce the number of brains by more than an order of magnitude or so.
That’s not what I meant. What I tried to say is that the universe is full of beautiful things, like galaxies, plants, hills, dogs… More generally, complex systems with so many interesting things happening on so many scales. When I imagine a utopia, I picture a thriving human society in “harmony”, or at least at peace, with nature. Converting all of it into simulated brains sounds like a dystopian nightmare to me.
Ever since I first thought about my intrinsic values, I’ve known there’s some divergence between e.g. valuing beauty and valuing happiness alone. But I had never managed to imagine a scenario where increasing one goes so much against the other, until now.
I think a large part of any hypothetical world being a utopia is that people would like to live in it. I’m not sure that, if you asked people about this scenario, they would find it favourable.
One key issue is that we very likely do not know enough about what utopia means or how to achieve it. We also don’t know enough about the current expected value of the long-run future (even conditional on survival). And we likely won’t make much progress on these difficult questions before AI or other x-risks arrive. Reducing P(extinction) seems to be a necessary condition for being in a position to use safe AI to make progress on the fields we need to figure out in order to understand what utopia means, increase P(utopia), and avoid downside risks such as s-risks in the process.
Examples of fields that could be particularly important to point our safe and aligned AI towards:
- Moral philosophy (and in particular to check whether total utilitarianism is correct or if we can update to better alternatives)
- Governance mechanisms and Economics to implement our extrapolated ideal moral system in the world
It might be preferable to focus on reducing P(doom) AND reducing the risk of a premature, irreversible race to the universe, to give us ample time to use our safe and aligned AI to solve other important problems and make substantial progress in the natural sciences, social sciences and philosophy (a “long reflection” with AI that does not need to be long on astronomical timescales).
The title of this post is a general claim about the long-term future, and yet nowhere in your post do you mention any x-risks other than AI. Why should we not expect other x-risks to outweigh these AGI considerations, since they may not fit into this framework of extinction, OK outcome, and utopian outcome? I am not necessarily convinced that pulling the utopia handle on actions related to AGI (like the four you suggest) has a greater effect on P(utopia) than some set of non-AGI-related interventions.
Did your outcomes 2 and 3 get mixed up at some point? I feel like the evaluations don’t align with the initial descriptions of those, but maybe I’m misunderstanding.
Thanks for writing this though; it’s something I’ve been thinking a little about as I try to understand longtermism better. It makes sense to be risk-averse with existential risk, but at the same time I have a hard time understanding some of the more extreme takes. My wild guess would be that AI has a significantly higher chance of improving the well-being of humanity than of causing extinction. Like I said, care is warranted with existential risk, but at the same time slowing AI development delays your positive outcomes 2 and 3, and I haven’t seen much discussion about the downsides of delaying.
Also I’m not sure about outcome 1 having zero utility, maybe that’s standard notation but it seems unintuitive to me, like it kind of buries the downsides of extinction risk. To me it would seem more natural as a negative utility, relative to the positive utility currently existing in the world.
Yep, thanks for pointing that out! Fixed it.
I’m not sure how your first point relates to what I was saying in this post, but I’ll take a guess. I said something about how investing in capabilities at Anthropic could be good. An upside to this would be increasing the probability that EAs end up controlling the super-intelligent AGI in the future. The downside is that it could shorten timelines, but hopefully this can be mitigated by keeping all of the research under wraps (which is what they are doing). This is a controversial issue though. I haven’t thought very much about whether the upsides outweigh the downsides, but the argument in this post caused me to believe the upsides were larger than I thought before.
It doesn’t matter which outcome you assign zero value to, as long as the relative values are the same: if one utility function is a positive affine transformation of another, they produce equivalent decisions.
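For what it’s worth, here is a quick self-contained check of that claim: ranking lotteries by expected utility is unchanged under any positive affine transformation u' = a·u + b with a > 0. The outcomes, lotteries, and transform below are arbitrary numbers chosen for illustration.

```python
# Toy check: expected-utility rankings are identical under u and a*u + b with a > 0.
# The outcomes, lotteries, and transform are arbitrary illustrative numbers.

outcomes = [0.0, 1.0, 5.0]   # utilities of three outcomes (e.g. extinction set to zero)
lotteries = [                 # probability distributions over the three outcomes
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
a, b = 2.5, -7.0              # any positive affine transform (a > 0)

def expected_value(utilities, lottery):
    return sum(p * u for p, u in zip(lottery, utilities))

def ranking(utilities):
    return sorted(range(len(lotteries)), key=lambda i: expected_value(utilities, lotteries[i]))

assert ranking(outcomes) == ranking([a * u + b for u in outcomes])  # same ordering, hence the same decisions
print(ranking(outcomes))  # [0, 1, 2] for both utility functions
```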
Sorry, what I said wasn’t very clear. Attempting to rephrase: I was thinking more along the lines of what the possible future for AI might look like if there were no EA interventions in the AI space. I haven’t seen much discussion of the possible downsides there (for example, slowing down AI research by prioritizing alignment, resulting in delays in AI advancement and in the good things it could bring about). But this was a less-than-half-baked idea; thinking about it some more, I’m having trouble coming up with scenarios where that could produce a lower expected utility.
Thanks, I follow this now and see what you mean.