I think this post misses the key considerations for perspective (1): longtermist-style scope sensitive utilitarianism. In this comment, I won’t make a positive case for the value of preventing AI takeover from a perspective like (1), but I will argue why I think the discussion in this post mostly misses the point.
(I separately think that preventing unaligned AI control of resources makes sense from perspective (1), but you shouldn’t treat this comment as my case for why this is true.)
You should treat this comment as (relatively : )) quick and somewhat messy notes rather than a clear argument. Sorry, I might respond to this post in a more clear way later. (I’ve edited this comment to add some considerations which I realized I neglected.)
I might be somewhat biased in this discussion as I work in this area and there might be some sunk cost fallacy at work.
First:
Argument two: aligned AIs are more likely to have a preference for creating new conscious entities, furthering utilitarian objectives
It seems odd to me that you don’t focus almost entirely on this sort of argument when considering total utilitarian style arguments. Naively these views are fully dominated by the creation of new entities, who are far more numerous and likely could be much more morally valuable than economically productive entities. So, I’ll just be talking about a perspective like this, where creating new beings with “good” lives dominates.
With that in mind, I think you fail to discuss a large number of extremely important considerations from my perspective:
Over time (some subset of) humans (and AIs) will reflect on their views and preferences and will consider utilizing resources in different ways.
Over time (some subset of) humans (and AIs) will get much, much smarter or, more minimally, receive advice from entities which are much smarter.
It seems likely to me that the vast, vast majority of moral value (from this sort of utilitarian perspective) will be produced via people trying to improve moral value rather than incidentally via economic production. This applies to both aligned and unaligned AI. I expect that only a tiny fraction of available computation goes toward optimizing economic production, that only a smaller fraction of this is morally relevant, and that the weight on this moral relevance is much lower than for computation specifically optimized for moral value when operating from a similar perspective. This bullet is somewhere between a consideration and a claim, though it seems like possibly our biggest disagreement. I think it’s possible that this disagreement is driven by some of the other considerations I list.
Exactly what types of beings are created might be much more important than quantity.
Ultimately, I don’t care about a simplified version of total utilitarianism; I care about what preferences I would endorse on reflection. There is a moderate a priori argument for thinking that other humans who bother to reflect on their preferences might end up in a similar epistemic state. And I care less about the preferences which are relatively contingent among people who are thoughtful about reflection.
Large fractions of the current wealth of the richest people are devoted to what they claim is altruism. My guess is that this will increase over time.
Just doing a trend extrapolation on people who state an interest in reflection and scope sensitive altruism already indicates a non-trivial fraction of resources if we weight by current wealth/economic power. (I think, I’m not totally certain here.) This case is even stronger if we consider groups with substantial influence over AI.
Being able to substantially affect the preferences of (at least partially unaligned) AIs that will seize power/influence still seems extremely leveraged under perspective (1), even if we accept the arguments in your post. I think this is less leveraged than retaining human control (as we could always later create AIs with the preferences we desire, and I think people with a similar perspective to me will have substantial power). However, it is plausible that under your empirical views the dominant question in being able to influence the preferences of these AIs is whether you have power, not whether you have technical approaches which suffice.
I think if I had your implied empirical views about how humanity and unaligned AIs use resources, I would be very excited about a proposal like “politically agitate for humanity to defer most resources to an AI successor which has moral views that people can agree are broadly reasonable and good behind the veil of ignorance”. I think your views imply that massive amounts of value are left on the table in either case, such that humanity (hopefully willingly) forfeiting control to a carefully constructed successor looks amazing.
Humans who care about using vast amounts of computation might be able to use their resources to buy this computation from people who don’t care. Suppose 10% of people (really, resource-weighted people) care about reflecting on their moral views and doing scope-sensitive altruism of a utilitarian bent, and 90% of people care about jockeying for status without reflecting on their views. It seems plausible to me that the 90% will jockey for status via things that consume relatively small amounts of computation, like buying fancier pieces of land on earth or the coolest-looking stars, while the 10% of people who care about using vast amounts of computation can buy it relatively cheaply. Thus, most of the computation will go to those who care. Probably most people who don’t reflect and buy purely positional goods will care less about computation than about things like random positional goods (e.g. land on earth, which will be bid up to (literally) astronomical prices). I could see fashion going either way, but computation becoming the dominant status good seems unlikely unless people do heavy reflection. And if they heavily reflect, then I expect more altruism etc.
Your preference-based arguments seem uncompelling to me because I expect that the dominant source of beings won’t be economic production. But I also don’t understand a version of preference utilitarianism which seems to match what you’re describing, so this seems mostly unimportant.
Given some of our main disagreements, I’m curious what you think humans and unaligned AIs will be economically consuming.
Also, to be clear, none of the considerations I listed make a clear and strong case for unaligned AI being less morally valuable, but they do make the case that the relevant argument here is very different from the considerations you seem to be listing. In particular, I think value won’t be coming from incidental consumption.
With that in mind, I think you fail to discuss a large number of extremely important considerations from my perspective:
If you could highlight only one consideration that you think I missed in my post, which one would you highlight? And (to help me understand it) can you pose the consideration in the form of an argument, in a way that directly addresses my thesis?
Hmm, this is more of a claim than a consideration, but I’d highlight:
It seems likely to me that the vast, vast majority of moral value (from this sort of utilitarian perspective) will be produced via people trying to improve moral value rather than incidentally via economic production. This applies to both aligned and unaligned AI. I expect that only a tiny fraction of available computation goes toward optimizing economic production, that only a smaller fraction of this is morally relevant, and that the weight on this moral relevance is much lower than for computation specifically optimized for moral value when operating from a similar perspective. This bullet is somewhere between a consideration and a claim, though it seems like possibly our biggest disagreement. I think it’s possible that this disagreement is driven by some of the other considerations I list.
The main thing this claim disputes is:
Consequently, in a scenario where AIs are aligned with human preferences, the consciousness of AIs will likely be determined mainly by economic efficiency factors during production, rather than by moral considerations.
(and some related points).
Sorry, I don’t think this exactly addresses your comment. I’ll maybe try to do a better job in a bit. I think a bunch of the considerations I mention are relatively diffuse, but important in aggregate.
Maybe the most important single consideration is something like:
Value can be extremely dense in computation relative to the density of value from AIs used for economic activity (instead of for value).
So, we should focus on the question of entities trying to create morally valuable lives (or experiences, or whatever relevantly similar property we care about) and then answer that question.
(You do seem to talk about “will AIs have more/less utilitarian impulses than humans”, but you seem to talk about this almost entirely from the perspective of growing the economy rather than questions like how good the lives will be.)
Do you have an argument for why humans are more likely to try to create morally valuable lives compared to unaligned AIs?
I personally feel I addressed this particular question already in the post, although I framed it slightly differently than you have here. So I’m trying to get a better sense as to why you think my argument in the post about this is weak.
A short summary of my position is that unaligned AIs could be even more utilitarian than humans are, and this doesn’t seem particularly unlikely either given that (1) humans are largely not utilitarians themselves, (2) consciousness doesn’t seem special or rare, so it’s likely that unaligned AIs could care about it too, and (3) unaligned AIs will be trained on human data, so they’ll likely share our high-level concepts about morality even if not our exact preferences.
Let me know what considerations you think I’m still missing here.
[ETA: note that after writing this comment, I sharpened the post slightly to make it a little more clear that this was my position in the post, although I don’t think I fundamentally added new content to the post.]
Do you have an argument for why humans are more likely to try to create morally valuable lives compared to unaligned AIs?
TBC, the main point I was trying to make was that you didn’t seem to be presenting arguments about what seems to me like the key questions. Your summary of your position in this comment seems much closer to arguments about the key questions than I interpreted your post being. I interpreted your post as claiming that most value would result from incidental economic consumption under either humans or unaligned AIs, but I think you maybe don’t stand behind this.
Separately, I think the “maybe AIs/humans will be selfish and/or not morally thoughtful” argument mostly just hits both unaligned AIs and humans equally hard such that it just gets normalized out. And then the question is more about how much you care about the altruistic and morally thoughtful subset.
(E.g., the argument you make in this comment seemed to me like about 1⁄6 of your argument in the post and it’s still only part of the way toward answering the key questions from my perspective. I think I partially misunderstood the emphasis of your argument in the post.)
I do have arguments for why I think human control is more valuable than control by AIs that seized control from humans, but I’m not going to explain them in detail in this comment. My core summary would be something like “I expect substantial convergence toward my utilitarian-ish views among morally thoughtful humans who reflect; I expect notably less convergence between me and AIs. I expect that AIs will have somewhat messed up, complex, and specific values, in ways which might make them not care about things we care about as a result of current training processes, while I don’t have such an argument for humans.”
As far as what I do think the key questions are, I think they are something like:
What preferences do humans/AIs have regarding radically longer lives, massive self-enhancement, and potentially long periods of reflection?
How much do values/views diverge/converge between different altruistically minded humans who’ve thought about it for extremely long durations?
Even if various entities are into creating “good experiences”, how much do these views diverge in what is best? My guess would be that even if two entities are each maximizing good experiences from their own perspective, the relative goodness per unit of compute can be much lower from the other entity’s perspective (e.g. easily 100x lower, maybe more).
How similar are my views on what is good after reflection to other humans vs AIs?
How much should we care about worlds where morally thoughtful humans reach radically different conclusions on reflection?
Structurally, what sorts of preferences do AI training processes impart on AIs, conditional on these AIs successfully seizing power? (I also think such a seizure would likely happen despite humanity resisting to at least some extent.)
It seems like your argument is something like “who knows about AI preferences, also, they’ll probably have similar concepts as we do” and “probably humanity will just have the same observed preferences as they currently do”.
But I think we can get much more specific guesses about AI preferences, such that this weak indifference principle seems unimportant, and I think human preferences will change radically, e.g. preferences will change far more in the next 10 million years than in the last 2,000 years.
Note that I’m not making an argument for greater value on human control in this comment, just trying to explain why I don’t think your argument is very relevant. I might try to write up something about my overall views here, but it doesn’t seem like my comparative advantage and it currently seems non-urgent from my perspective. (Though embarrassing for the field as a whole.)
I interpreted your post as claiming that most value would result from incidental economic consumption under either humans or unaligned AIs, but I think you maybe don’t stand behind this.
It’s possible we’re using these words differently, but I guess I’m not sure why you’re downplaying the value of economic consumption here. I focused on economic consumption for a simple reason: economic consumption is intrinsically about satisfying the preferences of agents, including the type of preferences you seem to think matter. For example, I’d classify most human preferences as consumption, including their preference to be happy, which they try to satisfy via various means.
If either a human or an AI optimizes for their own well-being by giving themselves an extremely high intensity positive experience in the future, I don’t think that would be vastly morally outweighed by someone doing something similar but for altruistic reasons. Just because the happiness arises from a selfish motive seems like no reason, by itself, to disvalue it from a utilitarian perspective.
As a consequence, I simply do not agree with the intuition that economic consumption is a rounding error compared to the much smaller fraction of resources spent on altruistic purposes.
I think the “maybe AIs/humans will be selfish and/or not morally thoughtful” argument mostly just hits both unaligned AIs and humans equally hard such that it just gets normalized out. And then the question is more about how much you care about the altruistic and morally thoughtful subset.
I disagree because I don’t see why altruism will be more intense than selfishness from a total utilitarian perspective, in the sense you are describing. If an AI makes themselves happy for selfish reasons, that should matter just as much as an AI creating another AI to make them happy.
Now again, you could just think that AIs aren’t likely to be conscious, or aren’t likely to be motivated to make themselves happy in any sort of selfish sense. And so an unaligned world could be devoid of extremely optimized utilitarian value. But this argument was also addressed at length in my post, and I don’t know what your counterargument is to it.
It’s possible we’re using these words differently, but I guess I’m not sure why you’re downplaying the value of economic consumption here
Ah, sorry, I was referring to the process of AI labor being used to produce the economic output not itself having much total moral value. I thought you were arguing that aligned AIs being used to produce goods would be where most value is coming from, because of the vast numbers of such AIs laboring relative to other entities. Sorry, by “from incidental economic consumption” I actually meant “incidentally (as a side effect of) economic consumption”. This is in response to things like:
Consequently, in a scenario where AIs are aligned with human preferences, the consciousness of AIs will likely be determined mainly by economic efficiency factors during production, rather than by moral considerations. To put it another way, the key factor influencing whether AIs are conscious in this scenario will be the relative efficiency of creating conscious AIs compared to unconscious ones for producing the goods and services demanded by future people. As these efficiency factors are likely to be similar in both aligned and unaligned scenarios, we are led to the conclusion that, from a total utilitarian standpoint, there is little moral difference between these two outcomes.
As far as the other thing you say, I still disagree, though for different (related) reasons:
As a consequence, I simply do not agree with the intuition that economic consumption is a rounding error compared to the much smaller fraction of resources spent on altruistic purposes.
I don’t agree with “much smaller”, and I think “rounding error” is reasonably likely to be accurate as far as the selfish preferences of currently existing humans or the AIs that seize control go. (These entities might (presumably altruistically) create entities which then selfishly satisfy their preferences, but that seems pretty different.)
My main counterargument is that selfish preferences will result in wildly fewer entities if such entities aren’t into (presumably altruistically) making more entities, and thus will be extremely inefficient. Of course, it’s possible that you have AIs with non-indexical preferences but which are de facto selfish in other ways.
E.g., for humans you have 10^10 beings which are probably radically inefficient at producing moral value. For AIs it’s less clear and depends heavily on how you operationalize selfishness.
I have a general view like “in the future, the main way you’ll get specific things that you might care about is via people trying specifically to make those things because optimization is extremely powerful”.
I’m probably not going to keep responding as I don’t think I’m comparatively advantaged in fleshing this out. And doing this in a comment section seems suboptimal. If this is anyone’s crux for working on AI safety though, consider contacting me and I’ll consider setting you up with someone who I think understands my views and would be willing to go through the relevant arguments with you. The same offer applies to you, Matthew, particularly if this is a crux, but I think we should use a medium other than EA Forum comments.
I thought you were arguing that aligned AIs being used to produce goods would be where most value is coming from, because of the vast numbers of such AIs laboring relative to other entities.
Admittedly I worded things poorly in that part, but the paragraph you quoted was intended to convey how consciousness is most likely to come about in AIs, rather than to say that the primary source of value in the world will come from AIs laboring for human consumption.
These are very subtly different points, and I’ll have to work on making my exposition here more clear in the future (including potentially re-writing that part of the essay).
E.g., for humans you have 10^10 beings which are probably radically inefficient at producing moral value. For AIs it’s less clear and depends heavily on how you operationalize selfishness.
Note that a small human population size is an independent argument here for thinking that AI alignment might not be optimal from a utilitarian perspective. I didn’t touch on this point in this essay because I thought it was already getting too complex and unwieldy as it was, but the idea here is pretty simple, and it seems you’ve already partly spelled out the argument. If AI alignment causes high per capita incomes (because it enriches humans with a small population size), then plausibly this is worse than having a far larger population of unaligned AIs who have lower per capita consumption, from a utilitarian point of view.
If AI alignment causes high per capita incomes (because it enriches humans with a small population size), then plausibly this is worse than having a far larger population of unaligned AIs who have lower per capita consumption, from a utilitarian point of view.
Both seem negligible relative to the expected amount of compute spent on optimized goodness, in my view.
Also, I’m not sold that there will be more AIs; it depends on pretty complex details about AI preferences. I think it’s likely AIs won’t have preferences for their own experiences given current training methods and will instead have preferences for causing certain outcomes.
Both seem negligible relative to the expected amount of compute spent on optimized goodness, in my view.
Both will presumably be forms of consumption, which could be in the form of compute spent on optimized goodness. You seem to think compute will only be used for optimized goodness for non-consumption purposes (which is why you care about the small fraction of resources spent on altruism) and I’m saying I don’t see a strong case for that.
Regardless, it doesn’t seem like we’re making progress here.
You have no obligation to reply, of course, but I think we’d achieve more progress if you clarified your argument in a concise format that explicitly outlines the assumptions and conclusion.
As far as I can gather, your argument seems to be a mix of assumptions about humans being more likely to optimize for goodness (why?), partly because they’re more inclined to reflect (why?), which will lead them to allocate more resources towards altruism rather than selfish consumption (why is that significant?). Without understanding how your argument connects to mine, it’s challenging to move forward on resolving our mutual disagreement.
My framing would be: it seems pretty wild to think that total utilitarian values would be better served by unaligned AIs (whose values we don’t know) rather than humans (where we know some are total utilitarians). In your taxonomy this would be “humans are more likely to optimize for goodness”.
Let’s make a toy model compatible with your position:
A short summary of my position is that unaligned AIs could be even more utilitarian than humans are, and this doesn’t seem particularly unlikely either given that (1) humans are largely not utilitarians themselves, (2) consciousness doesn’t seem special or rare, so it’s likely that unaligned AIs could care about it too, and (3) unaligned AIs will be trained on human data, so they’ll likely share our high-level concepts about morality even if not our exact preferences.
Let’s say that there are a million values that one could have with “humanity’s high-level concepts about morality”, one of which is “Rohin’s values”.
For (3), we’ll say that both unaligned AI values and human values are a subset sampled uniformly at random from these million values (all values in the subset weighted equally, for simplicity).
For (1), we’ll say that the sampled human values include “Rohin’s values”, but only as one element in the set of sampled human values.
I won’t make any special distinction about consciousness so (2) won’t matter.
In this toy model you’d expect aligned AI to put 1⁄1,000 weight on “Rohin’s values”, whereas unaligned AI puts 1⁄1,000,000 weight in expectation on “Rohin’s values” (if the unaligned AI has S values, then there’s an S/1,000,000 probability of it containing “Rohin’s values”, and it is weighted 1/S if present). So aligned AI looks a lot better.
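(Spelling out that arithmetic as a quick derivation, using the $S$ from the parenthetical above, i.e. the unaligned AI’s sampled value set:

$$\mathbb{E}[\text{weight on Rohin's values} \mid \text{unaligned AI}] = \Pr(\text{Rohin's values} \in S)\cdot\frac{1}{|S|} = \frac{|S|}{1{,}000{,}000}\cdot\frac{1}{|S|} = \frac{1}{1{,}000{,}000},$$

compared with exactly $1/1{,}000$ for the aligned AI, whose value set contains “Rohin’s values” by assumption.)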
More generally, ceteris paribus, keeping values intact prevents drift and so looks strongly positive from the point of view of the original values, relative to resampling values “from scratch”.
(Feel free to replace “Rohin’s values” with “utilitarianism” if you want to make the utilitarianism version of this argument.)
Imo basically everything that Ryan says in this comment thread is a counter-counterargument to a counterargument to this basic argument. E.g. someone might say “oh it doesn’t matter which values you’re optimizing for, all of the value is in the subjective experience of the AIs that are laboring to build new chips, not in the consumption of the new chips” and the rebuttal to that is “Value can be extremely dense in computation relative to the density of value from AIs used for economic activity (instead of for value).”
My framing would be: it seems pretty wild to think that total utilitarian values would be better served by unaligned AIs (whose values we don’t know) rather than humans (where we know some are total utilitarians).
I’m curious: Does your reaction here similarly apply to ordinary generational replacement as well?
Let me try to explain what I’m asking.
We have a set of humans who exist right now. We know that some of them are utilitarians. At least one of them shares “Rohin’s values”. Similar to unaligned AIs, we don’t know the values of the next generation of humans, although presumably they will continue to share our high-level moral concepts since they are human and will be raised in our culture. After the current generation of humans die, the next generation could have different moral values.
As far as I can tell, the situation with regards to the next generation of humans is analogous to unaligned AI in the basic sense I’ve just laid out (mirroring the part of your comment I quoted). So, in light of that, would you similarly say that it’s “pretty wild to think that total utilitarian values would be better served by a future generation of humans”?
One possible answer here: “I’m not very worried about generational replacement causing moral values to get worse since the next generation will still be human.” But if this is your answer, then you seem to be positing that our moral values are genetic and innate, rather than cultural, which is pretty bold, and presumably merits a defense. This position is IMO largely empirically ungrounded, although it depends on what you mean by “moral values”.
Another possible answer is: “No, I’m not worried about generational replacement because we’ve seen a lot of human generations already and we have lots of empirical data on how values change over time with humans. AI could be completely different.” This would be a reasonable response, but as a matter of empirical fact, utilitarianism did not really culturally exist 500 or 1000 years ago. This indicates that it’s plausibly quite fragile, in a similar way it might also be with AI. Of course, values drift more slowly with ordinary generational replacement compared to AI, but the phenomenon still seems roughly pretty similar. So perhaps you should care about ordinary value drift almost as much as you’d care about unaligned AIs.
If you do worry about generational value drift in the strong sense I’ve just described, I’d argue this should cause you to largely adopt something close to position (3) that I outlined in the post, i.e. the view that what matters is preserving the lives and preferences of people who currently exist (rather than the species of biological humans in the abstract).
To the extent that future generations would have pretty different values than me, like “the only glory is in war and it is your duty to enslave your foes”, along with the ability to enact their values on the reachable universe, in fact that would seem pretty bad to me.
However, I expect the correlation between my values and future generation values is higher than the correlation between my values and unaligned AI values, because I share a lot more background with future humans than with unaligned AI. (This doesn’t require values to be innate, values can be adaptive for many human cultures but not for AI cultures.) So I would be less worried about generational value drift (but not completely unworried).
In addition, this worry is tempered even more by the possibility that values/culture will be set much more deliberately in the nearish future, rather than drifting via culture, simply because with an intelligence explosion that becomes more possible to do than it is today.
If you do worry about generational value drift in the strong sense I’ve just described, I’d argue this should cause you to largely adopt something close to position (3) that I outlined in the post, i.e. the view that what matters is preserving the lives and preferences of people who currently exist (rather than the species of biological humans in the abstract).
Huh? I feel very confused about this, even if we grant the premise. Isn’t the primary implication of the premise to try to prevent generational value drift? Why am I only prioritizing people with similar values, instead of prioritizing all people who aren’t going to enact large-scale change? Why would the priority be on current people, instead of people with similar values (there are lots of future people who have more similar values to me than many current people)?
I expect the correlation between my values and future generation values is higher than the correlation between my values and unaligned AI values, because I share a lot more background with future humans than with unaligned AI.
To clarify, I think it’s a reasonable heuristic that, if you want to preserve the values of the present generation, you should try to minimize changes to the world and enforce some sort of stasis. This could include not building AI. However, I believe you may be glossing over the distinction between: (1) the values currently held by existing humans, and (2) a more cosmopolitan, utilitarian ethical value system.
We can imagine a wide variety of changes to the world that would result in vast changes to (1) without necessarily being bad according to (2). For example:
We could start doing genetic engineering of humans.
We could upload humans onto computers.
A human-level, but conscious, alien species could immigrate to Earth via a portal.
In each scenario, I agree with your intuition that “the correlation between my values and future humans is higher than the correlation between my values and X-values, because I share much more background with future humans than with X”, where X represents the forces at play in each scenario. However, I don’t think it’s clear that the resulting change to the world would be net negative from the perspective of an impartial, non-speciesist utilitarian framework.
In other words, while you’re introducing something less similar to us than future human generations in each scenario, it’s far from obvious whether the outcome will be relatively worse according to utilitarianism.
Based on your toy model, my guess is that your underlying intuition is something like, “The fact that a tiny fraction of humans are utilitarian is contingent. If we re-rolled the dice, and sampled from the space of all possible human values again (i.e., the set of values consistent with high-level human moral concepts), it’s very likely that <<1% of the world would be utilitarian, rather than the current (say) 1%.”
If this captures your view, my main response is that it seems to assume a much narrower and more fragile conception of “cosmopolitan utilitarian values” than the version I envision, and it’s not a moral perspective I currently find compelling.
Conversely, if you’re imagining a highly contingent, fragile form of utilitarianism that regards the world as far worse under a wide range of changes, then I’d argue we also shouldn’t expect future humans to robustly hold such values. This makes it harder to claim the problem of value drift is much worse for AI compared to other forms of drift, since both are simply ways the state of the world could change, which was the point of my previous comment.
I feel very confused about this, even if we grant the premise. Isn’t the primary implication of the premise to try to prevent generational value drift? Why am I only prioritizing people with similar values, instead of prioritizing all people who aren’t going to enact large-scale change?
I’m not sure I understand which part of the idea you’re confused about. The idea was simply:
Let’s say that your view is that generational value drift is very risky, because future generations could have much worse values than the ones you care about (relative to the current generation)
In that case, you should try to do what you can to stop generational value drift
One way of stopping generational value drift is to try to prevent the current generation of humans from dying, and/or having their preferences die out
This would look quite similar to the moral view in which you’re trying to protect the current generation of humans, which was the third moral view I discussed in the post.
Why would the priority be on current people, instead of people with similar values (there are lots of future people who have more similar values to me than many current people)?
The reason the priority would be on current people rather than those with similar values is that, by assumption, future generations will have different values due to value drift. Therefore, the ~best strategy to preserve current values would be to preserve existing people. This seems relatively straightforward to me, although one could certainly question the premise of the argument itself.
Let me know if any part of the simplified argument I’ve given remains unclear or confusing.
Based on your toy model, my guess is that your underlying intuition is something like, “The fact that a tiny fraction of humans are utilitarian is contingent. If we re-rolled the dice, and sampled from the space of all possible human values again (i.e., the set of values consistent with high-level human moral concepts), it’s very likely that <<1% of the world would be utilitarian, rather than the current (say) 1%.”
No, this was purely to show why, from the perspective of someone with values, re-rolling those values would seem bad, as opposed to keeping the values the same, all else equal. In any specific scenario, (a) all else won’t be equal, and (b) the actual amount of worry depends on the correlation between current values and re-rolled values.
The main reason I made utilitarianism a contingent aspect of human values in the toy model is because I thought that’s what you were arguing (e.g. when you say things like “humans are largely not utilitarians themselves”). I don’t have a strong view on this and I don’t think it really matters for the positions I take.
For example:
We could start doing genetic engineering of humans.
We could upload humans onto computers.
A human-level, but conscious, alien species could immigrate to Earth via a portal.
The first two seem broadly fine, because I still expect high correlation between values. (Partly because I think that cosmopolitan utilitarian-ish values aren’t fragile.)
The last one seems more worrying than human-level unaligned AI (more because we have less control over them) but less worrying than unaligned AI in general (since the aliens aren’t superintelligent).
Note I’ve barely thought about these scenarios, so I could easily imagine changing my mind significantly on these takes. (Though I’d be surprised if it got to the point where I thought it was comparable to unaligned AI, in how much the values could stop correlating with mine.)
One way of stopping generational value drift is to try to prevent the current generation of humans from dying, and/or having their preferences die out
It seems way better to simply try to spread your values? It’d be pretty wild if the EA field-builders said “the best way to build EA, taking into account the long-term future, is to prevent the current generation of humans from dying, because their preferences are most similar to ours”.
The main reason I made utilitarianism a contingent aspect of human values in the toy model is because I thought that’s what you were arguing (e.g. when you say things like “humans are largely not utilitarians themselves”).
I think there may have been a misunderstanding regarding the main point I was trying to convey. In my post, I fairly explicitly argued that the rough level of utilitarian values exhibited by humans is likely not very contingent, in the sense of being unusually high compared to other possibilities—and this was a crucial element of my thesis. This idea was particularly important for the section discussing whether unaligned AIs will be more or less utilitarian than humans.
When you quoted me saying “humans are largely not utilitarians themselves,” I intended this point to support the idea that our current rough level of utilitarianism is not contingent, rather than the opposite claim. In other words, I meant that the fact that humans are not highly utilitarian suggests that this level of utilitarianism is not unusual or contingent upon specific circumstances, and we might expect other intelligent beings, such as aliens or AIs, to exhibit similar, or even greater, levels of utilitarianism.
Compare to the hypothetical argument: humans aren’t very obsessed with building pyramids --> our current level of obsession with pyramid building is probably not unusual, in the sense that you might easily expect aliens/AIs to be similarly obsessed with building pyramids, or perhaps even more obsessed.
(This argument is analogous because pyramids are simple structures that lots of different civilizations would likely stumble upon. Similarly, I think “try to create lots of good conscious experiences” is also a fairly simple directive, if indeed aliens/AIs/whatever are actually conscious themselves.)
I don’t have a strong view on this and I don’t think it really matters for the positions I take.
I think the question of whether utilitarianism is contingent or not matters significantly for our disagreement, particularly if you are challenging my post or the thesis I presented in the first section. If you are very uncertain about whether utilitarianism is contingent in the sense that is relevant to this discussion, then I believe that aligns with one of the main points I made in that section of my post.
Specifically, I argued that the degree to which utilitarianism is contingent vs. common among a wide range of intelligent beings is highly uncertain and unclear, and this uncertainty is an important consideration when thinking about the values and behaviors of advanced AI systems from a utilitarian perspective. So, if you are expressing strong uncertainty on this matter, that seems to support one of my central claims in that part of the post.
(My view, as expressed in the post, is that unaligned AIs have highly unclear utilitarian value but there’s a plausible scenario where they are roughly net-neutral, and indeed I think there’s a plausible scenario where they are even more valuable than humans, from a utilitarian point of view.)
It seems way better to simply try to spread your values? It’d be pretty wild if the EA field-builders said “the best way to build EA, taking into account the long-term future, is to prevent the current generation of humans from dying, because their preferences are most similar to ours”.
I think this part of your comment plausibly confuses two separate points:
How to best further your own values
How to best further the values of the current generation.
I was arguing that trying to preserve the present generation of humans looks good according to (2), not (1). That said, to the extent that your values simply mirror the values of your generation, I don’t understand your argument for why trying to spread your values would be “way better” than trying to preserve the current generation. Perhaps you can elaborate?
Given my new understanding of the meaning of “contingent” here, I’d say my claims are:
I’m unsure about how contingent the development of utilitarianism in humans was. It seems quite plausible that it was not very historically contingent. I agree my toy model does not accurately capture my views on the contingency of total utilitarianism.
I’m also unsure how contingent it is for unaligned AI, but aggregating over my uncertainty suggests more contingent.
One way to think about this is to ask: why are any humans utilitarians? To the extent it’s for reasons that don’t apply to unaligned AI systems, I think you should feel like it is less likely for unaligned AI systems to be utilitarians. So e.g. if I thought humans were utilitarians primarily because it is simple to express in concepts that humans and AIs share, then I would agree with you. But in fact I feel like it is pretty important that humans feel pleasure and pain, and have empathy, to explain why some humans are utilitarians. (Mostly I think the “true explanation” will have to appeal to more than simplicity, and the additional features this “true explanation” will appeal to are very likely to differ between humans and AIs.)
Compare to the hypothetical argument: humans aren’t very obsessed with building pyramids --> our current level of obsession with pyramid building is probably not unusual, in the sense that you might easily expect aliens/AIs to be similarly obsessed with building pyramids, or perhaps even more obsessed.
Indeed I feel like AIs probably build fewer pyramids in expectation, for basically the same reason. (The concrete hypothesis I generated for why humans build pyramids was “maybe pyramids were especially easy to build historically”.)
General note: I want to note that my focus on AI alignment is not necessarily coming from a utilitarian perspective. I work on AI alignment because in expectation I think a world with aligned AI will better reflect “my values” (which I’m uncertain about and may not reflect utilitarianism) than a world with unaligned AI. But I’m happy to continue talking about the implications for utilitarians.
So e.g. if I thought humans were utilitarians primarily because it is simple to express in concepts that humans and AIs share, then I would agree with you. But in fact I feel like it is pretty important that humans feel pleasure and pain, and have empathy, to explain why some humans are utilitarians. (Mostly I think the “true explanation” will have to appeal to more than simplicity, and the additional features this “true explanation” will appeal to are very likely to differ between humans and AIs.)
Thanks for trying to better understand my views. I appreciate you clearly stating your reasoning in this comment, as it makes it easier for me to directly address your points and explain where I disagree.
You argued that feeling pleasure and pain, as well as having empathy, are important factors in explaining why some humans are utilitarians. You suggest that to the extent these reasons for being utilitarian don’t apply to unaligned AIs, we should expect it to be less likely for them to be utilitarians compared to humans.
However, a key part of the first section of my original post was about whether unaligned AIs are likely to be conscious—which for the purpose of this discussion, seems roughly equivalent to whether they will feel pleasure and pain. I concluded that unaligned AIs are likely to be conscious for several reasons:
Consciousness seems to be a fairly convergent function of intelligence, as evidenced by the fact that octopuses are widely accepted to be conscious despite sharing almost no homologous neural structures with humans. This suggests consciousness arises somewhat robustly in sufficiently sophisticated cognitive systems.
Leading theories of consciousness from philosophy and cognitive science don’t appear to predict that consciousness will be rare or unique to biological organisms. Instead, they tend to define consciousness in terms of information processing properties that AIs could plausibly share.
Unaligned AIs will likely be trained in environments quite similar to those that gave rise to human and animal consciousness—for instance, they will be trained on human cultural data and, in the case of robots, will interact with physical environments. The evolutionary and developmental pressures that gave rise to consciousness in biological organisms would thus plausibly apply to AIs as well.
So in short, I believe unaligned AIs are likely to feel pleasure and pain, for roughly the reasons I think humans and animals do. Their consciousness would not be an improbable or fragile outcome, but more likely a robust product of being a highly sophisticated intelligent agent trained in environments similar to our own.
I did not directly address whether unaligned AIs would have empathy, though I find this fairly likely as well. At the very least, I expect they would have cognitive empathy—the ability to model and predict the experiences of others—as this is clearly instrumentally useful. They may lack affective empathy, i.e. the ability to share the emotions of others, which I agree could be important here. But it’s notable that explicit utilitarianism seems, anecdotally, to be more common among people on the autism spectrum, who are characterized as having reduced affective empathy. This suggests affective empathy may not be strongly predictive of utilitarian motivations.
Let’s say you concede the above points and say: “OK I concede that unaligned AIs might be conscious. But that’s not at all assured. Unaligned AIs might only be 70% likely to be conscious, whereas I’m 100% certain that humans are conscious. So there’s still a huge gap between the expected value of unaligned AIs vs. humans under total utilitarianism, in a way that overwhelmingly favors humans.”
However, this line of argument would overlook the real possibility that unaligned AIs could be more conscious than humans, or have an even stronger tendency towards utilitarian motivations. This could be the case if, for instance, AIs are more cognitively sophisticated than humans or are more efficiently designed in a morally relevant sense. Given that the vast majority of humans do not seem to be highly motivated by utilitarian considerations, it doesn’t seem like an unlikely possibility that AIs could exceed our utilitarian inclinations. Nor does it seem particularly unlikely that their minds could have a higher density of moral value per unit of energy, or matter.
We could similarly examine this argument in the context of considering other potential large changes to the world, such as creating human emulations, genetically engineered humans, or bringing back Neanderthals from extinction. In each case, I do not think the (presumably small) probability that the entities we are adding to the world are not conscious constitutes a knockdown argument against the idea that they would add comparable utilitarian value to the world compared to humans. The main reason is because these entities could be even better by utilitarian lights than humans are.
Indeed I feel like AIs probably build fewer pyramids in expectation, for basically the same reason. (The concrete hypothesis I generated for why humans build pyramids was “maybe pyramids were especially easy to build historically”.)
This seems minor, but I think the relevant claim is whether AIs would build more pyramids going forward, compared to humans, rather than comparing to historical levels of pyramid construction among humans. If pyramids were easy to build historically, but this fact is no longer relevant, then that seems true now for both humans and AIs, into the foreseeable future. As a consequence it’s hard for me to see a strong reason for preferring humans over AIs if you cared about pyramid-maximization. By essentially the same arguments I gave above about utilitarianism, I don’t think there’s a strong argument for thinking that aligning AIs is good from the perspective of pyramid maximization.
General note: I want to note that my focus on AI alignment is not necessarily coming from a utilitarian perspective. I work on AI alignment because in expectation I think a world with aligned AI will better reflect “my values”
This makes sense to me, but it’s hard to say much about what’s good from the perspective of your values if I don’t know what those values are. I focused on total utilitarianism in the post because it’s probably the most influential moral theory in EA, and it’s the explicit theory used in Nick Bostrom’s influential article Astronomical Waste, and this post was partly intended as a reply to that article (see the last few paragraphs of the post).
This suggests affective empathy may not be strongly predictive of utilitarian motivations.
I can believe that if the population you are trying to predict for is just humans, almost all of whom have at least some affective empathy. But I’d feel pretty surprised if this were true in whatever distribution over unaligned AIs we’re imagining. In particular, I think if there’s no particular reason to expect affective empathy in unaligned AIs, then your prior on it being present should be near-zero (simply because there are lots of possible specific claims about unaligned AIs of about that complexity, most of which will be false). And I’d be surprised if “zero vs non-zero affective empathy” was not predictive of utilitarian motivations.
I definitely agree that AIs might feel pleasure and pain, though I’m less confident in it than you seem to be. It just seems like AI cognition could be very different from human cognition. For example, I would guess that pain/pleasure are important for learning in humans, but it seems like this is probably not true for AI systems in the current paradigm. (For gradient descent, the learning and the cognition happen separately—the AI cognition doesn’t even get the loss/reward equivalent as an input so cannot “experience” it. For in-context learning, it seems very unclear what the pain/pleasure equivalent would be.)
this line of argument would overlook the real possibility that unaligned AIs could [...] have an even stronger tendency towards utilitarian motivations.
I agree this is possible. But ultimately I’m not seeing any particularly strong reasons to expect this (and I feel like your arguments are mostly saying “nothing rules it out”). Whereas I do think there’s a strong reason to expect weaker tendencies: AIs will be different, and on average different implies fewer properties that humans have. So aggregating these I end up concluding that unaligned AIs will be less utilitarian in expectation.
(You make a bunch of arguments for why AIs might not be as different as we expect. I agree that if you haven’t thought about those arguments before you should probably reduce your expectation of how different AIs will be. But I still think they will be quite different.)
this line of argument would overlook the real possibility that unaligned AIs could be more conscious than humans,
I don’t see why it matters if AIs are more conscious than humans? I thought the relevant question we’re debating is whether they are more likely to be utilitarians. Maybe the argument is that if they are more conscious-in-the-sense-of-feeling-pleasure-and-pain they are more likely to be utilitarians? If so I might buy that but feel like it’s a weak effect.
As a consequence it’s hard for me to see a strong reason for preferring humans over AIs if you cared about pyramid-maximization.
Sure, but a big difference is that no human cares about pyramid-maximization, whereas some humans are utilitarians?
(Maybe some humans do care about pyramid-maximization? I’d need to learn more about those humans before I could have any guess about whether to prefer humans over AIs.)
Consciousness seems to be a fairly convergent function of intelligence
I would say “fairly convergent function of biologically evolved intelligence”. Evolution faced lots of constraints we don’t have in AI design. For example, cognition and learning had to be colocated in space and time (i.e. done in a single brain), whereas for AIs these can be (and are) separated. Seems very plausible that consciousness-in-the-sense-of-feeling-pleasure-and-pain is a solution needed under the former constraint but not the latter. (Maybe I’m at 20% chance that something in this vicinity is right, though that is a very made-up number.)
Here are a few (long, but high-level) comments I have before responding to a few specific points that I still disagree with:
I agree there are some weak reasons to think that humans are likely to be more utilitarian on average than unaligned AIs, for basically the reasons you talk about in your comment (I won’t express individual agreement with all the points you gave that I agree with, but you should know that I agree with many of them).
However, I do not yet see any strong reasons supporting your view. (The main argument seems to be: AIs will be different than us. You label this argument as strong but I think it is weak.) More generally, I think that if you’re making hugely consequential decisions on the basis of relatively weak intuitions (which is what I believe many effective altruists do in this context), you should be very cautious. The lack of robust evidence for your position seems sufficient, in my opinion, for the main thesis of my original post to hold. (I think I was pretty careful in my language not to overstate the main claims.)
I suspect you may have an intuition that unaligned AIs will be very alien-like in certain crucial respects, but I predict this intuition will ultimately prove to be mistaken. In contrast, I think the fact that these AIs will be trained on human-generated data and deliberately shaped by humans to fulfill human-like functions and to be human-compatible should be given substantial weight. These factors make it quite likely, in my view, that the resulting AI systems will exhibit utilitarian tendencies to a significant degree, even if they do not share the preferences of either their users or their creators (for instance, I would guess that GPT-4 is already more utilitarian than the average human, in a meaningful sense).
There is a strong selection pressure for AIs to display outward behaviors that are not overly alien-like. Indeed, the pressure seems to be for AIs to be inhumanly altruistic and kind in their actions. I am not persuaded by the idea that it’s probable for AIs to be entirely human-compatible on the surface while being completely alien underneath, even if we assume they do not share human preferences (e.g., the “shoggoth” meme).
I disagree with the characterization that my argument relies primarily on the notion that “you can’t rule out” the possibility of AIs being even more utilitarian than humans. In my previous comment, I pointed out that AIs could potentially have a higher density of moral value per unit of matter, and I believe there are straightforward reasons to expect this to be the case, as AIs could be optimized very efficiently in terms of physical space. This is not merely a “you can’t rule it out” type of argument, in my view.
Similarly, in the post, I pointed out that humans have many anti-utilitarian intuitions and it seems very plausible that AIs would not share (or would share fewer of) these intuitions. To give another example (although it was not prominent in the post), in a footnote I alluded to the idea that AIs might care more about reproduction than humans (who, by comparison, seem to want small population sizes with high per-capita incomes, rather than large population sizes with low per-capita incomes, as utilitarianism would recommend). This too does not seem like a mere “you cannot rule it out” argument to me, although I agree it is not the type of knockdown argument you’d expect if my thesis were stated way stronger than it actually was.
I think you may be giving humans too much credit for being slightly utilitarian. To the extent that there are indeed many humans who are genuinely obsessed with actively furthering utilitarian objectives, I agree that your argument would have more force. However, I think that this is not really what we actually observe in the real world to a large degree. I think it’s exaggerated at least; even within EA I think that’s somewhat rare.
I suspect there is a broader phenomenon at play here, whereby people (often those in the EA community) attribute a wide range of positive qualities to humans (such as the idea that our values converge upon reflection, or the idea that humans will get inherently kinder as they get wealthier) which, in my opinion, do not actually reflect the realities of the world we live in. These ideas seem (to me) to be routinely almost entirely disconnected from any empirical analysis of actual human behavior, and they sometimes appear to be more closely related to what the person making the claim wishes to be true in some kind of idealized, abstract sense (though I admit this sounds highly uncharitable).
My hypothesis is that this tendency can perhaps be explained by a deeply ingrained intuition that identifies the species boundary of “humans” as being very special, in the sense that virtually all moral value is seen as originating from within this boundary, sharply distinguishing it from anything outside this boundary, and leading to an inherent suspicion of non-human entities. This would explain, for example, why there is so much focus on “human values” (and comparatively little on drawing the relevant “X values” boundary along different lines), and why many people seem to believe that human emulations would be clearly preferable to de novo AI. I do not really share this intuition myself.
I can believe that if the population you are trying to predict for is just humans, almost all of whom have at least some affective empathy. But I’d feel pretty surprised if this were true in whatever distribution over unaligned AIs we’re imagining.
My basic thoughts here are: on the one hand we have real world data points which can perhaps relevantly inform the degree to which affective empathy actually predicts utilitarianism, and on the other hand we have an intuition that it should be predictive across beings of very different types. I think the real world data points should epistemically count for more than the intuitions? More generally, I think it is hard to argue about what might be true if real world data counts for less than intuitions.
Maybe the argument is that if they are more conscious-in-the-sense-of-feeling-pleasure-and-pain they are more likely to be utilitarians? If so I might buy that but feel like it’s a weak effect.
Isn’t this the effect you alluded to, when you named reasons why some humans are utilitarians?
In contrast, I think the fact that these AIs will be trained on human-generated data and deliberately shaped by humans to fulfill human-like functions and to be human-compatible should be given substantial weight.
… This seems to be saying that because we are aligning AI, they will be more utilitarian. But I thought we were discussing unaligned AI?
I agree that the fact we are aligning AI should make one more optimistic. Could you define what you mean by “unaligned AI”? It seems quite plausible that I will agree with your position, and think it amounts to something like “we were pretty successful with alignment”.
The lack of robust evidence for your position seems sufficient, in my opinion, for the main thesis of my original post to hold.
I agree with theses like “it tentatively appears that the normative value of alignment work is very uncertain, and plausibly approximately neutral, from a total utilitarian perspective”, and would go further and say that alignment work is plausibly negative from a total utilitarian perspective.
I disagree with the implied theses in statements like “I’m not very sympathetic to pausing or slowing down AI as a policy proposal.”
If you wrote a post that just said “look, we’re super uncertain about things, here’s your reminder that there are worlds in which alignment work is negative”, I’d be on board with it. But it feels like a motte-and-bailey to write a post that is clearly trying to cause the reader to feel a particular way about some policy, and then retreat to “well my main thesis was very weak and unobjectionable”.
Some more minor comments:
You label this argument as strong but I think it is weak
Well, I can believe it’s weak in some absolute sense. My claim is that it’s much stronger than all of the arguments you make put together.
There is a strong selection pressure for AIs to display outward behaviors that are not overly alien-like. Indeed, the pressure seems to be for AIs to be inhumanly altruistic and kind in their actions.
This is a pretty good example of something I’d call different! You even use the adjective “inhumanly”!
To the extent your argument is that this is strong evidence that the AIs will continue to be altruistic and kind, I think I disagree, though I’ve now learned that you are imagining lots of alignment work happening when making the unaligned AIs, so maybe I’d agree depending on the specific scenario you’re imagining.
I disagree with the characterization that my argument relies primarily on the notion that “you can’t rule out” the possibility of AIs being even more utilitarian than humans.
Sorry, I was being sloppy there. My actual claim is that your arguments either:
Don’t seem to bear on the question of whether AIs are more utilitarian than humans, OR
Don’t seem more compelling than the reversed versions of those arguments.
I pointed out that AIs could potentially have a higher density of moral value per unit of matter, and I believe there are straightforward reasons to expect this to be the case, as AIs could be optimized very efficiently in terms of physical space. This is not merely a “you can’t rule it out” type of argument, in my view.
I agree that there’s a positive reason to expect AIs to have a higher density of moral value per unit of matter. I don’t see how this has any (predictable) bearing on whether AIs will be more utilitarian than humans.
Similarly, in the post, I pointed out that humans have many anti-utilitarian intuitions and it seems very plausible that AIs would not share (or share fewer of) these intuitions.
Applying the reversal test:
Humans have utilitarian intuitions too, and it seems very plausible that AIs would not share (or share fewer of) these intuitions.
I don’t especially see why one of these is stronger than the other.
(And if the AI doesn’t share any of the utilitarian intuitions, it doesn’t matter at all if it also doesn’t share the anti-utilitarian intuitions; either way it still won’t be a utilitarian.)
To give another example [...] AIs might care more about reproduction than humans (who by comparison, seem to want to have small population sizes with high per-capita incomes, rather than large population sizes with low per capita incomes as utilitarianism would recommend)
Applying the reversal test:
AIs might care less about reproduction than humans (a large majority of whom will reproduce at least once in their life).
Personally I find the reversed version more compelling.
I think you may be giving humans too much credit for being slightly utilitarian. [...] people (often those in the EA community) attribute a wide range of positive qualities to humans [...]
Fwiw my reasoning here mostly doesn’t depend on facts about humans other than binary questions like “do humans ever display property X”, since by and large my argument is “there is quite a strong chance that unaligned AIs do not have property X at all”.
Though again this might change depending on what exactly you mean by “unaligned AI”.
(I don’t necessarily disagree with your hypotheses as applied to the broader world—they sound plausible, though it feels somewhat in conflict with the fact that EAs care about AI consciousness a decent bit—I just disagree with them as applied to me in this particular comment thread.)
I think the real world data points should epistemically count for more than the intuitions?
I don’t buy it. The “real world data points” procedure here seems to be: take two high-level concepts (e.g. affective empathy, proclivity towards utilitarianism), draw a line between them, extrapolate way way out of distribution. I think this procedure would have a terrible track record when applied without the benefit of hindsight.
I expect my arguments based on intuitions would also have a pretty bad track record, but I do think they’d outperform the procedure above.
More generally, I think it is hard to argue about what might be true if real world data counts for less than intuitions.
Yup, this is an unfortunate fact about domains where you don’t get useful real world data. That doesn’t mean you should start using useless real world data.
Isn’t this the effect you alluded to, when you named reasons why some humans are utilitarians?
Yes, but I think the relevance is mostly whether or not the being feels pleasure or pain at all, rather than the magnitude with which it feels it. (Probably the magnitude matters somewhat, but not very much.)
Among humans I would weakly predict the opposite effect, that people with less pleasure-pain salience are more likely to be utilitarian (mostly due to a predicted anticorrelation with logical thinking / decoupling / systemizing nature).
Just a quick reply (I might reply more in-depth later but this is possibly the most important point):
I agree that the fact we are aligning AI should make one more optimistic. Could you define what you mean by “unaligned AI”? It seems quite plausible that I will agree with your position, and think it amounts to something like “we were pretty successful with alignment”.
In my post I talked about the “default” alternative to doing lots of alignment research. Do you think that if AI alignment researchers quit tomorrow, engineers would stop doing RLHF etc. to their models? That they wouldn’t train their AIs to exhibit human-like behaviors, or to be human-compatible?
It’s possible my language was misleading by giving an image of what unaligned AI looks like that isn’t actually a realistic “default” in any scenario. But when I talk about unaligned AI, I’m simply talking about AI that doesn’t share the preferences of humans (either its creator or the user). Crucially, humans are routinely misaligned in this sense. For example, employees don’t share the exact preferences of their employer (otherwise they’d have no need for a significant wage). Yet employees are still typically docile, human-compatible, and assimilated to the overall culture.
This is largely the picture I think we should imagine when we think about the “default” unaligned alternative, rather than imagining that humans will create something far more alien, far less docile, and therefore something with far less economic value.
(As an aside, I thought this distinction wasn’t worth making because I thought most readers would have already strongly internalized the idea that RLHF isn’t “real alignment work”. I suspect I was mistaken, and probably confused a ton of people.)
I disagree with the implied theses in statements like “I’m not very sympathetic to pausing or slowing down AI as a policy proposal.”
This overlooks my arguments in section 3, which were absolutely critical to forming my opinion here. My argument here can be summarized as follows:
The utilitarian arguments for technical alignment research seem weak, because AIs are likely to be conscious like us, and also share human moral concepts.
By contrast, technical alignment research seems clearly valuable if you care about humans who currently exist, since AIs will presumably be directly aligned to them.
However, pausing AI for alignment reasons seems pretty bad for humans who currently exist (under plausible models of the tradeoff).
I have sympathies to both utilitarianism and the view that current humans matter. The weak considerations favoring pausing AI on the utilitarian side don’t outweigh the relatively much stronger and clearer arguments against pausing for currently existing humans.
The last bullet point is a statement about my values. It is not a thesis independent of my values. I feel this was pretty explicit in the post.
If you wrote a post that just said “look, we’re super uncertain about things, here’s your reminder that there are worlds in which alignment work is negative”, I’d be on board with it. But it feels like a motte-and-bailey to write a post that is clearly trying to cause the reader to feel a particular way about some policy, and then retreat to “well my main thesis was very weak and unobjectionable”.
I’m not just saying “there are worlds in which alignment work is negative”. I’m saying that it’s fairly plausible. I’d say greater than 30% probability. Maybe higher than 40%. This seems perfectly sufficient to establish the position, which I argued explicitly, that the alternative position is “fairly weak”.
It would be different if I was saying “look out, there’s a 10% chance you could be wrong”. I’d agree that claim would be way less interesting.
I don’t think what I said resembles a motte-and-bailey, and I suspect you just misunderstood me.
[ETA:
Well, I can believe it’s weak in some absolute sense. My claim is that it’s much stronger than all of the arguments you make put together.
Part of me feels like this statement is an acknowledgement that you fundamentally agree with me. You think the argument in favor of unaligned AIs being less utilitarian than humans is weak? Wasn’t that my thesis? If you started at a prior of 50%, and then moved to 65% because of a weak argument, and then moved back to 60% because of my argument, then isn’t that completely consistent with essentially every single thing I said? OK, you felt I was saying the probability is like 50%. But 60% really isn’t far off, and it’s consistent with what I wrote (I mentioned “weak reasons” in the post). Perhaps like 80% of the reason why you disagree here is because you think my thesis was something else.
More generally I get the sense that you keep misinterpreting me as saying things that are different or stronger than what I intended. That’s reasonable given that this is a complicated and extremely nuanced topic. I’ve tried to express areas of agreement when possible, both in the post and in reply to you. But maybe you have background reasons to expect me to argue a very strong thesis about utilitarianism. As a personal statement, I’d encourage you to try to read me as saying something closer to the literal meaning of what I’m saying, rather than trying to infer what I actually believe underneath the surface.]
I have lots of other disagreements with the rest of what you wrote, although I probably won’t get around to addressing them. I mostly think we just disagree on some basic intuitions about how alien-like default unaligned AIs will actually be in the relevant senses. I also disagree with your reversal tests, because I think they’re not actually symmetric, and I think you’re omitting the best arguments for thinking that they’re asymmetric.
I was arguing that trying to preserve the present generation of humans looks good according to (2), not (1).
I was always thinking about (1), since that seems like the relevant thing. When I agreed with you that generational value drift seems worrying, that’s because it seems bad by (1). I did not mean to imply that I should act to maximize (2). I agree that if you want to act to maximize (2) then you should probably focus on preserving the current generation.
In my post, I fairly explicitly argued that the rough level of utilitarian values exhibited by humans is likely not very contingent, in the sense of being unusually high compared to other possibilities—and this was a crucial element of my thesis. This idea was particularly important for the section discussing whether unaligned AIs will be more or less utilitarian than humans.
Fwiw, I reread the post again and still failed to find this idea in it, and am still pretty confused at what argument you are trying to make.
At this point I think we’re clearly failing to communicate with each other, so I’m probably going to bow out, sorry.
Fwiw, I reread the post again and still failed to find this idea in it
I’m baffled by your statement here. What did you think I was arguing when I discussed whether “aligned AIs are more likely to have a preference for creating new conscious entities, furthering utilitarian objectives”? The conclusion of that section was that aligned AIs are plausibly not more likely to have such a preference, and therefore, human utilitarian preferences here are not “unusually high compared to other possibilities” (the relevant alternative possibility here being unaligned AI).
This was a central part of my post that I discussed at length. The idea that unaligned AIs might be similarly utilitarian or even more so, compared to humans, was a crucial part of my argument. If indeed unaligned AIs are very likely to be less utilitarian than humans, then much of my argument in the first section collapses, which I explicitly acknowledged.
I consider your statement here to be a valuable data point about how clear my writing was and how likely I am to get my ideas across to others who read the post. That said, I believe I discussed this point more-or-less thoroughly.
ETA: Claude 3’s summary of this argument in my post:
The post argued that the level of utilitarian values exhibited by humans is likely not unusually high compared to other possibilities, such as those of unaligned AIs. This argument was made in the context of discussing whether aligned AIs are more likely to have a preference for creating new conscious entities, thereby furthering utilitarian objectives.
The author presented several points to support this argument:
Only a small fraction of humans are total utilitarians, and most humans do not regularly express strong preferences for adding new conscious entities to the universe.
Some human moral intuitions directly conflict with utilitarian recommendations, such as the preference for habitat preservation over intervention to improve wild animal welfare.
Unaligned AI preferences are unlikely to be completely alien or random compared to human preferences if the AIs are trained on human data. By sharing moral concepts with humans, unaligned AIs could potentially be more utilitarian than humans, given that human moral preferences are a mix of utilitarian and anti-utilitarian intuitions.
Even in an aligned AI scenario, the consciousness of AIs will likely be determined mainly by economic efficiency factors during production, rather than by moral considerations.
The author concluded that these points undermine the idea that unaligned AI moral preferences will be clearly less utilitarian than the moral preferences of most humans, which are already not very utilitarian. This suggests that the level of utilitarian values exhibited by humans is likely not unusually high compared to other possibilities, such as those of unaligned AIs.
I agree it’s clear that you claim that unaligned AIs are plausibly comparably utilitarian as humans, maybe more.
What I didn’t find was discussion of how contingent utilitarianism is in humans.
Though actually rereading your comment (which I should have done in addition to reading the post) I realize I completely misunderstood what you meant by “contingent”, which explains why I didn’t find it in the post (I thought of it as meaning “historically contingent”). Sorry for the misunderstanding.
If I had to pick a second consideration I’d go with:
After millions of years of life (or much more) and massive amounts of cognitive enhancement, the way post-humans might act isn’t clearly well predicted by just looking at their current behavior.
Again, I’d like to stress that my claim is:
Also, to be clear, none of the considerations I listed make a clear and strong case for unaligned AI being less morally valuable, but they do make the case that the relevant argument here is very different from the considerations you seem to be listing. In particular, I think value won’t be coming from incidental consumption.
One additional meta-level point which I think is important: I think that existing writeups of why human control would have more moral value than unaligned AI control from a longtermist perspective are relatively weak and often specific writeups are highly flawed. (For some discussion of flaws, see this sequence.)
I just think that this write-up misses what seem to me to be key considerations, I’m not claiming that existing work settles the question or is even robust at all.
And it’s somewhat surprising and embarrassing that this is the state of the current work given that longtermism is reasonably common and arguments for working on AI x-risk from a longtermist perspective are also common.
It seems odd to me that you don’t focus almost entirely on this sort of argument when considering total utilitarian style arguments.
I feel I did consider this argument in detail, including several considerations that touch on the arguments you gave. However, I primarily wanted to survey the main points that people have previously given me, rather than focusing heavily on a small set of arguments that someone like you might consider to be the strongest ones. And I agree that I may have missed some important considerations in this post.
In regards to your specific points, I generally find your arguments underspecified because, while reading them, it is difficult for me to identify a concrete mechanism for why alignment with human preferences creates astronomically more value from a total utilitarian perspective relative to the alternative. As it is, you seem to have a lot of confidence that human values, upon reflection, would converge onto values that would be far better in expectation than the alternative. However, I’m not a moral realist, and by comparison to you, I think I don’t have much faith in the value of moral reflection, absent additional arguments.
My speculative guess is that part of this argument comes from simply defining “human preferences” as aligned with utilitarian objectives. For example, you seem to think that aligning AIs would help empower the fraction of humans who are utilitarians, or at least would become utilitarians on reflection. But as I argued in the post, the vast majority of humans are not total utilitarians, and indeed, anti-total utilitarian moral intuitions are quite common among humans, which would act against the creation of large amounts of utilitarian value in an aligned scenario.
These are my general thoughts on what you wrote, although I admit I have not responded in detail to any of your specific arguments, and I think you did reveal a genuine blindspot in the arguments I gave. I may write a comment at some future point that considers your comment more thoroughly.
As it is, you seem to have a lot of confidence that human values, upon reflection, would converge onto values that would be far better in expectation than the alternative. However, I’m not a moral realist, and by comparison to you, I think I don’t have much faith in the value of moral reflection, absent additional arguments.
I’m assuming some level of moral quasi-realism: I care about what I would think is good after reflecting on the situation for a long time and becoming much smarter.
For more on this perspective consider: this post by Holden. I think there is a bunch of other discussion elsewhere from Paul Christiano and Joe Carlsmith, but I can’t find posts immediately.
I think the case for being a moral quasi-realist is very strong and depends on very few claims.
My speculative guess is that part of this argument comes from simply defining “human preferences” as aligned with utilitarian objectives.
Not exactly, I’m just defining “the good” as something like “what I would think was good after following a good reflection process which doesn’t go off the rails in an intuitive sense”. (Aka moral quasi-realism.)
I’m not certain that after reflection I would end up at something which is that well described as utilitarian. Something vaguely in the ball park seems plausible though.
But as I argued in the post, the vast majority of humans are not total utilitarians, and indeed, anti-total utilitarian moral intuitions are quite common among humans, which would act against the creation of large amounts of utilitarian value in an aligned scenario
A reasonable fraction of my view is that many of the moral intuitions of humans might mostly be biases which end up not being that important if people decide to thoughtfully reflect. I predict that humans converge more after reflection and becoming much, much smarter. I don’t know exactly what humans converge towards, but it seems likely that I converge toward a cluster which benefits from copious amounts of resources and which has reasonable support among the things which humans think on reflection.
I’m assuming some level of moral quasi-realism: I care about what I would think is good after reflecting on the situation for a long time and becoming much smarter.
Depending on the structure of this meta-ethical view, I feel like you should be relatively happy to let unaligned AIs do the reflection for you in many plausible circumstances. The intuition here is that if you are happy to defer your reflection to other humans, such as future humans who will replace us in the future, then you should potentially also be open to deferring your reflection to a large range of potential other beings, including AIs who might initially not share human preferences, but would converge to the same ethical views that we’d converge to.
In other words, in contrast to a hardcore moral anti-realist (such as myself) who doesn’t value moral reflection much, you seem happier to defer this reflection process to beings who don’t share your consumption or current ethical preferences. But you seem to think it’s OK to defer to humans but not unaligned AIs, implicitly drawing a moral distinction on the basis of species. Whereas I’m concerned that if I die and get replaced by either humans or AIs, my goals will not be furthered, including in the very long-run.
What is it about the human species exactly that makes you happy to defer your values to other members of that species?
Not exactly, I’m just defining “the good” as something like “what I would think was good after following a good reflection process which doesn’t go off the rails in an intuitive sense”. (Aka moral quasi-realism.)
I think I have a difficult time fully understanding your view because I think it’s a little underspecified. In my view, there seem to be a vast number of different ways that one can “reflect”, and intuitively I don’t think all (or even most) of these processes will converge to roughly the same place. Can you give me intuitions for why you hold this meta-ethical view? Perhaps you can also be more precise about what you see as the central claims of moral quasi-realism.
Depending on the structure of this meta-ethical view, I feel like you should be relatively happy to let unaligned AIs do the reflection for you in many plausible circumstances.
I’m certainly happy if we get to the same place. I think I feel less good about the view the more contingent it is.
In other words, in contrast to a hardcore moral anti-realist (such as myself) who doesn’t value moral reflection much, you seem happier to defer this reflection process to beings who don’t share your consumption or current ethical preferences. But you seem to think it’s OK to defer to humans but not unaligned AIs, implicitly drawing a moral distinction on the basis of species.
I mean, I certainly think you lose some value from it being other humans. My guess is that, from my perspective, you lose something more like 5-20x of the value with other humans (rather than something like 1000x), and that the corresponding loss is more like 20-100x for unaligned AI.
I think I have a difficult time fully understanding your view because I think it’s a little underspecified. In my view, there seem to be a vast number of different ways that one can “reflect”, and intuitively I don’t think all (or even most) of these processes will converge to roughly the same place. Can you give me intuitions for why you hold this meta-ethical view? Perhaps you can also be more precise about what you see as the central claims of moral quasi-realism.
I think my views about what I converge to are distinct from my views on quasi-realism. I think a weak notion of quasi-realism is extremely intuitive: you would do better things if you thought more about what would be good (at least relative to the current returns; eventually returns to thinking would saturate), because, e.g., there are interesting empirical facts (where did my current biases come from evolutionarily? what are brains doing?). I’m not claiming that quasi-realism implies my conclusions, just that it’s an important part of where I’m coming from.
I separately think that reflection and getting smarter are likely to cause convergence due to a variety of broad intuitions and some vague historical analysis. I’m not hugely confident in this, but I’m confident enough to think the expected value looks pretty juicy.
I disagree with quite a few points in the total utilitarianism section, but zooming out slightly, I think that total utilitarians should generally still support alignment work (and potentially an AI pause/slow down) to preserve option value. If it turns out that AIs are moral patients and that it would be good for them to spread into the universe optimising for values that don’t look particularly human, we can still (in principle) do that. This is compatible with thinking that alignment from a total utilitarian perspective is ~neutral—but it’s not clear that you agree with this from the post.
I think the problem with this framing is that it privileges a particular way of thinking about option value that prioritizes the values of the human species in a way I find arbitrary.
In my opinion, the choice before the current generation is not whether to delay replacement by a different form of life, but rather to choose our method of replacement: we can either die from old age over decades and be replaced by the next generation of humans, or we can develop advanced AI and risk being replaced by them, but also potentially live much longer and empower our current generation’s values.
Deciding to delay AI is not a neutral choice. It only really looks like we’re preserving option value in the first case if you think there’s something great about the values of the human species. But then if you think that the human species is special, I think these arguments are adequately considered in the first and second sections of my post.
Hmm, maybe I’ll try to clarify what I think you’re arguing as I predict it will be confusing to caleb and bystanders. The way I would have put this is:
It only preserves option value from your perspective to the extent that you think humanity overall[1] will have a similar perspective as you and will make reasonable choices. Matthew seems to think that humanity will use ~all of the resources on (directly worthless?) economic consumption such that the main source of value (from a longtermist, scope sensitive, utilitarian-ish perspective) will be from the minds of the laborers that produce the goods for this consumption. Thus, there isn’t any option value as almost all the action is coming from indirect value rather than from people trying to produce value.
I disagree strongly with Matthew on this view about where the value will come from in expectation insofar as that is an accurate interpretation. (I elaborate on why in this comment.) I’m not certain about this being a correct interpretation of Matthew’s views, but it at least seems heavily implied by:
Consequently, in a scenario where AIs are aligned with human preferences, the consciousness of AIs will likely be determined mainly by economic efficiency factors during production, rather than by moral considerations. To put it another way, the key factor influencing whether AIs are conscious in this scenario will be the relative efficiency of creating conscious AIs compared to unconscious ones for producing the goods and services demanded by future people. As these efficiency factors are likely to be similar in both aligned and unaligned scenarios, we are led to the conclusion that, from a total utilitarian standpoint, there is little moral difference between these two outcomes.
It only preserves option value from your perspective to the extent that you think humanity will have a similar perspective as you and will make reasonable choices. Matthew seems to think that humanity will use ~all of the resources on economic consumption such that the main source of value (from a longtermist, scope sensitive, utilitarian-ish perspective) will be from the minds of the laborers that produce the goods for this consumption.
I agree with your first sentence as a summary of my view.
The second sentence is also roughly accurate [ETA: see comment below for why I am no longer endorsing this], but I do not consider it to be a complete summary of the argument I gave in the post. I gave additional reasons for thinking that the values of the human species are not special from a total utilitarian perspective. This included the point that humans are largely not utilitarians, and in fact frequently have intuitions that would act against the recommendations of utilitarianism if their preferences were empowered. I elaborated substantially on this point in the post.
On second thought, regarding the second sentence, I think I want to take back my endorsement. I don’t necessarily think the main source of value will come from the minds of AIs who labor, although I find this idea plausible depending on the exact scenario. I don’t really think I have a strong opinion about this question, and I didn’t see my argument as resting on it. And so I’d really prefer it not be seen as part of my argument (and I did not generally try to argue this in the post).
Really, my main point was that I don’t actually see much of a difference between AI consumption and human consumption, from a utilitarian perspective. Yet, when thinking about what has moral value in the world, I think focusing on consumption in both cases is generally correct. This includes considerations related to incidental utility that comes as a byproduct from consumption, but the “incidental” part here is not a core part of what I’m arguing.
>I think the problem with this framing is that it privileges a particular way of thinking about option value that prioritizes the values of the human species in a way I find arbitrary.
I think it’s in the same category as “don’t do crime for utilitarian reasons”? Like, if you are not seeing that (trans-)humans are preferable, you are at odds with lots of people who do see it (and, like, with me personally). Not moustache-twirling levels of villainy, but you know… you need to be careful with this stuff. You probably don’t want to be the part of EA that is literally plotting the downfall of human civilization.
I feel like this goes against the principle of not leaving your footprint on the future, no?
Like, a large part of what I believe to be the danger with AI is that we don’t have any reflective framework for morality. I also don’t believe the standard path for AGI is one of moral reflection. To me, this suggests we would leave the value of the future up to market dynamics, and that doesn’t seem good given all the traps in such a situation (Moloch, for example).
If we want a shot at a long reflection or similar, I don’t think full sending AGI is the best thing to do.
I feel like this goes against the principle of not leaving your footprint on the future, no?
A major reason that I got into longtermism in the first place is that I’m quite interested in “leaving a footprint” on the future (albeit a good one). In other words, I’m not sure I understand the intuition for why we wouldn’t deliberately try to leave our footprints on the future, if we want to have an impact. But perhaps I’m misunderstanding the nature of this metaphor. Can you elaborate?
I also don’t believe the standard path for AGI is one of moral reflection.
I think it’s worth being more specific about why you think AGI will not do moral reflection? In the post, I carefully consider arguments about whether future AIs will be alien-like and have morally arbitrary goals, in a respect that you seem to be imagining. I think it’s possible that I addressed some of the intuitions behind your argument here.
I guess I felt that a lot of the post was arguing under a frame of utilitarianism which is generally fair I think. When it comes to “not leaving a footprint on the future” what I’m referring to is epistemic humility about the correct moral theories. I’m quite uncertain myself about what is correct when it comes to morality with extra weight on utilitarianism. From this, we should be worried about being wrong and therefore try our best to not lock in whatever we’re currently thinking. (The classic example being if we did this 200 years ago we might still have slaves in the future)
I’m a believer that virtue ethics and deontology are imperfect information approximations of utilitarianism. Like Kant’s categorical imperative is a way of looking at the long-term future and asking, how do we optimise society to be the best that it can be?
I guess a core crux here for me is that it seems like you’re arguing a bit for naive utilitarianism here. I actually don’t really believe that the AGI will follow the VNM axioms, that is, be fully rational. I think it will be a dynamic internal system weighing the different things it wants, and that it won’t fully maximise utility because it won’t be internally aligned. Therefore we need to get it right or we’re going to have weird and idiosyncratic values that are not optimal for the long-term future of the world.
I hope that makes sense, I liked your post in general.
I see. After briefly skimming that post, I think I pretty strongly disagree with just about every major point in it (along with many of its empirical background assumptions), although admittedly I did not spend much time reading through it. If someone thinks that post provides good reasons to doubt the arguments in my post, I’d likely be happy to discuss the specific ideas within it in more detail.
Executive summary: From a total utilitarian perspective, the value of AI alignment work is unclear and plausibly neutral, while from a human preservationist or near-termist view, alignment is clearly valuable but significantly delaying AI is more questionable.
Key points:
Unaligned AIs may be just as likely to be conscious and create moral value as aligned AIs, so alignment work is not clearly valuable from a total utilitarian view.
Human moral preferences are a mix of utilitarian and anti-utilitarian intuitions, so empowering them may not be better than an unaligned AI scenario by utilitarian lights.
From a human preservationist view, alignment is clearly valuable since it would help ensure human survival, but this view rests on speciesist foundations.
A near-termist view focused on benefits to people alive today would value alignment but not significantly delaying AI, since that could deprive people of potentially massive gains in wealth and longevity.
Arguments for delaying AI to reduce existential risk often conflate the risk of human extinction with the risk of human replacement by AIs, which are distinct from a utilitarian perspective.
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.
Great post, Matthew! Misaligned AI not being clearly bad is one of the reasons why I have been moving away from AI safety to animal welfare as the most promising cause area. In my mind, advanced AI would ideally be aligned with expected total hedonistic utilitarianism.
Thank you for writing this. I broadly agree with the perspective and find it frustrating how often it’s dismissed based on (what seem to me) somewhat-shaky assumptions.
A few thoughts, mainly on the section on total utilitarianism:
1. Regarding why people tend to assume unaligned AIs won’t innately have any value, or won’t be conscious: my impression is this is largely due to the “intelligence as optimisation process” model that Eliezer advanced. Specifically, that in this model, the key ability humans have that enables us to be so successful is our ability to optimise for goals; whereas mind features we like, such as consciousness, joy, curiosity, friendship, and so on are largely seen as being outside this optimisation ability, and are instead the terminal values we optimise for. (Also that none of the technology we have so far built has really affected this core optimisation ability, so once we do finally build an artificial optimiser it could very well quickly become much more powerful than us, since unlike us it might be able to improve its optimisation ability.)
I think people who buy this model will tend not to be moved much by observations like consciousness having evolved multiple times, as they’d think: sure, but why should I expect that consciousness is part of the optimisation process bit of our minds, specifically? Ditto for other mind features, and also for predictions that AIs will be far more varied than humans — there just isn’t much scope for variety or detail in the process of doing optimisation. You use the phrase “AI civilisation” a few times; my sense is that most people who expect disaster from unaligned AI would say their vision of this outcome is not well-described as a “civilisation” at all.
2. I agree with you that if the above model is wrong (which I expect it is), and AIs really will be conscious, varied, and form a civilisation rather than being a unified unconscious optimiser, then there is some reason to think their consumption will amount to something like “conscious preference satisfaction”, since a big split between how they function when producing vs consuming seems unlikely (even though it’s logically possible).
I’m a bit surprised though by your focus (as you’ve elaborated on in the comments) on consumption rather than production. For one thing, I’d expect production to amount to a far greater fraction of AIs’ experience-time than consumption, I guess on the basis that production enables more subsequent production (or consumption), whereas consumption doesn’t, it just burns resources.
Also, you mentioned concerns about factory farms and wild animal suffering. These seem to me describable as “experiences during production” — do you not have similar concerns regarding AIs’ productive activities? Admittedly pain might not be very useful for AIs, as plausibly if you’re smart enough to see the effects on your survival of different actions, then you don’t need such a crude motivator — even humans trying very hard to achieve goals seem to mostly avoid pain while doing so, rather than using it to motivate themselves. But emotions like fear and stress seem to me plausibly useful for smart minds, and I’d not be surprised if they were common in an AI civilisation in a world where the “intelligence as optimisation process” model is not true. Do you disagree, or do you just think they won’t spend much time producing relative to consuming, or something else?
(To be clear, I agree this second concern has very little relation to what’s usually termed “AI alignment”, but it’s the concern re: an AI future that I find most convincing, and I’m curious on your thoughts on it in the context of the total utilitarian perspective.)
Thank you for writing this. I broadly agree with the perspective and find it frustrating how often it’s dismissed based on (what seem to me) somewhat-shaky assumptions.
Thanks. I agree with what you have to say about effective altruists dismissing this perspective based on what seem to be shaky assumptions. To be a bit blunt, I generally find that, while effective altruists are often open to many types of criticism, the community is still fairly reluctant to engage deeply with some ideas that challenge their foundational assumptions. This is one of those ideas.
But I’m happy to see this post is receiving net-positive upvotes, despite the disagreement. :)
This is really useful in that it examines critically what I think of as the ‘orthodox view’: alignment is good because it ‘allows humans to preserve control over the future’. This view feels fundamental but underexamined, in much of the EA/alignment world (with notable exceptions: Rich Sutton, Robin Hanson, Joscha Bach who seem species-agnostic; Paul Christiano has also fleshed out his position e.g. this part of a Dwarkesh Patel podcast).
A couple of points I wasn’t sure I understood/agreed with FWIW:
a) A relatively minor one is
To the extent you think that future AIs would not be capable of creating massive wealth for humans, or extending their lifespans, this largely implies that you think future AIs will not be very powerful, smart, or productive. Thus, by the same argument, we should also not think future AIs will be capable of making humanity go extinct.
I’m not sure about this symmetry—I can imagine an LLM (~GPT-5 class) integrated into a nuclear/military decision-making system that could cause catastrophic death/suffering (millions/billions of immediate/secondary deaths, massive technological setback, albeit not literal extinction). I’m assuming the point doesn’t hinge on literal extinction.
b) Regarding calebp’s comment on option value: I agree most option value discussion (there doesn’t seem to be much outside Bostrom and the s-risk discourse) assumes continuation of the human species, but I wonder if there is room for a more cosmopolitan framing: ‘Humans are our only example of an advanced technological civilisation, one that might be on the verge of a step change in its evolution. The impact of this evolutionary step-change on the future can arguably be (on balance) good (definition of “good” tbd). The “option value” we are trying to preserve is less the existence of humans per se, but rather the possibility of such an evolution happening at all. Put another way, we don’t want to prematurely introduce an unaligned or misaligned AI (perhaps a weak one) that causes extinction, a bad lock-in, or prevents emergence of more capable AIs that could have achieved this evolutionary transition.’
In other words, the option value is not over the number of human lives (or economic value) but rather over the possible trajectories of the future...this does not seem particularly species-specific. It just says that we should be careful not to throw these futures away.
c) point (b) hinges on why human evolution is ‘good’ in any broad or inclusive sense (outside of letting current and near-current generations live wealthier, longer lives, if indeed those are good things).
In order to answer this, it feels like we need some way of defining value ‘from the point of view of the universe’. That particular phrase is a Sidgwick/Singer thing, and I’m not sure it is directly applicable in this context (like similar phrases, e.g. Nagel’s ‘view from nowhere’), but without this it is very hard to talk about non-species based notions of value (i.e. standard utilitarianism and deontological/virtue approaches all basically rely on human or animal beings).
My candidate for this ‘cosmic value’ is something like created complexity (which can be physical or not, and can include things that are not obviously economically/militarily/reproductively valuable like art). This includes having trillions of diverse computing entities (human or otherwise).
This is obviously pretty hand-wavey, but I’d be interested in talking to anyone with views (it’s basically my PhD :-)
I think this post misses the key considerations for perspective (1): longtermist-style scope sensitive utilitarianism. In this comment, I won’t make a positive case for the value of preventing AI takeover from a perspective like (1), but I will argue why I think the discussion in this post mostly misses the point.
(I separately think that preventing unaligned AI control of resources makes sense from perspective (1), but you shouldn’t treat this comment as my case for why this is true.)
You should treat this comment as (relatively : )) quick and somewhat messy notes rather than a clear argument. Sorry, I might respond to this post in a more clear way later. (I’ve edited this comment to add some considerations which I realized I neglected.)
I might be somewhat biased in this discussion as I work in this area and there might be some sunk costs fallacy at work.
First:
It seems odd to me that you don’t focus almost entirely on this sort of argument when considering total utilitarian style arguments. Naively these views are fully dominated by the creation of new entities who are far more numerous and likely could be much more morally valuable than economically productive entities. So, I’ll just be talking about a perspective basically like this perspective where creating new beings with “good” lives dominates.
With that in mind, I think you fail to discuss a large number of extremely important considerations from my perspective:
Over time (some subset of) humans (and AIs) will reflect on their views and preferences and will consider utilizing resources in different ways.
Over time (some subset of) humans (and AIs) will get much, much smarter or, more minimally, will receive advice from entities which are much smarter.
It seems likely to me that the vast, vast majority of moral value (from this sort of utilitarian perspective) will be produced via people trying to improve moral value rather than incidentally via economic production. This applies for both aligned and unaligned AI. I expect that only a tiny fraction of available computation goes toward optimizing economic production, that only a smaller fraction of this is morally relevant, and that the weight on this moral relevance is much lower than for computation specifically optimized for moral relevance when operating from a similar perspective. This bullet is somewhere between a consideration and a claim, though it seems like possibly our biggest disagreement. I think it’s possible that this disagreement is driven by some of the other considerations I list.
Exactly what types of beings are created might be much more important than quantity.
Ultimately, I don’t care about a simplified version of total utilitarianism, I care about what preferences I would endorse on reflection. There is a moderate a priori argument for thinking that other humans which bother to reflect on their preferences might end up in a similar epistemic state. And I care less about the preferences which are relatively contingent among people who are thoughtful about reflection.
Large fractions of current wealth of the richest people are devoted toward what they claim is altruism. My guess is that this will increase over time.
Just doing a trend extrapolation on people who state an interest in reflection and scope sensitive altruism already indicates a non-trivial fraction of resources if we weight by current wealth/economic power. (I think, I’m not totally certain here.) This case is even stronger if we consider groups with substantial influence over AI.
Being able to substantially affect the preferences of (at least partially unaligned) AIs that will seize power/influence still seems extremely leveraged under perspective (1) even if we accept the arguments in your post. I think this is less leveraged than retaining human control (as we could always later create AIs with the preferences we desire and I think people with a similar perspective to me will have substantial power). However, it is plausible that under your empirical views the dominant question in being able to influence the preferences of these AIs is whether you have power, not whether you have technical approaches which suffice.
I think if I had your implied empirical views about how humanity and unaligned AIs use resources, I would be very excited about a proposal like “politically agitate for humanity to defer most resources to an AI successor which has moral views that people can agree are broadly reasonable and good behind the veil of ignorance”. I think your views imply that massive amounts of value are left on the table in either case, such that humanity (hopefully willingly) forfeiting control to a carefully constructed successor looks amazing.
Humans who care about using vast amounts of computation might be able to use their resources to buy this computation from people who don’t care. Suppose 10% of people (really, resource-weighted people) care about reflecting on their moral views and doing scope sensitive altruism of a utilitarian bent, and 90% of people care about jockeying for status without reflecting on their views. It seems plausible to me that the 90% will jockey for status via things that consume relatively small amounts of computation, like buying fancier pieces of land on earth or the coolest looking stars, while the 10% of people who care about using vast amounts of computation can buy it relatively cheaply. Thus, most of the computation will go to those who care (see the illustrative sketch at the end of this comment). Probably most people who don’t reflect and buy purely positional goods will care less about computation than about things like random positional goods (e.g. land on earth, which will be bid up to (literally) astronomical prices). I could see fashion going either way, but computation becoming the dominant status good seems unlikely unless people do heavy reflection. And if they heavily reflect, then I expect more altruism etc.
Your preference based arguments seem uncompelling to me because I expect that the dominant source of beings won’t be due to economic production. But I also don’t understand a version of preference utilitarianism which seems to match what you’re describing, so this seems mostly unimportant.
Given some of our main disagreements, I’m curious what you think humans and unaligned AIs will be economically consuming.
Also, to be clear, none of the considerations I listed make a clear and strong case for unaligned AI being less morally valuable, but they do make the case that the relevant argument here is very different from the considerations you seem to be listing. In particular, I think value won’t be coming from incidental consumption.
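To make the computation-buying intuition above concrete, here is a minimal toy sketch with made-up, purely illustrative numbers (none of these figures come from the thread): if status-seekers spend mostly on positional goods while the altruistic minority spends mostly on computation, the minority ends up with the bulk of the computation.

```python
# Toy illustration only: every number here is an assumption, not a claim from the discussion.
altruist_share = 0.10   # resource-weighted fraction who want computation for reflection/altruism
status_share = 0.90     # fraction who mainly want positional goods (land, stars, etc.)

# Assumed spending patterns: status-seekers put little of their wealth into computation,
# because the goods they compete over are positional rather than compute-intensive.
altruist_spend_on_compute = 0.95
status_spend_on_compute = 0.05

altruist_compute_budget = altruist_share * altruist_spend_on_compute   # 0.095
status_compute_budget = status_share * status_spend_on_compute         # 0.045

altruist_fraction_of_compute = altruist_compute_budget / (
    altruist_compute_budget + status_compute_budget
)
print(f"Computation bought by the altruistic 10%: {altruist_fraction_of_compute:.0%}")  # ~68%
```

Under these assumed spending patterns, the 10% who care about computation end up controlling roughly two thirds of it despite holding a small share of total wealth; the conclusion is of course sensitive to the assumed numbers.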
If you could highlight only one consideration that you think I missed in my post, which one would you highlight? And (to help me understand it) can you pose the consideration in the form of an argument, in a way that directly addresses my thesis?
Hmm, this is more of a claim than a consideration, but I’d highlight:
The main thing this claim disputes is:
(and some related points).
Sorry, I don’t think this exactly addresses your comment. I’ll maybe try to do a better job in a bit. I think a bunch of the considerations I mention are relatively diffuse, but important in aggregate.
Maybe the most important single consideration is something like:
Value can be extremely dense in computation relative to the density of value from AIs used for economic activity (rather than used to produce value directly).
So, we should focus on the question of entities trying to create morally valuable lives (or experience or whatever relevant similar property we care about) and then answer this.
(You do seem to talk about “will AIs have more/less utilitarian impulses than humans”, but you seem to talk about this almost entirely from the perspective of growing the economy rather than question like how good the lives will be.)
Do you have an argument for why humans are more likely to try to create morally valuable lives compared to unaligned AIs?
I personally feel I addressed this particular question already in the post, although I framed it slightly differently than you have here. So I’m trying to get a better sense as to why you think my argument in the post about this is weak.
A short summary of my position is that unaligned AIs could be even more utilitarian than humans are, and this doesn’t seem particularly unlikely either given that (1) humans are largely not utilitarians themselves, (2) consciousness doesn’t seem special or rare, so it’s likely that unaligned AIs could care about it too, and (3) unaligned AIs will be trained on human data, so they’ll likely share our high-level concepts about morality even if not our exact preferences.
Let me know what considerations you think I’m still missing here.
[ETA: note that after writing this comment, I sharpened the post slightly to make it a little more clear that this was my position in the post, although I don’t think I fundamentally added new content to the post.]
TBC, the main point I was trying to make was that you didn’t seem to be presenting arguments about what seems to me like the key questions. Your summary of your position in this comment seems much closer to arguments about the key questions than I interpreted your post being. I interpreted your post as claiming that most value would result from incidental economic consumption under either humans or unaligned AIs, but I think you maybe don’t stand behind this.
Separately, I think the “maybe AIs/humans will be selfish and/or not morally thoughtful” argument mostly just hits both unaligned AIs and humans equally hard such that it just gets normalized out. And then the question is more about how much you care about the altruistic and morally thoughtful subset.
(E.g., the argument you make in this comment seemed to me like about 1⁄6 of your argument in the post and it’s still only part of the way toward answering the key questions from my perspective. I think I partially misunderstood the emphasis of your argument in the post.)
I do have arguments for why I think human control is more valuable than control by AIs that seized control from humans, but I’m not going to explain them in detail in this comment. My core summary would be something like “I expect substantial convergence toward my utilitarian-ish views among morally thoughtful humans who reflect; I expect notably less convergence between me and AIs. I expect that AIs will have somewhat messed up, complex, and specific values, as a result of current training processes, in ways which might make them not care about things we care about, while I don’t have such an argument for humans.”
As far as what I do think the key questions are, they are something like:
What preferences do humans/AIs end up with after radically longer lives, massive self-enhancement, and potentially long periods of reflection?
How much do values/views diverge/converge between different altruistically minded humans who’ve thought about it for extremely long durations?
Even if various entities are into creating “good experiences”, how much do these views diverge on what is best? My guess would be that even if two entities are each maximizing good experiences from their own perspective, the goodness per unit of compute can be much lower as judged by the other entity (e.g. easily 100x lower, maybe more).
How similar are my views on what is good after reflection to other humans vs AIs?
How much should we care about worlds where morally thoughtful humans reach radically different conclusions on reflection?
Structurally, what sorts of preferences do AI training processes impart to AIs, conditional on these AIs successfully seizing power? (I also think this is likely to happen despite humanity resisting to at least some extent.)
It seems like your argument is something like “who knows about AI preferences, also, they’ll probably have similar concepts as we do” and “probably humanity will just have the same observed preferences as they currently do”.
But I think we can get much more specific guesses about AI preferences, such that this weak indifference principle seems unimportant, and I think human preferences will change radically, e.g. preferences will change far more in the next 10 million years than in the last 2000 years.
Note that I’m not making an argument for the greater value of human control in this comment, just trying to explain why I don’t think your argument is very relevant. I might try to write up something about my overall views here, but it doesn’t seem like my comparative advantage and it currently seems non-urgent from my perspective. (Though embarrassing for the field as a whole.)
It’s possible we’re using these words differently, but I guess I’m not sure why you’re downplaying the value of economic consumption here. I focused on economic consumption for a simple reason: economic consumption is intrinsically about satisfying the preferences of agents, including the type of preferences you seem to think matter. For example, I’d classify most human preferences as consumption, including their preference to be happy, which they try to satisfy via various means.
If either a human or an AI optimizes for their own well-being by giving themselves an extremely high intensity positive experience in the future, I don’t think that would be vastly morally outweighed by someone doing something similar but for altruistic reasons. Just because the happiness arises from a selfish motive seems like no reason, by itself, to disvalue it from a utilitarian perspective.
As a consequence, I simply do not agree with the intuition that economic consumption is a rounding error compared to the much smaller fraction of resources spent on altruistic purposes.
I disagree because I don’t see why altruism will be more intense than selfishness from a total utilitarian perspective, in the sense you are describing. If an AI makes themselves happy for selfish reasons, that should matter just as much as an AI creating another AI to make them happy.
Now again, you could just think that AIs aren’t likely to be conscious, or aren’t likely to be motivated to make themselves happy in any sort of selfish sense. And so an unaligned world could be devoid of extremely optimized utilitarian value. But this argument was also addressed at length in my post, and I don’t know what your counterargument is to it.
Ah, sorry, I was referring to the process of AI labor being used to produce the economic output not having much total moral value. I thought you were arguing that aligned AIs being used to produce goods would be where most value comes from, because of the vast numbers of such AIs laboring relative to other entities. Sorry, by “from incidental economic consumption” I actually meant “incidentally (as a side effect of) economic consumption”. This is in response to things like:
As far as the other thing you say, I still disagree, though for different (related) reasons:
I don’t agree with “much smaller”, and I think “rounding error” is reasonably likely as far as the selfish preferences of currently existing humans or the AIs that seize control go. (These entities might (presumably altruistically) create entities which then selfishly satisfy their preferences, but that seems pretty different.)
My main counterargument is that selfish preferences will result in wildly fewer entities if such entities aren’t into (presumably altruistically) making more entities, and thus will be extremely inefficient. Of course it’s possible that you have AIs with non-indexical preferences which are nonetheless de facto selfish in other ways.
E.g., for humans you have 10^10 beings which are probably radically inefficient at producing moral value. For AIs it’s less clear and depends heavily on how you operationalize selfishness.
I have a general view like “in the future, the main way you’ll get specific things that you might care about is via people trying specifically to make those things because optimization is extremely powerful”.
I’m probably not going to keep responding, as I don’t think I’m comparatively advantaged in fleshing this out, and doing this in a comment section seems suboptimal. If this is anyone’s crux for working on AI safety, though, consider contacting me and I’ll consider setting you up with someone who I think understands my views and would be willing to go through the relevant arguments with you. The same offer applies to you, Matthew, particularly if this is a crux, but I think we should use a medium other than EA Forum comments.
Admittedly I worded things poorly in that part, but the paragraph you quoted was intended to convey how consciousness is most likely to come about in AIs, rather than to say that the primary source of value in the world will come from AIs laboring for human consumption.
These are very subtly different points, and I’ll have to work on making my exposition here more clear in the future (including potentially re-writing that part of the essay).
Note that a small human population size is an independent argument here for thinking that AI alignment might not be optimal from a utilitarian perspective. I didn’t touch on this point in this essay because I thought it was already getting too complex and unwieldy as it was, but the idea here is pretty simple, and it seems you’ve already partly spelled out the argument. If AI alignment causes high per capita incomes (because it enriches humans with a small population size), then plausibly this is worse than having a far larger population of unaligned AIs who have lower per capita consumption, from a utilitarian point of view.
Both seem negligible relative to the expected amount of compute spent on optimized goodness, in my view.
Also, I’m not sold that there will be more AIs, it depends on pretty complex details about AI preferences. I think it’s likely AIs won’t have preferences for their own experiences given current training methods and will instead have preferences for causing certain outcomes.
Both will presumably be forms of consumption, which could be in the form of compute spent on optimized goodness. You seem to think compute will only be used for optimized goodness for non-consumption purposes (which is why you care about the small fraction of resources spent on altruism) and I’m saying I don’t see a strong case for that.
I’m also not sold it’s that small.
Regardless, doesn’t seem like we’re making progress here.
You have no obligation to reply, of course, but I think we’d achieve more progress if you clarified your argument in a concise format that explicitly outlines the assumptions and conclusion.
As far as I can gather, your argument seems to be a mix of assumptions about humans being more likely to optimize for goodness (why?), partly because they’re more inclined to reflect (why?), which will lead them to allocate more resources towards altruism rather than selfish consumption (why is that significant?). Without understanding how your argument connects to mine, it’s challenging to move forward on resolving our mutual disagreement.
Fwiw I had a similar reaction as Ryan.
My framing would be: it seems pretty wild to think that total utilitarian values would be better served by unaligned AIs (whose values we don’t know) rather than humans (where we know some are total utilitarians). In your taxonomy this would be “humans are more likely to optimize for goodness”.
Let’s make a toy model compatible with your position:
Let’s say that there are a million values that one could have with “humanity’s high-level concepts about morality”, one of which is “Rohin’s values”.
For (3), we’ll say that both unaligned AI values and human values are a subset sampled uniformly at random from these million values (all values in the subset weighted equally, for simplicity).
For (1), we’ll say that the sampled human values include “Rohin’s values”, but only as one element in the set of sampled human values.
I won’t make any special distinction about consciousness so (2) won’t matter.
In this toy model you’d expect aligned AI to put 1⁄1,000 weight on “Rohin’s values”, whereas unaligned AI puts 1⁄1,000,000 weight in expectation on “Rohin’s values” (if the unaligned AI has S values, then there’s an S/1,000,000 probability of it containing “Rohin’s values”, and it is weighted 1/S if present). So aligned AI looks a lot better.
More generally, ceteris paribus, keeping values intact prevents drift and so looks strongly positive from the point of view of the original values, relative to resampling values “from scratch”.
(Feel free to replace “Rohin’s values” with “utilitarianism” if you want to make the utilitarianism version of this argument.)
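To make the expected-weight arithmetic in this toy model explicit (a minimal sketch, assuming the sampled human value set has roughly 1,000 elements, which is what the stated 1⁄1,000 weight implies):

$$
\mathbb{E}\left[w_{\text{unaligned}}\right] = \frac{S}{10^{6}} \cdot \frac{1}{S} = \frac{1}{10^{6}}, \qquad w_{\text{aligned}} = \frac{1}{10^{3}},
$$

where $S$ is the number of values the unaligned AI samples, $S/10^{6}$ is the probability that “Rohin’s values” are among them, and $1/S$ is the weight they receive if present. Since the unaligned expectation is independent of $S$, the aligned case comes out roughly 1,000x better in expectation from the point of view of “Rohin’s values” (or “utilitarianism”, per the substitution above), under these toy assumptions.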
Imo basically everything that Ryan says in this comment thread is a counter-counterargument to a counterargument to this basic argument. E.g. someone might say “oh, it doesn’t matter which values you’re optimizing for, all of the value is in the subjective experience of the AIs that are laboring to build new chips, not in the consumption of the new chips” and the rebuttal to that is “Value can be extremely dense in computation optimized directly for value, relative to the density of value in computation from AIs used for economic activity.”
I’m curious: Does your reaction here similarly apply to ordinary generational replacement as well?
Let me try to explain what I’m asking.
We have a set of humans who exist right now. We know that some of them are utilitarians. At least one of them shares “Rohin’s values”. Similar to unaligned AIs, we don’t know the values of the next generation of humans, although presumably they will continue to share our high-level moral concepts since they are human and will be raised in our culture. After the current generation of humans die, the next generation could have different moral values.
As far as I can tell, the situation with regards to the next generation of humans is analogous to unaligned AI in the basic sense I’ve just laid out (mirroring the part of your comment I quoted). So, in light of that, would you similarly say that it’s “pretty wild to think that total utilitarian values would be better served by a future generation of humans”?
One possible answer here: “I’m not very worried about generational replacement causing moral values to get worse since the next generation will still be human.” But if this is your answer, then you seem to be positing that our moral values are genetic and innate, rather than cultural, which is pretty bold, and presumably merits a defense. This position is IMO largely empirically ungrounded, although it depends on what you mean by “moral values”.
Another possible answer is: “No, I’m not worried about generational replacement because we’ve seen a lot of human generations already and we have lots of empirical data on how values change over time with humans. AI could be completely different.” This would be a reasonable response, but as a matter of empirical fact, utilitarianism did not really culturally exist 500 or 1000 years ago. This indicates that it’s plausibly quite fragile, in a similar way it might also be with AI. Of course, values drift more slowly with ordinary generational replacement compared to AI, but the phenomenon still seems roughly pretty similar. So perhaps you should care about ordinary value drift almost as much as you’d care about unaligned AIs.
If you do worry about generational value drift in the strong sense I’ve just described, I’d argue this should cause you to largely adopt something close to position (3) that I outlined in the post, i.e. the view that what matters is preserving the lives and preferences of people who currently exist (rather than the species of biological humans in the abstract).
To the extent that future generations would have pretty different values than me, like “the only glory is in war and it is your duty to enslave your foes”, along with the ability to enact their values on the reachable universe, in fact that would seem pretty bad to me.
However, I expect the correlation between my values and future generation values is higher than the correlation between my values and unaligned AI values, because I share a lot more background with future humans than with unaligned AI. (This doesn’t require values to be innate, values can be adaptive for many human cultures but not for AI cultures.) So I would be less worried about generational value drift (but not completely unworried).
In addition, this worry is tempered even more by the possibility that values / culture will be set much more deliberately in the nearish future, rather than drifting via ordinary cultural processes, simply because with an intelligence explosion that becomes more possible to do than it is today.
Huh? I feel very confused about this, even if we grant the premise. Isn’t the primary implication of the premise to try to prevent generational value drift? Why am I only prioritizing people with similar values, instead of prioritizing all people who aren’t going to enact large-scale change? Why would the priority be on current people, instead of people with similar values (there are lots of future people who have more similar values to me than many current people)?
To clarify, I think it’s a reasonable heuristic that, if you want to preserve the values of the present generation, you should try to minimize changes to the world and enforce some sort of stasis. This could include not building AI. However, I believe you may be glossing over the distinction between: (1) the values currently held by existing humans, and (2) a more cosmopolitan, utilitarian ethical value system.
We can imagine a wide variety of changes to the world that would result in a vast changes to (1) without necessarily being bad according to (2). For example:
We could start doing genetic engineering of humans.
We could upload humans onto computers.
A human-level, but conscious, alien species could immigrate to Earth via a portal.
In each scenario, I agree with your intuition that “the correlation between my values and future humans is higher than the correlation between my values and X-values, because I share much more background with future humans than with X”, where X represents the forces at play in each scenario. However, I don’t think it’s clear that the resulting change to the world would be net negative from the perspective of an impartial, non-speciesist utilitarian framework.
In other words, while you’re introducing something less similar to us than future human generations in each scenario, it’s far from obvious whether the outcome will be relatively worse according to utilitarianism.
Based on your toy model, my guess is that your underlying intuition is something like, “The fact that a tiny fraction of humans are utilitarian is contingent. If we re-rolled the dice, and sampled from the space of all possible human values again (i.e., the set of values consistent with high-level human moral concepts), it’s very likely that <<1% of the world would be utilitarian, rather than the current (say) 1%.”
If this captures your view, my main response is that it seems to assume a much narrower and more fragile conception of “cosmopolitan utilitarian values” than the version I envision, and it’s not a moral perspective I currently find compelling.
Conversely, if you’re imagining a highly contingent, fragile form of utilitarianism that regards the world as far worse under a wide range of changes, then I’d argue we also shouldn’t expect future humans to robustly hold such values. This makes it harder to claim the problem of value drift is much worse for AI compared to other forms of drift, since both are simply ways the state of the world could change, which was the point of my previous comment.
I’m not sure I understand which part of the idea you’re confused about. The idea was simply:
Let’s say that your view is that generational value drift is very risky, because future generations could have values that are much worse, by the lights of the values you care about, than those of the current generation
In that case, you should try to do what you can to stop generational value drift
One way of stopping generational value drift is to try to prevent the current generation of humans from dying, and/or having their preferences die out
This would look quite similar to the moral view in which you’re trying to protect the current generation of humans, which was the third moral view I discussed in the post.
The reason the priority would be on current people rather than those with similar values is that, by assumption, future generations will have different values due to value drift. Therefore, the ~best strategy to preserve current values would be to preserve existing people. This seems relatively straightforward to me, although one could certainly question the premise of the argument itself.
Let me know if any part of the simplified argument I’ve given remains unclear or confusing.
No, this was purely to show why, from the perspective of someone with values, re-rolling those values would seem bad, as opposed to keeping the values the same, all else equal. In any specific scenario, (a) all else won’t be equal, and (b) the actual amount of worry depends on the correlation between current values and re-rolled values.
The main reason I made utilitarianism a contingent aspect of human values in the toy model is because I thought that’s what you were arguing (e.g. when you say things like “humans are largely not utilitarians themselves”). I don’t have a strong view on this and I don’t think it really matters for the positions I take.
The first two seem broadly fine, because I still expect high correlation between values. (Partly because I think that cosmopolitan utilitarian-ish values aren’t fragile.)
The last one seems more worrying than human-level unaligned AI (mostly because we have less control over the aliens) but less worrying than unaligned AI in general (since the aliens aren’t superintelligent).
Note I’ve barely thought about these scenarios, so I could easily imagine changing my mind significantly on these takes. (Though I’d be surprised if it got to the point where I thought it was comparable to unaligned AI, in how much the values could stop correlating with mine.)
It seems way better to simply try to spread your values? It’d be pretty wild if the EA field-builders said “the best way to build EA, taking into account the long-term future, is to prevent the current generation of humans from dying, because their preferences are most similar to ours”.
I think there may have been a misunderstanding regarding the main point I was trying to convey. In my post, I fairly explicitly argued that the rough level of utilitarian values exhibited by humans is likely not very contingent, in the sense of being unusually high compared to other possibilities—and this was a crucial element of my thesis. This idea was particularly important for the section discussing whether unaligned AIs will be more or less utilitarian than humans.
When you quoted me saying “humans are largely not utilitarians themselves,” I intended this point to support the idea that our current rough level of utilitarianism is not contingent, rather than the opposite claim. In other words, I meant that the fact that humans are not highly utilitarian suggests that this level of utilitarianism is not unusual or contingent upon specific circumstances, and we might expect other intelligent beings, such as aliens or AIs, to exhibit similar, or even greater, levels of utilitarianism.
Compare to the hypothetical argument: humans aren’t very obsessed with building pyramids --> our current level of obsession with pyramid building is probably not unusual, in the sense that you might easily expect aliens/AIs to be similarly obsessed with building pyramids, or perhaps even more obsessed.
(This argument is analogous because pyramids are simple structures that lots of different civilizations would likely stumble upon. Similarly, I think “try to create lots of good conscious experiences” is also a fairly simple directive, if indeed aliens/AIs/whatever are actually conscious themselves.)
I think the question of whether utilitarianism is contingent or not matters significantly for our disagreement, particularly if you are challenging my post or the thesis I presented in the first section. If you are very uncertain about whether utilitarianism is contingent in the sense that is relevant to this discussion, then I believe that aligns with one of the main points I made in that section of my post.
Specifically, I argued that the degree to which utilitarianism is contingent vs. common among a wide range of intelligent beings is highly uncertain and unclear, and this uncertainty is an important consideration when thinking about the values and behaviors of advanced AI systems from a utilitarian perspective. So, if you are expressing strong uncertainty on this matter, that seems to support one of my central claims in that part of the post.
(My view, as expressed in the post, is that unaligned AIs have highly unclear utilitarian value but there’s a plausible scenario where they are roughly net-neutral, and indeed I think there’s a plausible scenario where they are even more valuable than humans, from a utilitarian point of view.)
I think this part of your comment plausibly confuses two separate points:
How to best further your own values
How to best further the values of the current generation.
I was arguing that trying to preserve the present generation of humans looks good according to (2), not (1). That said, to the extent that your values simply mirror the values of your generation, I don’t understand your argument for why trying to spread your values would be “way better” than trying to preserve the current generation. Perhaps you can elaborate?
Given my new understanding of the meaning of “contingent” here, I’d say my claims are:
I’m unsure about how contingent the development of utilitarianism in humans was. It seems quite plausible that it was not very historically contingent. I agree my toy model does not accurately capture my views on the contingency of total utilitarianism.
I’m also unsure how contingent it is for unaligned AI, but aggregating over my uncertainty suggests more contingent.
One way to think about this is to ask: why are any humans utilitarians? To the extent it’s for reasons that don’t apply to unaligned AI systems, I think you should feel like it is less likely for unaligned AI systems to be utilitarians. So e.g. if I thought humans were utilitarians primarily because it is simple to express in concepts that humans and AIs share, then I would agree with you. But in fact I feel like it is pretty important that humans feel pleasure and pain, and have empathy, to explain why some humans are utilitarians. (Mostly I think the “true explanation” will have to appeal to more than simplicity, and the additional features this “true explanation” will appeal to are very likely to differ between humans and AIs.)
Indeed I feel like AIs probably build fewer pyramids in expectation, for basically the same reason. (The concrete hypothesis I generated for why humans build pyramids was “maybe pyramids were especially easy to build historically”.)
General note: I want to note that my focus on AI alignment is not necessarily coming from a utilitarian perspective. I work on AI alignment because in expectation I think a world with aligned AI will better reflect “my values” (which I’m uncertain about and may not reflect utilitarianism) than a world with unaligned AI. But I’m happy to continue talking about the implications for utilitarians.
Thanks for trying to better understand my views. I appreciate you clearly stating your reasoning in this comment, as it makes it easier for me to directly address your points and explain where I disagree.
You argued that feeling pleasure and pain, as well as having empathy, are important factors in explaining why some humans are utilitarians. You suggest that to the extent these reasons for being utilitarian don’t apply to unaligned AIs, we should expect it to be less likely for them to be utilitarians compared to humans.
However, a key part of the first section of my original post was about whether unaligned AIs are likely to be conscious—which for the purpose of this discussion, seems roughly equivalent to whether they will feel pleasure and pain. I concluded that unaligned AIs are likely to be conscious for several reasons:
Consciousness seems to be a fairly convergent function of intelligence, as evidenced by the fact that octopuses are widely accepted to be conscious despite sharing almost no homologous neural structures with humans. This suggests consciousness arises somewhat robustly in sufficiently sophisticated cognitive systems.
Leading theories of consciousness from philosophy and cognitive science don’t appear to predict that consciousness will be rare or unique to biological organisms. Instead, they tend to define consciousness in terms of information processing properties that AIs could plausibly share.
Unaligned AIs will likely be trained in environments quite similar to those that gave rise to human and animal consciousness—for instance, they will be trained on human cultural data and, in the case of robots, will interact with physical environments. The evolutionary and developmental pressures that gave rise to consciousness in biological organisms would thus plausibly apply to AIs as well.
So in short, I believe unaligned AIs are likely to feel pleasure and pain, for roughly the reasons I think humans and animals do. Their consciousness would not be an improbable or fragile outcome, but more likely a robust product of being a highly sophisticated intelligent agent trained in environments similar to our own.
I did not directly address whether unaligned AIs would have empathy, though I find this fairly likely as well. At the very least, I expect they would have cognitive empathy—the ability to model and predict the experiences of others—as this is clearly instrumentally useful. They may lack affective empathy, i.e. the ability to share the emotions of others, which I agree could be important here. But it’s notable that explicit utilitarianism seems, anecdotally, to be more common among people on the autism spectrum, who are characterized as having reduced affective empathy. This suggests affective empathy may not be strongly predictive of utilitarian motivations.
Let’s say you concede the above points and say: “OK I concede that unaligned AIs might be conscious. But that’s not at all assured. Unaligned AIs might only be 70% likely to be conscious, whereas I’m 100% certain that humans are conscious. So there’s still a huge gap between the expected value of unaligned AIs vs. humans under total utilitarianism, in a way that overwhelmingly favors humans.”
However, this line of argument would overlook the real possibility that unaligned AIs could be more conscious than humans, or have an even stronger tendency towards utilitarian motivations. This could be the case if, for instance, AIs are more cognitively sophisticated than humans or are more efficiently designed in a morally relevant sense. Given that the vast majority of humans do not seem to be highly motivated by utilitarian considerations, it doesn’t seem like an unlikely possibility that AIs could exceed our utilitarian inclinations. Nor does it seem particularly unlikely that their minds could have a higher density of moral value per unit of energy, or matter.
We could similarly examine this argument in the context of considering other potential large changes to the world, such as creating human emulations, genetically engineered humans, or bringing back Neanderthals from extinction. In each case, I do not think the (presumably small) probability that the entities we are adding to the world are not conscious constitutes a knockdown argument against the idea that they would add comparable utilitarian value to the world compared to humans. The main reason is because these entities could be even better by utilitarian lights than humans are.
This seems minor, but I think the relevant claim is whether AIs would build more pyramids going forward, compared to humans, rather than comparing to historical levels of pyramid construction among humans. If pyramids were easy to build historically, but this fact is no longer relevant, then that seems true now for both humans and AIs, into the foreseeable future. As a consequence it’s hard for me to see a strong reason for preferring humans over AIs if you cared about pyramid-maximization. By essentially the same arguments I gave above about utilitarianism, I don’t think there’s a strong argument for thinking that aligning AIs is good from the perspective of pyramid maximization.
This makes sense to me, but it’s hard to say much about what’s good from the perspective of your values if I don’t know what those values are. I focused on total utilitarianism in the post because it’s probably the most influential moral theory in EA, and it’s the explicit theory used in Nick Bostrom’s influential article Astronomical Waste, and this post was partly intended as a reply to that article (see the last few paragraphs of the post).
I can believe that if the population you are trying to predict for is just humans, almost all of whom have at least some affective empathy. But I’d feel pretty surprised if this were true in whatever distribution over unaligned AIs we’re imagining. In particular, I think if there’s no particular reason to expect affective empathy in unaligned AIs, then your prior on it being present should be near-zero (simply because there are lots of specific claims one could make about unaligned AIs, most of which will be false). And I’d be surprised if “zero vs non-zero affective empathy” was not predictive of utilitarian motivations.
I definitely agree that AIs might feel pleasure and pain, though I’m less confident in it than you seem to be. It just seems like AI cognition could be very different from human cognition. For example, I would guess that pain/pleasure are important for learning in humans, but it seems like this is probably not true for AI systems in the current paradigm. (For gradient descent, the learning and the cognition happen separately—the AI cognition doesn’t even get the loss/reward equivalent as an input so cannot “experience” it. For in-context learning, it seems very unclear what the pain/pleasure equivalent would be.)
I agree this is possible. But ultimately I’m not seeing any particularly strong reasons to expect this (and I feel like your arguments are mostly saying “nothing rules it out”). Whereas I do think there’s a strong reason to expect weaker tendencies: AIs will be different, and on average different implies fewer properties that humans have. So aggregating these I end up concluding that unaligned AIs will be less utilitarian in expectation.
(You make a bunch of arguments for why AIs might not be as different as we expect. I agree that if you haven’t thought about those arguments before you should probably reduce your expectation of how different AIs will be. But I still think they will be quite different.)
I don’t see why it matters if AIs are more conscious than humans? I thought the relevant question we’re debating is whether they are more likely to be utilitarians. Maybe the argument is that if they are more conscious-in-the-sense-of-feeling-pleasure-and-pain they are more likely to be utilitarians? If so I might buy that but feel like it’s a weak effect.
Sure, but a big difference is that no human cares about pyramid-maximization, whereas some humans are utilitarians?
(Maybe some humans do care about pyramid-maximization? I’d need to learn more about those humans before I could have any guess about whether to prefer humans over AIs.)
I would say “fairly convergent function of biologically evolved intelligence”. Evolution faced lots of constraints we don’t have in AI design. For example, cognition and learning had to be colocated in space and time (i.e. done in a single brain), whereas for AIs these can be (and are) separated. Seems very plausible that consciousness-in-the-sense-of-feeling-pleasure-and-pain is a solution needed under the former constraint but not the latter. (Maybe I’m at 20% chance that something in this vicinity is right, though that is a very made-up number.)
Here are a few (long, but high-level) comments I have before responding to a few specific points that I still disagree with:
I agree there are some weak reasons to think that humans are likely to be more utilitarian on average than unaligned AIs, for basically the reasons you talk about in your comment (I won’t express individual agreement with all the points you gave that I agree with, but you should know that I agree with many of them).
However, I do not yet see any strong reasons supporting your view. (The main argument seems to be: AIs will be different than us. You label this argument as strong but I think it is weak.) More generally, I think that if you’re making hugely consequential decisions on the basis of relatively weak intuitions (which is what I believe many effective altruists do in this context), you should be very cautious. The lack of robust evidence for your position seems sufficient, in my opinion, for the main thesis of my original post to hold. (I think I was pretty careful in my language not to overstate the main claims.)
I suspect you may have an intuition that unaligned AIs will be very alien-like in certain crucial respects, but I predict this intuition will ultimately prove to be mistaken. In contrast, I think the fact that these AIs will be trained on human-generated data and deliberately shaped by humans to fulfill human-like functions and to be human-compatible should be given substantial weight. These factors make it quite likely, in my view, that the resulting AI systems will exhibit utilitarian tendencies to a significant degree, even if they do not share the preferences of either their users or their creators (for instance, I would guess that GPT-4 is already more utilitarian than the average human, in a meaningful sense).
There is a strong selection pressure for AIs to display outward behaviors that are not overly alien-like. Indeed, the pressure seems to be for AIs to be inhumanly altruistic and kind in their actions. I am not persuaded by the idea that it’s probable for AIs to be entirely human-compatible on the surface while being completely alien underneath, even if we assume they do not share human preferences (e.g., the “shoggoth” meme).
I disagree with the characterization that my argument relies primarily on the notion that “you can’t rule out” the possibility of AIs being even more utilitarian than humans. In my previous comment, I pointed out that AIs could potentially have a higher density of moral value per unit of matter, and I believe there are straightforward reasons to expect this to be the case, as AIs could be optimized very efficiently in terms of physical space. This is not merely a “you can’t rule it out” type of argument, in my view.
Similarly, in the post, I pointed out that humans have many anti-utilitarian intuitions and it seems very plausible that AIs would not share (or would share fewer of) these intuitions. To give another example (although it was not prominent in the post), in a footnote I alluded to the idea that AIs might care more about reproduction than humans (who, by comparison, seem to want small population sizes with high per-capita incomes, rather than the large population sizes with low per-capita incomes that utilitarianism would recommend). This too does not seem like a mere “you cannot rule it out” argument to me, although I agree it is not the type of knockdown argument you’d expect if my thesis were stated far more strongly than it actually was.
I think you may be giving humans too much credit for being slightly utilitarian. To the extent that there are indeed many humans who are genuinely obsessed with actively furthering utilitarian objectives, I agree that your argument would have more force. However, I think that this is not really what we actually observe in the real world to a large degree. I think it’s exaggerated at least; even within EA I think that’s somewhat rare.
I suspect there is a broader phenomenon at play here, whereby people (often those in the EA community) attribute a wide range of positive qualities to humans (such as the idea that our values converge upon reflection, or the idea that humans will get inherently kinder as they get wealthier) which, in my opinion, do not actually reflect the realities of the world we live in. These ideas seem (to me) to be routinely almost entirely disconnected from any empirical analysis of actual human behavior, and they sometimes appear to be more closely related to what the person making the claim wishes to be true in some kind of idealized, abstract sense (though I admit this sounds highly uncharitable).
My hypothesis is that this tendency can perhaps be explained by a deeply ingrained intuition that identifies the species boundary of “humans” as being very special, in the sense that virtually all moral value is seen as originating from within this boundary, sharply distinguishing it from anything outside this boundary, and leading to an inherent suspicion of non-human entities. This would explain, for example, why there is so much focus on “human values” (and comparatively little on drawing the relevant “X values” boundary along different lines), and why many people seem to believe that human emulations would be clearly preferable to de novo AI. I do not really share this intuition myself.
My basic thoughts here are: on the one hand we have real world data points which can perhaps relevantly inform the degree to which affective empathy actually predicts utilitarianism, and on the other hand we have an intuition that it should be predictive across beings of very different types. I think the real world data points should epistemically count for more than the intuitions? More generally, I think it is hard to argue about what might be true if real world data counts for less than intuitions.
Isn’t this the effect you alluded to, when you named reasons why some humans are utilitarians?
… This seems to be saying that because we are aligning AI, they will be more utilitarian. But I thought we were discussing unaligned AI?
I agree that the fact we are aligning AI should make one more optimistic. Could you define what you mean by “unaligned AI”? It seems quite plausible that I will agree with your position, and think it amounts to something like “we were pretty successful with alignment”.
I agree with theses like “it tentatively appears that the normative value of alignment work is very uncertain, and plausibly approximately neutral, from a total utilitarian perspective”, and would go further and say that alignment work is plausibly negative from a total utilitarian perspective.
I disagree with the implied theses in statements like “I’m not very sympathetic to pausing or slowing down AI as a policy proposal.”
If you wrote a post that just said “look, we’re super uncertain about things, here’s your reminder that there are worlds in which alignment work is negative”, I’d be on board with it. But it feels like a motte-and-bailey to write a post that is clearly trying to cause the reader to feel a particular way about some policy, and then retreat to “well my main thesis was very weak and unobjectionable”.
Some more minor comments:
Well, I can believe it’s weak in some absolute sense. My claim is that it’s much stronger than all of the arguments you make put together.
This is a pretty good example of something I’d call different! You even use the adjective “inhumanly”!
To the extent your argument is that this is strong evidence that the AIs will continue to be altruistic and kind, I think I disagree, though I’ve now learned that you are imagining lots of alignment work happening when making the unaligned AIs, so maybe I’d agree depending on the specific scenario you’re imagining.
Sorry, I was being sloppy there. My actual claim is that your arguments either:
Don’t seem to bear on the question of whether AIs are more utilitarian than humans, OR
Don’t seem more compelling than the reversed versions of those arguments.
I agree that there’s a positive reason to expect AIs to have a higher density of moral value per unit of matter. I don’t see how this has any (predictable) bearing on whether AIs will be more utilitarian than humans.
Applying the reversal test:
Humans have utilitarian intuitions too, and it seems very plausible that AIs would not share (or share fewer of) these intuitions.
I don’t especially see why one of these is stronger than the other.
(And if the AI doesn’t share any of the utilitarian intuitions, it doesn’t matter at all if it also doesn’t share the anti-utilitarian intuitions; either way it still won’t be a utilitarian.)
Applying the reversal test:
AIs might care less about reproduction than humans (a large majority of whom will reproduce at least once in their life).
Personally I find the reversed version more compelling.
Fwiw my reasoning here mostly doesn’t depend on facts about humans other than binary questions like “do humans ever display property X”, since by and large my argument is “there is quite a strong chance that unaligned AIs do not have property X at all”.
Though again this might change depending on what exactly you mean by “unaligned AI”.
(I don’t necessarily disagree with your hypotheses as applied to the broader world—they sound plausible, though it feels somewhat in conflict with the fact that EAs care about AI consciousness a decent bit—I just disagree with them as applied to me in this particular comment thread.)
I don’t buy it. The “real world data points” procedure here seems to be: take two high-level concepts (e.g. affective empathy, proclivity towards utilitarianism), draw a line between them, extrapolate way way out of distribution. I think this procedure would have a terrible track record when applied without the benefit of hindsight.
I expect my arguments based on intuitions would also have a pretty bad track record, but I do think they’d outperform the procedure above.
Yup, this is an unfortunate fact about domains where you don’t get useful real world data. That doesn’t mean you should start using useless real world data.
Yes, but I think the relevance is mostly whether or not the being feels pleasure or pain at all, rather than the magnitude with which it feels it. (Probably the magnitude matters somewhat, but not very much.)
Among humans I would weakly predict the opposite effect, that people with less pleasure-pain salience are more likely to be utilitarian (mostly due to a predicted anticorrelation with logical thinking / decoupling / systemizing nature).
Just a quick reply (I might reply more in-depth later but this is possibly the most important point):
In my post I talked about the “default” alternative to doing lots of alignment research. Do you think that if AI alignment researchers quit tomorrow, engineers would stop doing RLHF etc. to their models? That they wouldn’t train their AIs to exhibit human-like behaviors, or to be human-compatible?
It’s possible my language was misleading by giving an image of what unaligned AI looks like that isn’t actually a realistic “default” in any scenario. But when I talk about unaligned AI, I’m simply talking about AI that doesn’t share the preferences of humans (either its creator or the user). Crucially, humans are routinely misaligned in this sense. For example, employees don’t share the exact preferences of their employer (otherwise they’d have no need for a significant wage). Yet employees are still typically docile, human-compatible, and assimilated to the overall culture.
This is largely the picture I think we should imagine when we think about the “default” unaligned alternative, rather than imagining that humans will create something far more alien, far less docile, and therefore something with far less economic value.
(As an aside, I thought this distinction wasn’t worth making because I thought most readers would have already strongly internalized the idea that RLHF isn’t “real alignment work”. I suspect I was mistaken, and probably confused a ton of people.)
This overlooks my arguments in section 3, which were absolutely critical to forming my opinion here. My argument here can be summarized as follows:
The utilitarian arguments for technical alignment research seem weak, because AIs are likely to be conscious like us, and also share human moral concepts.
By contrast, technical alignment research seems clearly valuable if you care about humans who currently exist, since AIs will presumably be directly aligned to them.
However, pausing AI for alignment reasons seems pretty bad for humans who currently exist (under plausible models of the tradeoff).
I have sympathies to both utilitarianism and the view that current humans matter. The weak considerations favoring pausing AI on the utilitarian side don’t outweigh the relatively much stronger and clearer arguments against pausing for currently existing humans.
The last bullet point is a statement about my values. It is not a thesis independent of my values. I feel this was pretty explicit in the post.
I’m not just saying “there are worlds in which alignment work is negative”. I’m saying that it’s fairly plausible. I’d say greater than 30% probability. Maybe higher than 40%. This seems perfectly sufficient to establish the position, which I argued explicitly, that the alternative position is “fairly weak”.
It would be different if I was saying “look out, there’s a 10% chance you could be wrong”. I’d agree that claim would be way less interesting.
I don’t think what I said resembles a motte-and-bailey, and I suspect you just misunderstood me.
[ETA:
Part of me feels like this statement is an acknowledgement that you fundamentally agree with me. You think the argument in favor of unaligned AIs being less utilitarian than humans is weak? Wasn’t that my thesis? If you started at a prior of 50%, and then moved to 65% because of a weak argument, and then moved back to 60% because of my argument, then isn’t that completely consistent with essentially every single thing I said? OK, you felt I was saying the probability is like 50%. But 60% really isn’t far off, and it’s consistent with what I wrote (I mentioned “weak reasons” in the post). Perhaps like 80% of the reason why you disagree here is because you think my thesis was something else.
More generally I get the sense that you keep misinterpreting me as saying things that are different or stronger than what I intended. That’s reasonable given that this is a complicated and extremely nuanced topic. I’ve tried to express areas of agreement when possible, both in the post and in reply to you. But maybe you have background reasons to expect me to argue a very strong thesis about utilitarianism. As a personal statement, I’d encourage you to try to read me as saying something closer to the literal meaning of what I’m saying, rather than trying to infer what I actually believe underneath the surface.]
I have lots of other disagreements with the rest of what you wrote, although I probably won’t get around to addressing them. I mostly think we just disagree on some basic intuitions about how alien-like default unaligned AIs will actually be in the relevant senses. I also disagree with your reversal tests, because I think they’re not actually symmetric, and I think you’re omitting the best arguments for thinking that they’re asymmetric.
This, in addition to the comment I previously wrote, will have to suffice as my reply.
I was always thinking about (1), since that seems like the relevant thing. When I agreed with you that generational value drift seems worrying, that’s because it seems bad by (1). I did not mean to imply that I should act to maximize (2). I agree that if you want to act to maximize (2) then you should probably focus on preserving the current generation.
Fwiw, I reread the post again and still failed to find this idea in it, and am still pretty confused at what argument you are trying to make.
At this point I think we’re clearly failing to communicate with each other, so I’m probably going to bow out, sorry.
I’m baffled by your statement here. What did you think I was arguing when I discussed whether “aligned AIs are more likely to have a preference for creating new conscious entities, furthering utilitarian objectives”? The conclusion of that section was that aligned AIs are plausibly not more likely to have such a preference, and therefore, human utilitarian preferences here are not “unusually high compared to other possibilities” (the relevant alternative possibility here being unaligned AI).
This was a central part of my post that I discussed at length. The idea that unaligned AIs might be similarly utilitarian or even more so, compared to humans, was a crucial part of my argument. If indeed unaligned AIs are very likely to be less utilitarian than humans, then much of my argument in the first section collapses, which I explicitly acknowledged.
I consider your statement here to be a valuable data point about how clear my writing was and how likely I am to get my ideas across to others who read the post. That said, I believe I discussed this point more-or-less thoroughly.
ETA: Claude 3’s summary of this argument in my post:
I agree it’s clear that you claim that unaligned AIs are plausibly comparably utilitarian as humans, maybe more.
What I didn’t find was discussion of how contingent utilitarianism is in humans.
Though actually rereading your comment (which I should have done in addition to reading the post) I realize I completely misunderstood what you meant by “contingent”, which explains why I didn’t find it in the post (I thought of it as meaning “historically contingent”). Sorry for the misunderstanding.
Let me backtrack like 5 comments and retry again.
If I had to pick a second consideration I’d go with:
After millions of years of life (or much more) and massive amounts of cognitive enhancement, the way post-humans might act isn’t clearly well predicted by just looking at their current behavior.
Again, I’d like to stress that my claim is:
One additional meta-level point which I think is important: I think that existing writeups of why human control would have more moral value than unaligned AI control from a longtermist perspective are relatively weak and often specific writeups are highly flawed. (For some discussion of flaws, see this sequence.)
I just think that this write-up misses what seem to me to be key considerations, I’m not claiming that existing work settles the question or is even robust at all.
And it’s somewhat surprising and embarrassing that this is the state of the current work, given that longtermism is reasonably common and arguments for working on AI x-risk from a longtermist perspective are also common.
I feel I did consider this argument in detail, including several considerations that touch on the arguments you gave. However, I primarily wanted to survey the main points that people have previously given me, rather than focusing heavily on a small set of arguments that someone like you might consider to be the strongest ones. And I agree that I may have missed some important considerations in this post.
In regards to your specific points, I generally find your arguments underspecified because, while reading them, it is difficult for me to identify a concrete mechanism for why alignment with human preferences creates astronomically more value from a total utilitarian perspective relative to the alternative. As it is, you seem to have a lot of confidence that human values, upon reflection, would converge onto values that would be far better in expectation than the alternative. However, I’m not a moral realist, and by comparison to you, I think I don’t have much faith in the value of moral reflection, absent additional arguments.
My speculative guess is that part of this argument comes from simply defining “human preferences” as aligned with utilitarian objectives. For example, you seem to think that aligning AIs would help empower the fraction of humans who are utilitarians, or at least would become utilitarians on reflection. But as I argued in the post, the vast majority of humans are not total utilitarians, and indeed, anti-total utilitarian moral intuitions are quite common among humans, which would act against the creation of large amounts of utilitarian value in an aligned scenario.
These are my general thoughts on what you wrote, although I admit I have not responded in detail to any of your specific arguments, and I think you did reveal a genuine blindspot in the arguments I gave. I may write a comment at some future point that considers your comment more thoroughly.
I’m assuming some level of moral quasi-realism: I care about what I would think is good after reflecting on the situation for a long time and becoming much smarter.
For more on this perspective consider: this post by Holden. I think there is a bunch of other discussion elsewhere from Paul Christiano and Joe Carlsmith, but I can’t find posts immediately.
I think the case for being a moral quasi-realist is very strong and depends on very few claims.
Not exactly, I’m just defining “the good” as something like “what I would think was good after following a good reflection process which doesn’t go off the rails in an intuitive sense”. (Aka moral quasi-realism.)
I’m not certain that after reflection I would end up at something which is that well described as utilitarian. Something vaguely in the ball park seems plausible though.
A reasonable fraction of my view is that many of the moral intuitions of humans might mostly be biases which end up not being that important if people decide to thoughtfully reflect. I predict that humans converge more after reflection and becoming much, much smarter. I don’t know exactly what humans converge towards, but it seems likely that I converge toward a cluster which benefits from copious amounts of resources and which has reasonable support among the things which humans think on reflection.
Depending on the structure of this meta-ethical view, I feel like you should be relatively happy to let unaligned AIs do the reflection for you in many plausible circumstances. The intuition here is that if you are happy to defer your reflection to other humans, such as future humans who will replace us in the future, then you should potentially also be open to deferring your reflection to a large range of potential other beings, including AIs who might initially not share human preferences, but would converge to the same ethical views that we’d converge to.
In other words, in contrast to a hardcore moral anti-realist (such as myself) who doesn’t value moral reflection much, you seem happier to defer this reflection process to beings who don’t share your consumption or current ethical preferences. But you seem to think it’s OK to defer to humans but not unaligned AIs, implicitly drawing a moral distinction on the basis of species. Whereas I’m concerned that if I die and get replaced by either humans or AIs, my goals will not be furthered, including in the very long-run.
What is it about the human species exactly that makes you happy to defer your values to other members of that species?
I think I have a difficult time fully understanding your view because I think it’s a little underspecified. In my view, there seem to be a vast number of different ways that one can “reflect”, and intuitively I don’t think all (or even most) of these processes will converge to roughly the same place. Can you give me intuitions for why you hold this meta-ethical view? Perhaps you can also be more precise about what you see as the central claims of moral quasi-realism.
I’m certainly happy if we get to the same place. I feel less good about the view the more contingent it is.
I mean, I certainly think you lose some value when it’s other humans. My guess is that, from my perspective, handing things off to other humans loses more like 5-20x of the value rather than 1000x, and that the corresponding loss for unaligned AI is more like 20-100x.
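To spell those multipliers out with illustrative numbers (normalizing the value of a future shaped by my own preferences-on-reflection to 1):

\[
V_{\text{me}} = 1, \qquad
V_{\text{other humans}} \approx \tfrac{1}{20} \text{ to } \tfrac{1}{5}, \qquad
V_{\text{unaligned AI}} \approx \tfrac{1}{100} \text{ to } \tfrac{1}{20}
\]

That is, other humans retain roughly 5-20% of the value and unaligned AI roughly 1-5%, as opposed to the ~0.1% that a 1000x loss would imply.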
I think my views about what I converge to are distinct from my views on quasi-realism. I think a weak notion of quasi-realism is extremely intuitive: you would do better things if you thought more about what would be good (at least relative to the current returns; eventually the returns to thinking would saturate), because, e.g., there are interesting empirical facts (where did my current biases come from evolutionarily? what are brains doing?). I’m not claiming that quasi-realism implies my conclusions, just that it’s an important part of where I’m coming from.
I separately think that reflection and getting smarter are likely to cause convergence, due to a variety of broad intuitions and some vague historical analysis. I’m not hugely confident in this, but I’m confident enough to think the expected value looks pretty juicy.
Thanks for writing this.
I disagree with quite a few points in the total utilitarianism section, but zooming out slightly, I think that total utilitarians should generally still support alignment work (and potentially an AI pause/slow down) to preserve option value. If it turns out that AIs are moral patients and that it would be good for them to spread into the universe optimising for values that don’t look particularly human, we can still (in principle) do that. This is compatible with thinking that alignment from a total utilitarian perspective is ~neutral—but it’s not clear that you agree with this from the post.
I think the problem with this framing is that it privileges a particular way of thinking about option value that prioritizes the values of the human species in a way I find arbitrary.
In my opinion, the choice before the current generation is not whether to delay replacement by a different form of life, but rather to choose our method of replacement: we can either die from old age over decades and be replaced by the next generation of humans, or we can develop advanced AI and risk being replaced by them, but also potentially live much longer and empower our current generation’s values.
Deciding to delay AI is not a neutral choice. It only really looks like we’re preserving option value in the first case if you think there’s something great about the values of the human species. But then if you think that the human species is special, I think these arguments are adequately considered in the first and second sections of my post.
Hmm, maybe I’ll try to clarify what I think you’re arguing as I predict it will be confusing to caleb and bystanders. The way I would have put this is:
It only preserves option value from your perspective to the extent that you think humanity overall[1] will have a similar perspective to yours and will make reasonable choices. Matthew seems to think that humanity will use ~all of the resources on (directly worthless?) economic consumption, such that the main source of value (from a longtermist, scope-sensitive, utilitarian-ish perspective) will be from the minds of the laborers that produce the goods for this consumption. Thus, there isn’t any option value, as almost all the action is coming from indirect value rather than from people trying to produce value.
I disagree strongly with Matthew on this view about where the value will come from in expectation insofar as that is an accurate interpretation. (I elaborate on why in this comment.) I’m not certain about this being a correct interpretation of Matthew’s views, but it at least seems heavily implied by:
Really, whoever controls resources under worlds where “humanity” keeps control.
I agree with your first sentence as a summary of my view.
The second sentence is also roughly accurate [ETA: see comment below for why I am no longer endorsing this], but I do not consider it to be a complete summary of the argument I gave in the post. I gave additional reasons for thinking that the values of the human species are not special from a total utilitarian perspective. This included the point that humans are largely not utilitarians, and in fact frequently have intuitions that would act against the recommendations of utilitarianism if their preferences were empowered. I elaborated substantially on this point in the post.

On second thought, regarding the second sentence, I think I want to take back my endorsement. I don’t necessarily think the main source of value will come from the minds of AIs who labor, although I find this idea plausible depending on the exact scenario. I don’t really think I have a strong opinion about this question, and I didn’t see my argument as resting on it. And so I’d really prefer it not be seen as part of my argument (and I did not generally try to argue this in the post).
Really, my main point was that I don’t actually see much of a difference between AI consumption and human consumption, from a utilitarian perspective. Yet, when thinking about what has moral value in the world, I think focusing on consumption in both cases is generally correct. This includes considerations related to incidental utility that comes as a byproduct from consumption, but the “incidental” part here is not a core part of what I’m arguing.
>I think the problem with this framing is that it privileges a particular way of thinking about option value that prioritizes the values of the human species in a way I find arbitrary.
I think it’s in the same category as “don’t do crime for utilitarian reasons”. Like, if you don’t see that (trans-)humans are preferable, you are at odds with lots of people who do see it (and, like, with me personally). Not moustache-twirling levels of villainy, but you know… you need to be careful with this stuff. You probably don’t want to be the part of EA that is literally plotting the downfall of human civilization.
I feel like this goes against the principle of not leaving your footprint on the future, no?
Like, a large part of what I believe to be the danger with AI is that we don’t have any reflective framework for morality. I also don’t believe the standard path to AGI is one of moral reflection. To me this says that we would be leaving the value of the future up to market dynamics, and that doesn’t seem good, given all the traps in such a situation (Moloch, for example).
If we want a shot at a long reflection or similar, I don’t think full sending AGI is the best thing to do.
A major reason that I got into longtermism in the first place is that I’m quite interested in “leaving a footprint” on the future (albeit a good one). In other words, I’m not sure I understand the intuition for why we wouldn’t deliberately try to leave our footprints on the future, if we want to have an impact. But perhaps I’m misunderstanding the nature of this metaphor. Can you elaborate?
I think it’s worth being more specific about why you think AGI will not do moral reflection. In the post, I carefully consider arguments about whether future AIs will be alien-like and have morally arbitrary goals, in the way you seem to be imagining. I think it’s possible that I addressed some of the intuitions behind your argument there.
I guess I felt that a lot of the post was arguing under a frame of utilitarianism, which is generally fair, I think. When it comes to “not leaving a footprint on the future”, what I’m referring to is epistemic humility about the correct moral theories. I’m quite uncertain myself about what is correct when it comes to morality, with extra weight on utilitarianism. Given this, we should be worried about being wrong and therefore try our best not to lock in whatever we’re currently thinking. (The classic example being that if we had done this 200 years ago, we might still have slavery in the future.)
I’m a believer that virtue ethics and deontology are imperfect-information approximations of utilitarianism. Kant’s categorical imperative, for example, is a way of looking at the long-term future and asking: how do we optimise society to be the best that it can be?
I guess a core crux here for me is that it seems like you’re arguing a bit for naive utilitarianism. I don’t really believe that the AGI will follow the VNM axioms, that is, be fully rational. I think it will be an internal dynamic system weighing the different things that it wants, and that it won’t fully maximise utility because it won’t be internally aligned. Therefore we need to get it right, or we’re going to have weird and idiosyncratic values that are not optimal for the long-term future of the world.
I hope that makes sense, I liked your post in general.
The “footprints on the future” thing could be referencing this post.
(Edit: to be clear, this link is not an endorsement.)
I see. After briefly skimming that post, I think I pretty strongly disagree with just about every major point in it (along with many of its empirical background assumptions), although admittedly I did not spend much time reading through it. If someone thinks that post provides good reasons to doubt the arguments in my post, I’d likely be happy to discuss the specific ideas within it in more detail.
Yes, I was on my phone, and you can’t link things there easily; that was what I was referring to.
Executive summary: From a total utilitarian perspective, the value of AI alignment work is unclear and plausibly neutral, while from a human preservationist or near-termist view, alignment is clearly valuable but significantly delaying AI is more questionable.
Key points:
Unaligned AIs may be just as likely to be conscious and create moral value as aligned AIs, so alignment work is not clearly valuable from a total utilitarian view.
Human moral preferences are a mix of utilitarian and anti-utilitarian intuitions, so empowering them may not be better than an unaligned AI scenario by utilitarian lights.
From a human preservationist view, alignment is clearly valuable since it would help ensure human survival, but this view rests on speciesist foundations.
A near-termist view focused on benefits to people alive today would value alignment but not significantly delaying AI, since that could deprive people of potentially massive gains in wealth and longevity.
Arguments for delaying AI to reduce existential risk often conflate the risk of human extinction with the risk of human replacement by AIs, which are distinct from a utilitarian perspective.
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.
Great post, Matthew! Misaligned AI not being clearly bad is one of the reasons why I have been moving away from AI safety to animal welfare as the most promising cause area. In my mind, advanced AI would ideally be aligned with expected total hedonistic utilitarianism.
Thank you for writing this. I broadly agree with the perspective and find it frustrating how often it’s dismissed based on (what seem to me) somewhat-shaky assumptions.
A few thoughts, mainly on the section on total utilitarianism:
1. Regarding why people tend to assume unaligned AIs won’t innately have any value, or won’t be conscious: my impression is this is largely due to the “intelligence as optimisation process” model that Eliezer advanced. Specifically, that in this model, the key ability humans have that enables us to be so successful is our ability to optimise for goals; whereas mind features we like, such as consciousness, joy, curiosity, friendship, and so on are largely seen as being outside this optimisation ability, and are instead the terminal values we optimise for. (Also that none of the technology we have so far built has really affected this core optimisation ability, so once we do finally build an artificial optimiser it could very well quickly become much more powerful than us, since unlike us it might be able to improve its optimisation ability.)
I think people who buy this model will tend not to be moved much by observations like consciousness having evolved multiple times, as they’d think: sure, but why should I expect that consciousness is part of the optimisation process bit of our minds, specifically? Ditto for other mind features, and also for predictions that AIs will be far more varied than humans — there just isn’t much scope for variety or detail in the process of doing optimisation. You use the phrase “AI civilisation” a few times; my sense is that most people who expect disaster from unaligned AI would say their vision of this outcome is not well-described as a “civilisation” at all.
2. I agree with you that if the above model is wrong (which I expect it is), and AIs really will be conscious, varied, and form a civilisation rather than being a unified unconscious optimiser, then there is some reason to think their consumption will amount to something like “conscious preference satisfaction”, since a big split between how they function when producing vs consuming seems unlikely (even though it’s logically possible).
I’m a bit surprised, though, by your focus (as you’ve elaborated on in the comments) on consumption rather than production. For one thing, I’d expect production to amount to a far greater fraction of AIs’ experience-time than consumption, I guess on the basis that production enables more subsequent production (or consumption), whereas consumption doesn’t; it just burns resources.
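To make that intuition a bit more concrete, here is a toy sketch with purely illustrative assumptions (the function and numbers below are hypothetical, not anything from the post): suppose an AI economy reinvests a fraction s of each period’s compute into further production and spends the rest on consumption. Then the share of total compute (a rough proxy for experience-time) spent on production works out to s, however fast the economy grows, so production dominating experience-time is roughly the claim that reinvestment rates stay high.

```python
# Toy model: constant reinvestment fraction s, growth proportional to reinvestment.
# Cumulative production vs consumption compute ends up in the ratio s : (1 - s).

def cumulative_split(s: float, growth: float, periods: int) -> tuple[float, float]:
    """Return (total compute spent producing, total compute spent consuming)."""
    resources = 1.0
    produced = consumed = 0.0
    for _ in range(periods):
        produced += s * resources          # compute devoted to production this period
        consumed += (1 - s) * resources    # compute devoted to consumption this period
        resources *= 1 + growth * s        # reinvestment drives compounding growth
    return produced, consumed

if __name__ == "__main__":
    for s in (0.5, 0.9, 0.99):
        p, c = cumulative_split(s, growth=0.5, periods=50)
        print(f"reinvestment {s:.0%}: production share of total compute = {p / (p + c):.1%}")
```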
Also, you mentioned concerns about factory farms and wild animal suffering. These seem to me describable as “experiences during production” — do you not have similar concerns regarding AIs’ productive activities? Admittedly pain might not be very useful for AIs, as plausibly if you’re smart enough to see the effects on your survival of different actions, then you don’t need such a crude motivator — even humans trying very hard to achieve goals seem to mostly avoid pain while doing so, rather than using it to motivate themselves. But emotions like fear and stress seem to me plausibly useful for smart minds, and I’d not be surprised if they were common in an AI civilisation in a world where the “intelligence as optimisation process” model is not true. Do you disagree, or do you just think they won’t spend much time producing relative to consuming, or something else?
(To be clear, I agree this second concern has very little relation to what’s usually termed “AI alignment”, but it’s the concern re: an AI future that I find most convincing, and I’m curious on your thoughts on it in the context of the total utilitarian perspective.)
Thanks. I agree with what you have to say about effective altruists dismissing this perspective based on what seem to be shaky assumptions. To be a bit blunt, I generally find that, while effective altruists are often open to many types of criticism, the community is still fairly reluctant to engage deeply with some ideas that challenge their foundational assumptions. This is one of those ideas.
But I’m happy to see this post is receiving net-positive upvotes, despite the disagreement. :)
This is really useful in that it critically examines what I think of as the ‘orthodox view’: alignment is good because it ‘allows humans to preserve control over the future’. This view feels fundamental but underexamined in much of the EA/alignment world (with notable exceptions: Rich Sutton, Robin Hanson, and Joscha Bach, who seem species-agnostic; Paul Christiano has also fleshed out his position, e.g. in this part of a Dwarkesh Patel podcast).
A couple of points I wasn’t sure I understood/agreed with FWIW:
a) A relatively minor one is
I’m not sure about this symmetry: I can imagine an LLM (~GPT-5 class) integrated into a nuclear/military decision-making system that could cause catastrophic death and suffering (millions or billions of immediate and secondary deaths, and a massive technological setback, albeit not literal extinction). I’m assuming the point doesn’t hinge on literal extinction.
b) Regarding calebp’s comment on option value: I agree most option-value discussion (there doesn’t seem to be much outside Bostrom and the s-risk discourse) assumes continuation of the human species, but I wonder if there is room for a more cosmopolitan framing: ‘Humans are our only example of an advanced technological civilisation, one that might be on the verge of a step change in its evolution. The impact of this evolutionary step change on the future can arguably be (on balance) good (definition of “good” tbd). The “option value” we are trying to preserve is less the existence of humans per se, and more the possibility of such an evolution happening at all. Put another way, we don’t want to prematurely introduce an unaligned or misaligned AI (perhaps a weak one) that causes extinction, a bad lock-in, or prevents the emergence of more capable AIs that could have achieved this evolutionary transition.’
In other words, the option value is not over the number of human lives (or economic value) but rather over the possible trajectories of the future. This does not seem particularly species-specific; it just says that we should be careful not to throw these futures away.
c) Point (b) hinges on whether human evolution is ‘good’ in any broad or inclusive sense (outside of letting current and near-current generations live wealthier, longer lives, if indeed those are good things).
In order to answer this, it feels like we need some way of defining value ‘from the point of view of the universe’. That particular phrase is a Sidgwick/Singer thing, and I’m not sure it is directly applicable in this context (like similar phrases, e.g. Nagel’s ‘view from nowhere’), but without it, it is very hard to talk about non-species-based notions of value (standard utilitarianism and deontological/virtue approaches all basically rely on human or animal beings).
My candidate for this ‘cosmic value’ is something like created complexity (which can be physical or not, and can include things that are not obviously economically/militarily/reproductively valuable like art). This includes having trillions of diverse computing entities (human or otherwise).
This is obviously pretty hand-wavey, but I’d be interested in talking to anyone with views (it’s basically my PhD :-)