Matthew_Barnett comments on Analyzing the moral value of unaligned AIs

Matthew_Barnett 9 Apr 2024 16:42 UTC
4 points
1 ∶ 0

Both seems negligible relative to the expected amount of compute spent on optimized goodness in my view.

Both will presumably be forms of consumption, which could be in the form of compute spent on optimized goodness. You seem to think compute will only be used for optimized goodness for non-consumption purposes (which is why you care about the small fraction of resources spent on altruism) and I’m saying I don’t see a strong case for that.
- Ryan Greenblatt 9 Apr 2024 17:10 UTC
  1 point
  0 ∶ 0
  Parent
  why you care about the small fraction of resources spent on altruism
  I’m also not sold it’s that small.
  Regardless, doesn’t seem like we’re making progress here.
  - Matthew_Barnett 9 Apr 2024 17:19 UTC
    3 points
    0 ∶ 0
    Parent
    
    Regardless, doesn’t seem like we’re making progress here.
    
    You have no obligation to reply, of course, but I think we’d achieve more progress if you clarified your argument in a concise format that explicitly outlines the assumptions and conclusion.
    
    As far as I can gather, your argument seems to be a mix of assumptions about humans being more likely to optimize for goodness (why?), partly because they’re more inclined to reflect (why?), which will lead them to allocate more resources towards altruism rather than selfish consumption (why is that significant?). Without understanding how your argument connects to mine, it’s challenging to move forward on resolving our mutual disagreement.
    - Rohin Shah 29 Apr 2024 7:35 UTC
      9 points
      2 ∶ 0
      Parent
      Fwiw I had a similar reaction as Ryan.
      My framing would be: it seems pretty wild to think that total utilitarian values would be better served by unaligned AIs (whose values we don’t know) rather than humans (where we know some are total utilitarians). In your taxonomy this would be “humans are more likely to optimize for goodness”.
      Let’s make a toy model compatible with your position:
      A short summary of my position is that unaligned AIs could be even more utilitarian than humans are, and this doesn’t seem particularly unlikely either given that (1) humans are largely not utilitarians themselves, (2) consciousness doesn’t seem special or rare, so it’s likely that unaligned AIs could care about it too, and (3) unaligned AIs will be trained on human data, so they’ll likely share our high-level concepts about morality even if not our exact preferences.
      Let’s say that there are a million values that one could have with “humanity’s high-level concepts about morality”, one of which is “Rohin’s values”.
      For (3), we’ll say that both unaligned AI values and human values are a subset sampled uniformly at random from these million values (all values in the subset weighted equally, for simplicity).
      For (1), we’ll say that the sampled human values include “Rohin’s values”, but only as one element in the set of sampled human values.
      I won’t make any special distinction about consciousness so (2) won’t matter.
      In this toy model you’d expect aligned AI to put 1/1,000 weight on “Rohin’s values”, whereas unaligned AI puts 1/1,000,000 weight in expectation on “Rohin’s values” (if the unaligned AI has S values, then there’s an S/1,000,000 probability of it containing “Rohin’s values”, and it is weighted 1/S if present). So aligned AI looks a lot better.
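The arithmetic in the toy model above can be checked with a short sketch. This is only an illustration under stated assumptions: the function names are hypothetical, and the human subset size of 1,000 is not stated in the comment but is inferred from the 1/1,000 figure. The point it shows is that the expected weight under resampling is (S/N) × (1/S) = 1/N, whatever the subset size S is.

```python
from fractions import Fraction

N_VALUES = 1_000_000       # size of the space of possible values (from the toy model)
HUMAN_SUBSET_SIZE = 1_000  # ASSUMPTION: inferred from the 1/1,000 weight in the comment


def weight_if_preserved(subset_size: int) -> Fraction:
    """Weight on one specific value known to be in an equally weighted subset."""
    return Fraction(1, subset_size)


def expected_weight_if_resampled(n_values: int, subset_size: int) -> Fraction:
    """Expected weight on one specific value when the subset is drawn uniformly at random."""
    p_present = Fraction(subset_size, n_values)   # chance the value is sampled at all
    weight_if_present = Fraction(1, subset_size)  # equal weighting within the subset
    return p_present * weight_if_present          # S cancels, leaving 1 / n_values


aligned = weight_if_preserved(HUMAN_SUBSET_SIZE)

# Try several hypothetical subset sizes S for the unaligned AI.
unaligned_by_size = {
    s: expected_weight_if_resampled(N_VALUES, s) for s in (10, 1_000, 100_000)
}

ratio = aligned / unaligned_by_size[1_000]

for s, w in unaligned_by_size.items():
    print(f"S = {s:>7}: expected weight = {w}")
print(f"aligned weight = {aligned}, ratio aligned/unaligned = {ratio}")
```

Under these assumptions the gap between the two cases depends only on the ratio of the full value space to the human subset, not on how many values the unaligned AI ends up with.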
      More generally, ceteris paribus, keeping values intact prevents drift and so looks strongly positive from the point of view of the original values, relative to resampling values “from scratch”.
      (Feel free to replace “Rohin’s values” with “utilitarianism” if you want to make the utilitarianism version of this argument.)
      Imo basically everything that Ryan says in this comment thread is a countercounterargument to a counterargument to this basic argument. E.g. someone might say “oh it doesn’t matter which values you’re optimizing for, all of the value is in the subjective experience of the AIs that are laboring to build new chips, not in the consumption of the new chips” and the rebuttal to that is “Value can be extremely dense in computation relative to the density of value from AIs used for economic activity (instead of value).”
      - Matthew_Barnett 29 Apr 2024 9:00 UTC
        4 points
        0 ∶ 0
        Parent
        
        My framing would be: it seems pretty wild to think that total utilitarian values would be better served by unaligned AIs (whose values we don’t know) rather than humans (where we know some are total utilitarians).
        
        I’m curious: Does your reaction here similarly apply to ordinary generational replacement as well?
        
        Let me try to explain what I’m asking.
        
        We have a set of humans who exist right now. We know that some of them are utilitarians. At least one of them shares “Rohin’s values”. Similar to unaligned AIs, we don’t know the values of the next generation of humans, although presumably they will continue to share our high-level moral concepts since they are human and will be raised in our culture. After the current generation of humans die, the next generation could have different moral values.
        
        As far as I can tell, the situation with regard to the next generation of humans is analogous to unaligned AI in the basic sense I’ve just laid out (mirroring the part of your comment I quoted). So, in light of that, would you similarly say that it’s “pretty wild to think that total utilitarian values would be better served by a future generation of humans”?
        
        One possible answer here: “I’m not very worried about generational replacement causing moral values to get worse since the next generation will still be human.” But if this is your answer, then you seem to be positing that our moral values are genetic and innate, rather than cultural, which is pretty bold, and presumably merits a defense. This position is IMO largely empirically ungrounded, although it depends on what you mean by “moral values”.
        
        Another possible answer is: “No, I’m not worried about generational replacement because we’ve seen a lot of human generations already and we have lots of empirical data on how values change over time with humans. AI could be completely different.” This would be a reasonable response, but as a matter of empirical fact, utilitarianism did not really exist culturally 500 or 1000 years ago. This indicates that it’s plausibly quite fragile, much as it might be with AI. Of course, values drift more slowly with ordinary generational replacement than with AI, but the phenomenon still seems roughly similar. So perhaps you should care about ordinary value drift almost as much as you’d care about unaligned AIs.
        
        If you do worry about generational value drift in the strong sense I’ve just described, I’d argue this should cause you to largely adopt something close to position (3) that I outlined in the post, i.e. the view that what matters is preserving the lives and preferences of people who currently exist (rather than the species of biological humans in the abstract).
        Rohin Shah 29 Apr 2024 22:10 UTC
        6 points
        1 ∶ 0
        Parent
        To the extent that future generations would have pretty different values than me, like “the only glory is in war and it is your duty to enslave your foes”, along with the ability to enact their values on the reachable universe, in fact that would seem pretty bad to me.
        However, I expect the correlation between my values and future generation values is higher than the correlation between my values and unaligned AI values, because I share a lot more background with future humans than with unaligned AI. (This doesn’t require values to be innate; values can be adaptive for many human cultures but not for AI cultures.) So I would be less worried about generational value drift (but not completely unworried).
        In addition, this worry is tempered even more by the possibility that values / culture will be set much more deliberately in the nearish future, rather than via culture, simply because with an intelligence explosion that becomes more possible to do than it is today.
        If you do worry about generational value drift in the strong sense I’ve just described, I’d argue this should cause you to largely adopt something close to position (3) that I outlined in the post, i.e. the view that what matters is preserving the lives and preferences of people who currently exist (rather than the species of biological humans in the abstract).
        Huh? I feel very confused about this, even if we grant the premise. Isn’t the primary implication of the premise to try to prevent generational value drift? Why am I only prioritizing people with similar values, instead of prioritizing all people who aren’t going to enact large-scale change? Why would the priority be on current people, instead of people with similar values (there are lots of future people who have more similar values to me than many current people)?
        Matthew_Barnett 29 Apr 2024 23:11 UTC
        2 points
        0 ∶ 0
        Parent
        I expect the correlation between my values and future generation values is higher than the correlation between my values and unaligned AI values, because I share a lot more background with future humans than with unaligned AI.
        To clarify, I think it’s a reasonable heuristic that, if you want to preserve the values of the present generation, you should try to minimize changes to the world and enforce some sort of stasis. This could include not building AI. However, I believe you may be glossing over the distinction between: (1) the values currently held by existing humans, and (2) a more cosmopolitan, utilitarian ethical value system.
        We can imagine a wide variety of changes to the world that would result in vast changes to (1) without necessarily being bad according to (2). For example:
        We could start doing genetic engineering of humans.
        We could upload humans onto computers.
        A human-level, but conscious, alien species could immigrate to Earth via a portal.
        In each scenario, I agree with your intuition that “the correlation between my values and future humans is higher than the correlation between my values and X-values, because I share much more background with future humans than with X”, where X represents the forces at play in each scenario. However, I don’t think it’s clear that the resulting change to the world would be net negative from the perspective of an impartial, non-speciesist utilitarian framework.
        In other words, while you’re introducing something less similar to us than future human generations in each scenario, it’s far from obvious whether the outcome will be relatively worse according to utilitarianism.
        Based on your toy model, my guess is that your underlying intuition is something like, “The fact that a tiny fraction of humans are utilitarian is contingent. If we re-rolled the dice, and sampled from the space of all possible human values again (i.e., the set of values consistent with high-level human moral concepts), it’s very likely that <<1% of the world would be utilitarian, rather than the current (say) 1%.”
        If this captures your view, my main response is that it seems to assume a much narrower and more fragile conception of “cosmopolitan utilitarian values” than the version I envision, and it’s not a moral perspective I currently find compelling.
        Conversely, if you’re imagining a highly contingent, fragile form of utilitarianism that regards the world as far worse under a wide range of changes, then I’d argue we also shouldn’t expect future humans to robustly hold such values. This makes it harder to claim the problem of value drift is much worse for AI compared to other forms of drift, since both are simply ways the state of the world could change, which was the point of my previous comment.
        I feel very confused about this, even if we grant the premise. Isn’t the primary implication of the premise to try to prevent generational value drift? Why am I only prioritizing people with similar values, instead of prioritizing all people who aren’t going to enact large-scale change?
        I’m not sure I understand which part of the idea you’re confused about. The idea was simply:
        Let’s say that your view is that generational value drift is very risky, because future generations could have much worse values than the ones you care about (relative to the current generation)
        In that case, you should try to do what you can to stop generational value drift
        One way of stopping generational value drift is to try to prevent the current generation of humans from dying, and/or having their preferences die out
        This would look quite similar to the moral view in which you’re trying to protect the current generation of humans, which was the third moral view I discussed in the post.
        Why would the priority be on current people, instead of people with similar values (there are lots of future people who have more similar values to me than many current people)?
        The reason the priority would be on current people rather than those with similar values is that, by assumption, future generations will have different values due to value drift. Therefore, the ~best strategy to preserve current values would be to preserve existing people. This seems relatively straightforward to me, although one could certainly question the premise of the argument itself.
        Let me know if any part of the simplified argument I’ve given remains unclear or confusing.
        Rohin Shah 30 Apr 2024 6:13 UTC
        4 points
        0 ∶ 0
        Parent
        Based on your toy model, my guess is that your underlying intuition is something like, “The fact that a tiny fraction of humans are utilitarian is contingent. If we re-rolled the dice, and sampled from the space of all possible human values again (i.e., the set of values consistent with high-level human moral concepts), it’s very likely that <<1% of the world would be utilitarian, rather than the current (say) 1%.”
        No, this was purely to show why, from the perspective of someone with values, re-rolling those values would seem bad, as opposed to keeping the values the same, all else equal. In any specific scenario, (a) all else won’t be equal, and (b) the actual amount of worry depends on the correlation between current values and re-rolled values.
        The main reason I made utilitarianism a contingent aspect of human values in the toy model is because I thought that’s what you were arguing (e.g. when you say things like “humans are largely not utilitarians themselves”). I don’t have a strong view on this and I don’t think it really matters for the positions I take.
        For example:
        We could start doing genetic engineering of humans.
        We could upload humans onto computers.
        A human-level, but conscious, alien species could immigrate to Earth via a portal.
        The first two seem broadly fine, because I still expect high correlation between values. (Partly because I think that cosmopolitan utilitarian-ish values aren’t fragile.)
        The last one seems more worrying than human-level unaligned AI (more because we have less control over them) but less worrying than unaligned AI in general (since the aliens aren’t superintelligent).
        Note I’ve barely thought about these scenarios, so I could easily imagine changing my mind significantly on these takes. (Though I’d be surprised if it got to the point where I thought it was comparable to unaligned AI, in how much the values could stop correlating with mine.)
        One way of stopping generational value drift is to try to prevent the current generation of humans from dying, and/or having their preferences die out
        It seems way better to simply try to spread your values? It’d be pretty wild if the EA field-builders said “the best way to build EA, taking into account the long-term future, is to prevent the current generation of humans from dying, because their preferences are most similar to ours”.
        Matthew_Barnett 30 Apr 2024 7:00 UTC
        2 points
        0 ∶ 0
        Parent
        The main reason I made utilitarianism a contingent aspect of human values in the toy model is because I thought that’s what you were arguing (e.g. when you say things like “humans are largely not utilitarians themselves”).
        I think there may have been a misunderstanding regarding the main point I was trying to convey. In my post, I fairly explicitly argued that the rough level of utilitarian values exhibited by humans is likely not very contingent, in the sense of being unusually high compared to other possibilities—and this was a crucial element of my thesis. This idea was particularly important for the section discussing whether unaligned AIs will be more or less utilitarian than humans.
        When you quoted me saying “humans are largely not utilitarians themselves,” I intended this point to support the idea that our current rough level of utilitarianism is not contingent, rather than the opposite claim. In other words, I meant that the fact that humans are not highly utilitarian suggests that this level of utilitarianism is not unusual or contingent upon specific circumstances, and we might expect other intelligent beings, such as aliens or AIs, to exhibit similar, or even greater, levels of utilitarianism.
        Compare to the hypothetical argument: humans aren’t very obsessed with building pyramids --> our current level of obsession with pyramid building is probably not unusual, in the sense that you might easily expect aliens/AIs to be similarly obsessed with building pyramids, or perhaps even more obsessed.
        (This argument is analogous because pyramids are simple structures that lots of different civilizations would likely stumble upon. Similarly, I think “try to create lots of good conscious experiences” is also a fairly simple directive, if indeed aliens/AIs/whatever are actually conscious themselves.)
        I don’t have a strong view on this and I don’t think it really matters for the positions I take.
        I think the question of whether utilitarianism is contingent or not matters significantly for our disagreement, particularly if you are challenging my post or the thesis I presented in the first section. If you are very uncertain about whether utilitarianism is contingent in the sense that is relevant to this discussion, then I believe that aligns with one of the main points I made in that section of my post.
        Specifically, I argued that the degree to which utilitarianism is contingent vs. common among a wide range of intelligent beings is highly uncertain and unclear, and this uncertainty is an important consideration when thinking about the values and behaviors of advanced AI systems from a utilitarian perspective. So, if you are expressing strong uncertainty on this matter, that seems to support one of my central claims in that part of the post.
        (My view, as expressed in the post, is that unaligned AIs have highly unclear utilitarian value but there’s a plausible scenario where they are roughly net-neutral, and indeed I think there’s a plausible scenario where they are even more valuable than humans, from a utilitarian point of view.)
        It seems way better to simply try to spread your values? It’d be pretty wild if the EA field-builders said “the best way to build EA, taking into account the long-term future, is to prevent the current generation of humans from dying, because their preferences are most similar to ours”.
        I think this part of your comment plausibly confuses two separate points:
        (1) How to best further your own values
        (2) How to best further the values of the current generation.
        I was arguing that trying to preserve the present generation of humans looks good according to (2), not (1). That said, to the extent that your values simply mirror the values of your generation, I don’t understand your argument for why trying to spread your values would be “way better” than trying to preserve the current generation. Perhaps you can elaborate?
        Rohin Shah 30 Apr 2024 21:20 UTC
        4 points
        0 ∶ 0
        Parent
        Given my new understanding of the meaning of “contingent” here, I’d say my claims are:
        I’m unsure about how contingent the development of utilitarianism in humans was. It seems quite plausible that it was not very historically contingent. I agree my toy model does not accurately capture my views on the contingency of total utilitarianism.
        I’m also unsure how contingent it is for unaligned AI, but aggregating over my uncertainty suggests more contingent.
        One way to think about this is to ask: why are any humans utilitarians? To the extent it’s for reasons that don’t apply to unaligned AI systems, I think you should feel like it is less likely for unaligned AI systems to be utilitarians. So e.g. if I thought humans were utilitarians primarily because it is simple to express in concepts that humans and AIs share, then I would agree with you. But in fact I feel like it is pretty important that humans feel pleasure and pain, and have empathy, to explain why some humans are utilitarians. (Mostly I think the “true explanation” will have to appeal to more than simplicity, and the additional features this “true explanation” will appeal to are very likely to differ between humans and AIs.)
        Compare to the hypothetical argument: humans aren’t very obsessed with building pyramids --> our current level of obsession with pyramid building is probably not unusual, in the sense that you might easily expect aliens/AIs to be similarly obsessed with building pyramids, or perhaps even more obsessed.
        Indeed I feel like AIs probably build fewer pyramids in expectation, for basically the same reason. (The concrete hypothesis I generated for why humans build pyramids was “maybe pyramids were especially easy to build historically”.)
        General note: I want to note that my focus on AI alignment is not necessarily coming from a utilitarian perspective. I work on AI alignment because in expectation I think a world with aligned AI will better reflect “my values” (which I’m uncertain about and may not reflect utilitarianism) than a world with unaligned AI. But I’m happy to continue talking about the implications for utilitarians.
        Matthew_Barnett 30 Apr 2024 22:23 UTC
        4 points
        0 ∶ 0
        Parent
        So e.g. if I thought humans were utilitarians primarily because it is simple to express in concepts that humans and AIs share, then I would agree with you. But in fact I feel like it is pretty important that humans feel pleasure and pain, and have empathy, to explain why some humans are utilitarians. (Mostly I think the “true explanation” will have to appeal to more than simplicity, and the additional features this “true explanation” will appeal to are very likely to differ between humans and AIs.)
        Thanks for trying to better understand my views. I appreciate you clearly stating your reasoning in this comment, as it makes it easier for me to directly address your points and explain where I disagree.
        You argued that feeling pleasure and pain, as well as having empathy, are important factors in explaining why some humans are utilitarians. You suggest that to the extent these reasons for being utilitarian don’t apply to unaligned AIs, we should expect it to be less likely for them to be utilitarians compared to humans.
        However, a key part of the first section of my original post was about whether unaligned AIs are likely to be conscious—which for the purpose of this discussion, seems roughly equivalent to whether they will feel pleasure and pain. I concluded that unaligned AIs are likely to be conscious for several reasons:
        Consciousness seems to be a fairly convergent function of intelligence, as evidenced by the fact that octopuses are widely accepted to be conscious despite sharing almost no homologous neural structures with humans. This suggests consciousness arises somewhat robustly in sufficiently sophisticated cognitive systems.
        Leading theories of consciousness from philosophy and cognitive science don’t appear to predict that consciousness will be rare or unique to biological organisms. Instead, they tend to define consciousness in terms of information processing properties that AIs could plausibly share.
        Unaligned AIs will likely be trained in environments quite similar to those that gave rise to human and animal consciousness—for instance, they will be trained on human cultural data and, in the case of robots, will interact with physical environments. The evolutionary and developmental pressures that gave rise to consciousness in biological organisms would thus plausibly apply to AIs as well.
        So in short, I believe unaligned AIs are likely to feel pleasure and pain, for roughly the reasons I think humans and animals do. Their consciousness would not be an improbable or fragile outcome, but more likely a robust product of being a highly sophisticated intelligent agent trained in environments similar to our own.
        I did not directly address whether unaligned AIs would have empathy, though I find this fairly likely as well. At the very least, I expect they would have cognitive empathy—the ability to model and predict the experiences of others—as this is clearly instrumentally useful. They may lack affective empathy, i.e. the ability to share the emotions of others, which I agree could be important here. But it’s notable that explicit utilitarianism seems, anecdotally, to be more common among people on the autism spectrum, who are characterized as having reduced affective empathy. This suggests affective empathy may not be strongly predictive of utilitarian motivations.
        Let’s say you concede the above points and say: “OK I concede that unaligned AIs might be conscious. But that’s not at all assured. Unaligned AIs might only be 70% likely to be conscious, whereas I’m 100% certain that humans are conscious. So there’s still a huge gap between the expected value of unaligned AIs vs. humans under total utilitarianism, in a way that overwhelmingly favors humans.”
        However, this line of argument would overlook the real possibility that unaligned AIs could be more conscious than humans, or have an even stronger tendency towards utilitarian motivations. This could be the case if, for instance, AIs are more cognitively sophisticated than humans or are more efficiently designed in a morally relevant sense. Given that the vast majority of humans do not seem to be highly motivated by utilitarian considerations, it doesn’t seem like an unlikely possibility that AIs could exceed our utilitarian inclinations. Nor does it seem particularly unlikely that their minds could have a higher density of moral value per unit of energy, or matter.
        We could similarly examine this argument in the context of considering other potential large changes to the world, such as creating human emulations, genetically engineered humans, or bringing back Neanderthals from extinction. In each case, I do not think the (presumably small) probability that the entities we are adding to the world are not conscious constitutes a knockdown argument against the idea that they would add comparable utilitarian value to the world compared to humans. The main reason is that these entities could be even better by utilitarian lights than humans are.
        Indeed I feel like AIs probably build fewer pyramids in expectation, for basically the same reason. (The concrete hypothesis I generated for why humans build pyramids was “maybe pyramids were especially easy to build historically”.)
        This seems minor, but I think the relevant claim is whether AIs would build more pyramids going forward, compared to humans, rather than comparing to historical levels of pyramid construction among humans. If pyramids were easy to build historically, but this fact is no longer relevant, then that seems true now for both humans and AIs, into the foreseeable future. As a consequence it’s hard for me to see a strong reason for preferring humans over AIs if you cared about pyramid-maximization. By essentially the same arguments I gave above about utilitarianism, I don’t think there’s a strong argument for thinking that aligning AIs is good from the perspective of pyramid maximization.
        General note: I want to note that my focus on AI alignment is not necessarily coming from a utilitarian perspective. I work on AI alignment because in expectation I think a world with aligned AI will better reflect “my values”
        This makes sense to me, but it’s hard to say much about what’s good from the perspective of your values if I don’t know what those values are. I focused on total utilitarianism in the post because it’s probably the most influential moral theory in EA, and it’s the explicit theory used in Nick Bostrom’s influential article Astronomical Waste, and this post was partly intended as a reply to that article (see the last few paragraphs of the post).
        Rohin Shah 1 May 2024 8:04 UTC
        6 points
        2 ∶ 0
        Parent
        This suggests affective empathy may not be strongly predictive of utilitarian motivations.
        I can believe that if the population you are trying to predict for is just humans, almost all of whom have at least some affective empathy. But I’d feel pretty surprised if this were true in whatever distribution over unaligned AIs we’re imagining. In particular, I think if there’s no particular reason to expect affective empathy in unaligned AIs, then your prior on it being present should be near-zero (simply because there are lots of specific claims about unaligned AIs that are about that complicated, most of which will be false). And I’d be surprised if “zero vs non-zero affective empathy” was not predictive of utilitarian motivations.
        I definitely agree that AIs might feel pleasure and pain, though I’m less confident in it than you seem to be. It just seems like AI cognition could be very different from human cognition. For example, I would guess that pain/pleasure are important for learning in humans, but it seems like this is probably not true for AI systems in the current paradigm. (For gradient descent, the learning and the cognition happen separately—the AI cognition doesn’t even get the loss/reward equivalent as an input so cannot “experience” it. For in-context learning, it seems very unclear what the pain/pleasure equivalent would be.)
        this line of argument would overlook the real possibility that unaligned AIs could [...] have an even stronger tendency towards utilitarian motivations.
        I agree this is possible. But ultimately I’m not seeing any particularly strong reasons to expect this (and I feel like your arguments are mostly saying “nothing rules it out”). Whereas I do think there’s a strong reason to expect weaker tendencies: AIs will be different, and on average different implies fewer properties that humans have. So aggregating these I end up concluding that unaligned AIs will be less utilitarian in expectation.
        (You make a bunch of arguments for why AIs might not be as different as we expect. I agree that if you haven’t thought about those arguments before you should probably reduce your expectation of how different AIs will be. But I still think they will be quite different.)
        this line of argument would overlook the real possibility that unaligned AIs could be more conscious than humans,
        I don’t see why it matters if AIs are more conscious than humans? I thought the relevant question we’re debating is whether they are more likely to be utilitarians. Maybe the argument is that if they are more conscious-in-the-sense-of-feeling-pleasure-and-pain they are more likely to be utilitarians? If so I might buy that but feel like it’s a weak effect.
        As a consequence it’s hard for me to see a strong reason for preferring humans over AIs if you cared about pyramid-maximization.
        Sure, but a big difference is that no human cares about pyramid-maximization, whereas some humans are utilitarians?
        (Maybe some humans do care about pyramid-maximization? I’d need to learn more about those humans before I could have any guess about whether to prefer humans over AIs.)
        Consciousness seems to be a fairly convergent function of intelligence
        I would say “fairly convergent function of biologically evolved intelligence”. Evolution faced lots of constraints we don’t have in AI design. For example, cognition and learning had to be colocated in space and time (i.e. done in a single brain), whereas for AIs these can be (and are) separated. Seems very plausible that consciousness-in-the-sense-of-feeling-pleasure-and-pain is a solution needed under the former constraint but not the latter. (Maybe I’m at 20% chance that something in this vicinity is right, though that is a very made-up number.)
        Matthew_Barnett 1 May 2024 19:49 UTC
        2 points
        0 ∶ 2
        Parent
        Here are a few (long, but high-level) comments I have before responding to a few specific points that I still disagree with:
        I agree there are some weak reasons to think that humans are likely to be more utilitarian on average than unaligned AIs, for basically the reasons you talk about in your comment (I won’t express individual agreement with all the points you gave that I agree with, but you should know that I agree with many of them).
        
        However, I do not yet see any strong reasons supporting your view. (The main argument seems to be: AIs will be different than us. You label this argument as strong but I think it is weak.) More generally, I think that if you’re making hugely consequential decisions on the basis of relatively weak intuitions (which is what I believe many effective altruists do in this context), you should be very cautious. The lack of robust evidence for your position seems sufficient, in my opinion, for the main thesis of my original post to hold. (I think I was pretty careful in my language not to overstate the main claims.)
        I suspect you may have an intuition that unaligned AIs will be very alien-like in certain crucial respects, but I predict this intuition will ultimately prove to be mistaken. In contrast, I think the fact that these AIs will be trained on human-generated data and deliberately shaped by humans to fulfill human-like functions and to be human-compatible should be given substantial weight. These factors make it quite likely, in my view, that the resulting AI systems will exhibit utilitarian tendencies to a significant degree, even if they do not share the preferences of either their users or their creators (for instance, I would guess that GPT-4 is already more utilitarian than the average human, in a meaningful sense).
        
        There is a strong selection pressure for AIs to display outward behaviors that are not overly alien-like. Indeed, the pressure seems to be for AIs to be inhumanly altruistic and kind in their actions. I am not persuaded by the idea that it’s probable for AIs to be entirely human-compatible on the surface while being completely alien underneath, even if we assume they do not share human preferences (e.g., the “shoggoth” meme).
        I disagree with the characterization that my argument relies primarily on the notion that “you can’t rule out” the possibility of AIs being even more utilitarian than humans. In my previous comment, I pointed out that AIs could potentially have a higher density of moral value per unit of matter, and I believe there are straightforward reasons to expect this to be the case, as AIs could be optimized very efficiently in terms of physical space. This is not merely a “you can’t rule it out” type of argument, in my view.
        
        Similarly, in the post, I pointed out that humans have many anti-utilitarian intuitions and it seems very plausible that AIs would not share (or share fewer of) these intuitions. To give another example (although it was not prominent in the post), in a footnote I alluded to the idea that AIs might care more about reproduction than humans (who by comparison, seem to want to have small population sizes with high per-capita incomes, rather than large population sizes with low per capita incomes as utilitarianism would recommend). This too does not seem like a mere “you cannot rule it out” argument to me, although I agree it is not the type of knockdown argument you’d expect if my thesis were stated way stronger than it actually was.
        I think you may be giving humans too much credit for being slightly utilitarian. To the extent that there are indeed many humans who are genuinely obsessed with actively furthering utilitarian objectives, I agree that your argument would have more force. However, I think that this is not really what we actually observe in the real world to a large degree. I think it’s exaggerated at least; even within EA I think that’s somewhat rare.
        I suspect there is a broader phenomenon at play here, whereby people (often those in the EA community) attribute a wide range of positive qualities to humans (such as the idea that our values converge upon reflection, or the idea that humans will get inherently kinder as they get wealthier) which, in my opinion, do not actually reflect the realities of the world we live in. These ideas seem (to me) to be routinely almost entirely disconnected from any empirical analysis of actual human behavior, and they sometimes appear to be more closely related to what the person making the claim wishes to be true in some kind of idealized, abstract sense (though I admit this sounds highly uncharitable).
        
        My hypothesis is that this tendency can perhaps be explained by a deeply ingrained intuition that identifies the species boundary of “humans” as being very special, in the sense that virtually all moral value is seen as originating from within this boundary, sharply distinguishing it from anything outside this boundary, and leading to an inherent suspicion of non-human entities. This would explain, for example, why there is so much focus on “human values” (and comparatively little on drawing the relevant “X values” boundary along different lines), and why many people seem to believe that human emulations would be clearly preferable to de novo AI. I do not really share this intuition myself.
        I can believe that if the population you are trying to predict for is just humans, almost all of whom have at least some affective empathy. But I’d feel pretty surprised if this were true in whatever distribution over unaligned AIs we’re imagining.
        My basic thoughts here are: on the one hand we have real world data points which can perhaps relevantly inform the degree to which affective empathy actually predicts utilitarianism, and on the other hand we have an intuition that it should be predictive across beings of very different types. I think the real world data points should epistemically count for more than the intuitions? More generally, I think it is hard to argue about what might be true if real world data counts for less than intuitions.
        Maybe the argument is that if they are more conscious-in-the-sense-of-feeling-pleasure-and-pain they are more likely to be utilitarians? If so I might buy that but feel like it’s a weak effect.
        Isn’t this the effect you alluded to, when you named reasons why some humans are utilitarians?
        Rohin Shah 2 May 2024 9:02 UTC
        2 points
        0 ∶ 0
        Parent
        In contrast, I think the fact that these AIs will be trained on human-generated data and deliberately shaped by humans to fulfill human-like functions and to be human-compatible should be given substantial weight.
        … This seems to be saying that because we are aligning AI, they will be more utilitarian. But I thought we were discussing unaligned AI?
        I agree that the fact we are aligning AI should make one more optimistic. Could you define what you mean by “unaligned AI”? It seems quite plausible that I will agree with your position, and think it amounts to something like “we were pretty successful with alignment”.
        The lack of robust evidence for your position seems sufficient, in my opinion, for the main thesis of my original post to hold.
        I agree with theses like “it tentatively appears that the normative value of alignment work is very uncertain, and plausibly approximately neutral, from a total utilitarian perspective”, and would go further and say that alignment work is plausibly negative from a total utilitarian perspective.
        I disagree with the implied theses in statements like “I’m not very sympathetic to pausing or slowing down AI as a policy proposal.”
        If you wrote a post that just said “look, we’re super uncertain about things, here’s your reminder that there are worlds in which alignment work is negative”, I’d be on board with it. But it feels like a motte-and-bailey to write a post that is clearly trying to cause the reader to feel a particular way about some policy, and then retreat to “well my main thesis was very weak and unobjectionable”.
        Some more minor comments:
        You label this argument as strong but I think it is weak
        Well, I can believe it’s weak in some absolute sense. My claim is that it’s much stronger than all of the arguments you make put together.
        There is a strong selection pressure for AIs to display outward behaviors that are not overly alien-like. Indeed, the pressure seems to be for AIs to be inhumanly altruistic and kind in their actions.
        This is a pretty good example of something I’d call different! You even use the adjective “inhumanly”!
        To the extent your argument is that this is strong evidence that the AIs will continue to be altruistic and kind, I think I disagree, though I’ve now learned that you are imagining lots of alignment work happening when making the unaligned AIs, so maybe I’d agree depending on the specific scenario you’re imagining.
        I disagree with the characterization that my argument relies primarily on the notion that “you can’t rule out” the possibility of AIs being even more utilitarian than humans.
        Sorry, I was being sloppy there. My actual claim is that your arguments either:
        Don’t seem to bear on the question of whether AIs are more utilitarian than humans, OR
        Don’t seem more compelling than the reversed versions of those arguments.
        I pointed out that AIs could potentially have a higher density of moral value per unit of matter, and I believe there are straightforward reasons to expect this to be the case, as AIs could be optimized very efficiently in terms of physical space. This is not merely a “you can’t rule it out” type of argument, in my view.
        I agree that there’s a positive reason to expect AIs to have a higher density of moral value per unit of matter. I don’t see how this has any (predictable) bearing on whether AIs will be more utilitarian than humans.
        Similarly, in the post, I pointed out that humans have many anti-utilitarian intuitions and it seems very plausible that AIs would not share (or share fewer of) these intuitions.
        Applying the reversal test:
        Humans have utilitarian intuitions too, and it seems very plausible that AIs would not share (or share fewer of) these intuitions.
        I don’t especially see why one of these is stronger than the other.
        (And if the AI doesn’t share any of the utilitarian intuitions, it doesn’t matter at all if it also doesn’t share the anti-utilitarian intuitions; either way it still won’t be a utilitarian.)
        To give another example [...] AIs might care more about reproduction than humans (who by comparison, seem to want to have small population sizes with high per-capita incomes, rather than large population sizes with low per capita incomes as utilitarianism would recommend)
        Applying the reversal test:
        AIs might care less about reproduction than humans (a large majority of whom will reproduce at least once in their life).
        Personally I find the reversed version more compelling.
        I think you may be giving humans too much credit for being slightly utilitarian. [...] people (often those in the EA community) attribute a wide range of positive qualities to humans [...]
        Fwiw my reasoning here mostly doesn’t depend on facts about humans other than binary questions like “do humans ever display property X”, since by and large my argument is “there is quite a strong chance that unaligned AIs do not have property X at all”.
        Though again this might change depending on what exactly you mean by “unaligned AI”.
        (I don’t necessarily disagree with your hypotheses as applied to the broader world—they sound plausible, though it feels somewhat in conflict with the fact that EAs care about AI consciousness a decent bit—I just disagree with them as applied to me in this particular comment thread.)
        I think the real world data points should epistemically count for more than the intuitions?
        I don’t buy it. The “real world data points” procedure here seems to be: take two high-level concepts (e.g. affective empathy, proclivity towards utilitarianism), draw a line between them, extrapolate way way out of distribution. I think this procedure would have a terrible track record when applied without the benefit of hindsight.
        I expect my arguments based on intuitions would also have a pretty bad track record, but I do think they’d outperform the procedure above.
        More generally, I think it is hard to argue about what might be true if real world data counts for less than intuitions.
        Yup, this is an unfortunate fact about domains where you don’t get useful real world data. That doesn’t mean you should start using useless real world data.
        Isn’t this the effect you alluded to, when you named reasons why some humans are utilitarians?
        Yes, but I think the relevance is mostly whether or not the being feels pleasure or pain at all, rather than the magnitude with which it feels it. (Probably the magnitude matters somewhat, but not very much.)
        Among humans I would weakly predict the opposite effect, that people with less pleasure-pain salience are more likely to be utilitarian (mostly due to a predicted anticorrelation with logical thinking / decoupling / systemizing nature).
        Matthew_Barnett 2 May 2024 16:07 UTC
        4 points
        0 ∶ 0
        Parent
        Just a quick reply (I might reply more in-depth later but this is possibly the most important point):
        
        I agree that the fact we are aligning AI should make one more optimistic. Could you define what you mean by “unaligned AI”? It seems quite plausible that I will agree with your position, and think it amounts to something like “we were pretty successful with alignment”.
        
        In my post I talked about the “default” alternative to doing lots of alignment research. Do you think that if AI alignment researchers quit tomorrow, engineers would stop doing RLHF etc. to their models? That they wouldn’t train their AIs to exhibit human-like behaviors, or to be human-compatible?
        
        It’s possible my language was misleading by giving an image of what unaligned AI looks like that isn’t actually a realistic “default” in any scenario. But when I talk about unaligned AI, I’m simply talking about AI that doesn’t share the preferences of humans (either its creator or the user). Crucially, humans are routinely misaligned in this sense. For example, employees don’t share the exact preferences of their employer (otherwise they’d have no need for a significant wage). Yet employees are still typically docile, human-compatible, and assimilated to the overall culture.
        
        This is largely the picture I think we should imagine when we think about the “default” unaligned alternative, rather than imagining that humans will create something far more alien, far less docile, and therefore something with far less economic value.
        
        (As an aside, I thought this distinction wasn’t worth making because I thought most readers would have already strongly internalized the idea that RLHF isn’t “real alignment work”. I suspect I was mistaken, and probably confused a ton of people.)
        Matthew_Barnett 2 May 2024 16:34 UTC
        2 points
        0 ∶ 1
        Parent
        I disagree with the implied theses in statements like “I’m not very sympathetic to pausing or slowing down AI as a policy proposal.”
        This overlooks my arguments in section 3, which were absolutely critical to forming my opinion here. My argument here can be summarized as follows:
        The utilitarian arguments for technical alignment research seem weak, because AIs are likely to be conscious like us, and also share human moral concepts.
        By contrast, technical alignment research seems clearly valuable if you care about humans who currently exist, since AIs will presumably be directly aligned to them.
        However, pausing AI for alignment reasons seems pretty bad for humans who currently exist (under plausible models of the tradeoff).
        I have sympathies to both utilitarianism and the view that current humans matter. The weak considerations favoring pausing AI on the utilitarian side don’t outweigh the relatively much stronger and clearer arguments against pausing for currently existing humans.
        The last bullet point is a statement about my values. It is not a thesis independently of my values. I feel this was pretty explicit in the post.
        If you wrote a post that just said “look, we’re super uncertain about things, here’s your reminder that there are worlds in which alignment work is negative”, I’d be on board with it. But it feels like a motte-and-bailey to write a post that is clearly trying to cause the reader to feel a particular way about some policy, and then retreat to “well my main thesis was very weak and unobjectionable”.
        I’m not just saying “there are worlds in which alignment work is negative”. I’m saying that it’s fairly plausible. I’d say greater than 30% probability. Maybe higher than 40%. This seems perfectly sufficient to establish the position, which I argued explicitly, that the alternative position is “fairly weak”.
        It would be different if I was saying “look out, there’s a 10% chance you could be wrong”. I’d agree that claim would be way less interesting.
        I don’t think what I said resembles a motte-and-bailey, and I suspect you just misunderstood me.
        [ETA:
        Well, I can believe it’s weak in some absolute sense. My claim is that it’s much stronger than all of the arguments you make put together.
        Part of me feels like this statement is an acknowledgement that you fundamentally agree with me. You think the argument in favor of unaligned AIs being less utilitarian than humans is weak? Wasn’t that my thesis? If you started at a prior of 50%, and then moved to 65% because of a weak argument, and then moved back to 60% because of my argument, then isn’t that completely consistent with essentially every single thing I said? OK, you felt I was saying the probability is like 50%. But 60% really isn’t far off, and it’s consistent with what I wrote (I mentioned “weak reasons” in the post). Perhaps like 80% of the reason why you disagree here is because you think my thesis was something else.
        More generally I get the sense that you keep misinterpreting me as saying things that are different or stronger than what I intended. That’s reasonable given that this is a complicated and extremely nuanced topic. I’ve tried to express areas of agreement when possible, both in the post and in reply to you. But maybe you have background reasons to expect me to argue a very strong thesis about utilitarianism. As a personal statement, I’d encourage you to try to read me as saying something closer to the literal meaning of what I’m saying, rather than trying to infer what I actually believe underneath the surface.]
        I have lots of other disagreements with the rest of what you wrote, although I probably won’t get around to addressing them. I mostly think we just disagree on some basic intuitions about how alien-like default unaligned AIs will actually be in the relevant senses. I also disagree with your reversal tests, because I think they’re not actually symmetric, and I think you’re omitting the best arguments for thinking that they’re asymmetric.
        This, in addition to the comment I previously wrote, will have to suffice as my reply.
        Rohin Shah 30 Apr 2024 19:29 UTC
        2 points
        0 ∶ 0
        Parent
        I was arguing that trying to preserve the present generation of humans looks good according to (2), not (1).
        I was always thinking about (1), since that seems like the relevant thing. When I agreed with you that generational value drift seems worrying, that’s because it seems bad by (1). I did not mean to imply that I should act to maximize (2). I agree that if you want to act to maximize (2) then you should probably focus on preserving the current generation.
        In my post, I fairly explicitly argued that the rough level of utilitarian values exhibited by humans is likely not very contingent, in the sense of being unusually high compared to other possibilities—and this was a crucial element of my thesis. This idea was particularly important for the section discussing whether unaligned AIs will be more or less utilitarian than humans.
        Fwiw, I reread the post again and still failed to find this idea in it, and am still pretty confused at what argument you are trying to make.
        At this point I think we’re clearly failing to communicate with each other, so I’m probably going to bow out, sorry.
        Matthew_Barnett 30 Apr 2024 19:41 UTC
        2 points
        0 ∶ 0
        Parent
        Fwiw, I reread the post again and still failed to find this idea in it
        I’m baffled by your statement here. What did you think I was arguing when I discussed whether “aligned AIs are more likely to have a preference for creating new conscious entities, furthering utilitarian objectives”? The conclusion of that section was that aligned AIs are plausibly not more likely to have such a preference, and therefore, human utilitarian preferences here are not “unusually high compared to other possibilities” (the relevant alternative possibility here being unaligned AI).
        This was a central part of my post that I discussed at length. The idea that unaligned AIs might be similarly utilitarian or even more so, compared to humans, was a crucial part of my argument. If indeed unaligned AIs are very likely to be less utilitarian than humans, then much of my argument in the first section collapses, which I explicitly acknowledged.
        I consider your statement here to be a valuable data point about how clear my writing was and how likely I am to get my ideas across to others who read the post. That said, I believe I discussed this point more-or-less thoroughly.
        ETA: Claude 3′s summary of this argument in my post:
        The post argued that the level of utilitarian values exhibited by humans is likely not unusually high compared to other possibilities, such as those of unaligned AIs. This argument was made in the context of discussing whether aligned AIs are more likely to have a preference for creating new conscious entities, thereby furthering utilitarian objectives.
        The author presented several points to support this argument:
        Only a small fraction of humans are total utilitarians, and most humans do not regularly express strong preferences for adding new conscious entities to the universe.
        Some human moral intuitions directly conflict with utilitarian recommendations, such as the preference for habitat preservation over intervention to improve wild animal welfare.
        Unaligned AI preferences are unlikely to be completely alien or random compared to human preferences if the AIs are trained on human data. By sharing moral concepts with humans, unaligned AIs could potentially be more utilitarian than humans, given that human moral preferences are a mix of utilitarian and anti-utilitarian intuitions.
        Even in an aligned AI scenario, the consciousness of AIs will likely be determined mainly by economic efficiency factors during production, rather than by moral considerations.
        The author concluded that these points undermine the idea that unaligned AI moral preferences will be clearly less utilitarian than the moral preferences of most humans, which are already not very utilitarian. This suggests that the level of utilitarian values exhibited by humans is likely not unusually high compared to other possibilities, such as those of unaligned AIs.
        Rohin Shah 30 Apr 2024 20:02 UTC
        2 points
        0 ∶ 0
        Parent
        I agree it’s clear that you claim that unaligned AIs are plausibly comparably utilitarian as humans, maybe more.
        What I didn’t find was discussion of how contingent utilitarianism is in humans.
        Though actually rereading your comment (which I should have done in addition to reading the post) I realize I completely misunderstood what you meant by “contingent”, which explains why I didn’t find it in the post (I thought of it as meaning “historically contingent”). Sorry for the misunderstanding.
        Let me backtrack like 5 comments and try again.