I think the go-ahead-point for longtermists probably looks like 0.1% to 0.01% reduction in p(doom) per year, but this might depend on how optimistic you are about other aspects of society.
To be clear, my argument would be that the go-ahead-point for longtermists likely looks much higher, like a 10% total risk of catastrophe. Actually that’s not exactly how I’d frame it, since what matters more is how much we can reduce the risk of catastrophe by delaying, not just the total risk of a catastrophe. But I’d likely consider a world where we delay AI until the total risk falls below 0.1% to be intolerable from several perspectives.
I guess one way of putting my point here is that you probably think of “human disempowerment” as a terminal state that is astronomically bad, and probably far worse than “all currently existing humans die”. But I don’t really agree with this. Human disempowerment just means that the species homo sapiens is disempowered, and I don’t see why we should draw the relevant moral boundary around our species. We can imagine other boundaries like “our current cultural and moral values”, which I think would drift dramatically over time even if the human species remained.
I’m just not really attached to the general frame here. I don’t identify much with “human values” in the abstract as opposed to other salient characteristics of intelligent beings. I think standard EA framing around “humans” is simply bad in an important way relevant to these arguments (and this includes most attempts I’ve seen to broaden the standard arguments to remove references to humans). Even when an EA insists their concern isn’t about the human species per se I typically end up disagreeing on some other fundamental point here that seems like roughly the same thing I’m pointing at. Unfortunately, I consistently have trouble conveying this point to people, so I’m not likely to be understood here unless I give a very thorough argument.
I suspect it’s a bit like the arguments vegans have with non-vegans about whether animals are OK to eat because they’re “not human”. There’s a conceptual leap from “I care a lot about humans” to “I don’t necessarily care a lot about the human species boundary” that people don’t reliably find intuitive except perhaps after a lot of reflection. Most ordinary instances of arguments between vegans and non-vegans are not going to lead to people successfully crossing this conceptual gap. It’s just a counterintuitive concept for most people.
Perhaps as a brief example to help illustrate my point, it seems very plausible to me that I would identify more strongly with a smart behavioral LLM clone of me trained on my personal data compared to how much I’d identify with the human species. This includes imperfections in the behavioral clone arising from failures to perfectly generalize from my data (though excluding extreme cases like the entity not generalizing any significant behavioral properties at all). Even if this clone were not aligned with humanity in the strong sense often meant by EAs, I would not obviously consider it bad to give this behavioral clone power, even at the expense of empowering “real humans”.
On top of all of this, I think I disagree with your argument about discount rates, since I think you’re ignoring the case for high discount rates based on epistemic uncertainty, rather than pure time preferences.
Also, another important clarification is that my views are probably quite different from those of the median EA who identifies as longtermist. So I’d be careful not to pattern-match me.
(And I prefer not to identify with movements, so I’d say that I’m not an EA.)
I think you misunderstood the points I was making. Sorry for writing an insufficiently clear comment.
Actually that’s not exactly how I’d frame it, since what matters more is how much we can reduce the risk of catastrophe by delaying, not just the total risk of a catastrophe.
Agreed that’s why I wrote “0.1% to 0.01% reduction in p(doom) per year”. I wasn’t talking about the absolute level of doom here. I edited my comment to say “0.1% to 0.01% reduction in p(doom) per year of delay” which is hopefully more clear. The expected absolute level of doom is probably notably higher than 0.1% to 0.01%.
Human disempowerment just means that the species homo sapiens is disempowered, and I don’t see why we should draw the relevant moral boundary around our species.
I don’t. That’s why I said “Similarly, I would potentially be happier to turn over the universe to aliens instead of AIs.”
Also, note that I think AI takeover is unlikely to lead to extinction.
ETA: I’m pretty low confidence about a bunch of these tricky moral questions.
I would be reasonably happy (e.g. 50-90% of the value relative to human control) to turn the universe over to aliens. The main reduction in value is due to complicated questions about the likely distribution of values of aliens. (E.g., how likely is it that aliens are very sadistic or lack empathy? This is probably still not the exact right question.) I’d also be pretty happy with (e.g.) uplifted dogs (dogs which are made to be as intelligent as humans while keeping the core of “dog”, whatever that means) so long as the uplifting process was reasonable.
I think the exact same questions apply to AIs, I just have empirical beliefs that AIs which end up taking over are likely to do predictably worse things with the cosmic endowment (e.g. 10-30% of the value). This doesn’t have to be true, I can imagine learning facts about AIs which would make me feel a lot better about AI takeover. Note that conditioning on the AI taking over is important here. I expect to feel systematically better about smart AIs with long horizon goals which are either not quite smart enough to take over or don’t take over (for various complicated reasons).
More generally, I think I basically endorse the views here (which discusses the questions of when you should cede power etc.).
Note that in my ideal future it seems really unlikely that we end up spending a non-trivial fraction of future resources running literal humans instead of finding better stuff to spend computational resources on (e.g. beings with experiences that are wildly better than our experiences, or beings which are vastly cheaper to run).
(That said, we can and should let all humans live for as long as they want and dedicate some fraction of resources to basic continuity of human civilization insofar as people want this. 1/10^12 of the resources would easily suffice from my perspective, but I’m sympathetic to making this more like 1/10^3 or 1/10^6.)
Perhaps as a brief example to help illustrate my point, it seems very plausible to me that I would identify more strongly with a smart behavioral LLM clone of me trained on my personal data compared to how much I’d identify with the human species.
I think “identify” is the wrong word from my perspective. The key question is “what would the smart behavioral clone do with the vast amount of future resources”. That said, I’m somewhat sympathetic to the claim that this behavioral clone would do basically reasonable things with future resources. I also feel reasonably optimistic about pure imitation LLM alignment for somewhat similar reasons.
On top of all of this, I think I disagree with your argument about discount rates, since I think you’re ignoring the case for high discount rates based on epistemic uncertainty, rather than pure time preferences.
Am I ignoring this case? I just think we should treat “what do I terminally value”[1] and “what is the best route to achieving that” as mostly separate questions. So, we should talk about whether “high discount rates due to epistemic uncertainty” is a good reasoning heuristic for achieving my terminal values separately from what my terminal values are.
Separately, I think a high per-year discount rate due to epistemic uncertainty seems pretty clearly wrong. I’m pretty confident that I can influence, to at least a small degree (e.g. I can affect the probability by >10^-10, probably much greater), whether or not the moral equivalent of 10^30 people are tortured in 10^6 years. It seems like a very bad idea from my perspective to put literally zero weight on this due to 1% annual discount rates.
For less specific things like “does a civilization descended from and basically endorsed by humans exist in 10^6 years”, I think I have considerable influence. E.g., I can affect the probability by >10^-6 (in expectation). (This influence is distinct from the question of how valuable this is to influence, but we were talking about epistemic uncertainty here.)
My guess is that we end up with basically a moderate fixed discount on very long-run future influence due to uncertainty over how the future will go, but this is more like 10% or 1% than 10^-30. And, because the long-run future still dominates in my views, this just multiplies through all calculations and ends up not mattering much for decision making. (I think acausal trade considerations implicitly mean that I would be willing to trade off long-run considerations in favor of things which look good as weighted by current power structures (e.g. helping homeless children in the US) if I had a 1,000x-10,000x opportunity to do this. E.g., if I could stop 10,000 US children from being homeless with a day of work and couldn’t do direct trade, I would still do this.)

More precisely, what would my CEV (Coherent Extrapolated Volition) want and how do I handle uncertainty about what my CEV would want?
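As a quick check on the arithmetic behind the discount-rate point above, here is a minimal sketch; the 1% annual rate, the 10% and 1% fixed discounts, and the 10^6-year horizon are just the figures already used in this thread:

```python
# Compare a 1% annual discount rate compounded over 10^6 years with a
# one-off fixed discount of 10% or 1% on long-run influence.
import math

annual_rate = 0.01
years = 10**6

# Weight on outcomes 10^6 years out under 1% annual discounting:
# 0.99^(10^6), i.e. roughly 10^-4365, which is effectively literally zero weight.
log10_weight = years * math.log10(1 - annual_rate)
print(f"1% annual discounting: weight is about 10^{log10_weight:.0f}")

# A fixed epistemic discount of 10% or 1% just rescales all long-run terms:
for fixed_discount in (0.10, 0.01):
    print(f"fixed {fixed_discount:.0%} discount: weight = {1 - fixed_discount:.2f}")
```

The point is just that an annual rate compounds to effectively nothing over these horizons, while a fixed epistemic discount only rescales the long-run terms.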
Agreed that’s why I wrote “0.1% to 0.01% reduction in p(doom) per year”. I wasn’t talking about the absolute level of doom here. I edited my comment to say “0.1% to 0.01% reduction in p(doom) per year of delay” which is hopefully more clear
Ah, sorry. I indeed interpreted you as saying that we would reduce p(doom) to 0.01-0.1% per year, rather than saying that each year of delay reduces p(doom) by that amount. I think that view is more reasonable, but I’d still likely put the go-ahead-number higher.
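To pin down the distinction numerically, here is a toy sketch; the 0.05% figure and the 10-year delay are arbitrary illustrative choices, not numbers either of us has endorsed:

```python
# Two readings of "0.1% to 0.01% reduction in p(doom) per year", using an
# arbitrary 0.05% midpoint and a hypothetical 10-year delay.

reduction_per_year_of_delay = 0.0005  # reading B: each year of delay shaves this much off total p(doom)
absolute_annual_doom = 0.0005         # reading A (the misreading): total doom risk per year

years_of_delay = 10

# Reading B: a 10-year delay buys ~0.5 percentage points of total risk reduction.
total_reduction = reduction_per_year_of_delay * years_of_delay

# Reading A: an independent 0.05%/year doom risk accumulates to ~0.5% over the
# same period; a claim about levels, not about the marginal value of delay.
cumulative_risk = 1 - (1 - absolute_annual_doom) ** years_of_delay

print(f"reading B: delay buys a {total_reduction:.2%} absolute reduction in p(doom)")
print(f"reading A: cumulative doom risk over the period is {cumulative_risk:.2%}")
```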
That’s why I said “Similarly, I would potentially be happier to turn over the universe to aliens instead of AIs.”
Apologies again for misinterpreting. I didn’t know how much weight to put on the word “potentially” in your comment. Although note that I said, “Even when an EA insists their concern isn’t about the human species per se I typically end up disagreeing on some other fundamental point here that seems like roughly the same thing I’m pointing at.” I don’t think the problem is literally that EAs are anthropocentric, but I think they often have anthropocentric intuitions that influence these estimates.
Maybe a more accurate summary is that people have a bias towards “evolved” or “biological” beings, which I think might explain why you’d be a little happier to hand over the universe to aliens, or dogs, but not AIs.
I would be reasonably happy (e.g. 50-90% of the value relative to human control) to turn the universe over to aliens. [...]
I think the exact same questions apply to AIs, I just have empirical beliefs that AIs which end up taking over are likely to do predictably worse things with the cosmic endowment (e.g. 10-30% of the value).
I guess I mostly think that’s a pretty bizarre view, with some obvious reasons for doubt, and I don’t know what would be driving it. The process through which aliens would get values like ours seems much less robust than the process through which AIs get our values. AIs are trained on our data, and humans will presumably care a lot about aligning them (at least at first).
From my perspective this is a bit like saying you’d prefer aliens to take over the universe rather than handing control over to our genetically engineered human descendants. I’d be very skeptical of that view too for some basic reasons.
Overall, upon learning your view here, I don’t think I’d necessarily diagnose you as having the intuitions I alluded to in my original comment, but I think there’s likely something underneath your views that I would strongly disagree with, if I understood your views further. I find it highly unlikely that AGIs will be even more “alien” from the perspective of our values than literal aliens (especially if we’re talking about aliens who themselves build their own AIs, genetically engineer themselves, and so on).
If you’re interested in diving into “how bad/good is it to cede the universe to AIs”, I strongly think it’s worth reading and responding to “When is unaligned AI morally valuable?”, which is the current state of the art on the topic (same thing I linked above). (And the same recommendation for onlookers.) I now regret rehashing a bunch of these arguments which I think are mostly made better here. In particular, I think the case for “AIs created in the default way might have low moral value” is reasonably well argued for here:
Many people have a strong intuition that we should be happy for our AI descendants, whatever they choose to do. They grant the possibility of pathological preferences like paperclip-maximization, and agree that turning over the universe to a paperclip-maximizer would be a problem, but don’t believe it’s realistic for an AI to have such uninteresting preferences.
I disagree. I think this intuition comes from analogizing AI to the children we raise, but that it would be just as accurate to compare AI to the corporations we create. Optimists imagine our automated children spreading throughout the universe and doing their weird-AI-analog of art; but it’s just as realistic to imagine automated PepsiCo spreading throughout the universe and doing its weird-AI-analog of maximizing profit.
It might be the case that PepsiCo maximizing profit (or some inscrutable lost-purpose analog of profit) is intrinsically morally valuable. But it’s certainly not obvious.
Or it might be the case that we would never produce an AI like a corporation in order to do useful work. But looking at the world around us today that’s certainly not obvious.
Neither of those analogies is remotely accurate. Whether we should be happy about AI “flourishing” is a really complicated question about AI and about morality, and we can’t resolve it with a one-line political slogan or crude analogy.
I now regret rehashing a bunch of these arguments which I think are mostly made better here.
It’s fine if you don’t want to continue this discussion. I can sympathize if you find it tedious. That said, I don’t really see why you’d appeal to that post in this context (FWIW, I read the post at the time it came out, and just re-read it). I interpret Paul Christiano to mainly be making arguments in the direction of “unaligned AIs might be morally valuable, even if we’d prefer aligned AI” which is what I thought I was broadly arguing for, in contradistinction to your position. I thought you were saying something closer to the opposite of what Paul was arguing for (although you also made several separate points, and I don’t mean to oversimplify your position).
(But I agree with the quoted part of his post that we shouldn’t be happy with AIs doing “whatever they choose to do”. I don’t think I’m perfectly happy with unaligned AI. I’d prefer we try to align AIs, just as Paul Christiano says too.)
Huh, no I almost entirely agree with this post as I noted in my prior comment. I cited this much earlier: “More generally, I think I basically endorse the views here (which discusses the questions of when you should cede power etc.).”
I do think unaligned AI would be morally valuable (I said in an earlier comment that unaligned AIs which take over might capture 10-30% of the value. That’s a lot of value.)
I don’t think I’m perfectly happy with unaligned AI. I’d prefer we try to align AIs, just as Paul Christiano says too.
I think we’ve probably been talking past each other. I thought the whole argument here was “how much value do we lose if (presumably misaligned) AI takes over” and you were arguing for “not much, caring about this seems like overly fixating on humanity” and I was arguing “(presumably misaligned) AIs which take over probably result in substantially less value”. This now seems incorrect and we perhaps only have minor quantitative disagreements?
I think it probably would have helped if you were more quantitative here. Exactly how much of the value?
I thought the whole argument here was “how much value do we lose if (presumably misaligned) AI takes over”
I think the key question here is: compared to what? My position is that we lose a lot of potential value both from delaying AI and from having unaligned AI, but it’s not a crazy high reduction in either case. In other words they’re pretty comparable in terms of lost value.
Ranking the options in rough order (taking up your offer to be quantitative):
Aligned AIs built tomorrow: 100% of the value from my perspective
Aligned AIs built in 100 years: 50% of the value
Unaligned AIs built tomorrow: 15% of the value
Unaligned AIs built in 100 years: 25% of the value
Note that I haven’t thought about these exact numbers much.
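To show how I’d plug numbers like these into the go-ahead question, here is a rough sketch; the probabilities of ending up with aligned rather than unaligned AI under each option are made-up placeholders, not estimates I’m defending:

```python
# Rough expected-value comparison using the value estimates above.
# The p_aligned_* probabilities are purely illustrative assumptions.

values = {
    ("aligned", "tomorrow"): 1.00,
    ("aligned", "in 100 years"): 0.50,
    ("unaligned", "tomorrow"): 0.15,
    ("unaligned", "in 100 years"): 0.25,
}

p_aligned_tomorrow = 0.7      # hypothetical chance alignment works if we build now
p_aligned_in_100_years = 0.9  # hypothetical chance it works after a long delay

ev_build_now = (p_aligned_tomorrow * values[("aligned", "tomorrow")]
                + (1 - p_aligned_tomorrow) * values[("unaligned", "tomorrow")])
ev_delay = (p_aligned_in_100_years * values[("aligned", "in 100 years")]
            + (1 - p_aligned_in_100_years) * values[("unaligned", "in 100 years")])

print(f"EV(build now)   = {ev_build_now:.3f}")  # 0.7 * 1.00 + 0.3 * 0.15 = 0.745
print(f"EV(delay 100 y) = {ev_delay:.3f}")      # 0.9 * 0.50 + 0.1 * 0.25 = 0.475
```

Under those made-up probabilities the delay doesn’t pay, which is just to say the 50% penalty on delay is doing a lot of work; different value or probability estimates flip the comparison.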
What drives this huge drop? Naive utility would be very close to 100%. (Do you mean “aligned AIs built in 100 years if humanity still exists by that point”, which includes extinction risk before 2123?)
I attempted to explain the basic intuitions behind my judgement in this thread. Unfortunately it seems I did a poor job. For the full explanation you’ll have to wait until I write a post, if I ever get around to doing that.
The simple, short, and imprecise explanation is: I don’t really value humanity as a species as much as I value the people who currently exist, (something like) our current communities and relationships, our present values, and the existence of sentient and sapient life living positive experiences. Much of this will go away after 100 years.
TBC, it’s plausible that in the future I’ll think that “marginally influencing AIs to have more sensible values” is more leveraged than “avoiding AI takeover and hoping that humans (and our chosen successors) do something sensible”. I’m partially deferring to others on the view that AI takeover is the best angle of attack; perhaps I should examine this further.
(Of course, it could be that from a longtermist perspective other stuff is even better than avoiding AI takeover or altering AI values. E.g. maybe one of conflict avoidance, better decision theory, or better human institutions for post singularity is even better.)
I certainly wish the question of how much worse/better AI takeover is relative to human control was investigated more effectively. It seems notable to me how important this question is from a longtermist perspective and how little investigation it has received.
(I’ve spent maybe 1 person-day thinking about it and I think probably less than 3 FTE-years have been put into this by people who I’d be interested in deferring to.)
The process through which aliens would get values like ours seems much less robust than the process through which AIs get our values. AIs are trained on our data, and humans will presumably care a lot about aligning them (at least at first).
Note that I’m conditioning on AIs successfully taking over which is strong evidence against human success at creating desirable (edit: from the perspective of the creators) AIs.
if I understood your views further. I find it highly unlikely that AGIs will be even more “alien” from the perspective of our values than literal aliens
For an intuition pump, consider future AIs which are trained for the equivalent of 100 million years of next-token-prediction[1] on low quality web text and generated data and then aggressively selected with outcomes based feedback. This outcomes based feedback results in selecting the AIs for carefully tricking their human overseers in a variety of cases and generally ruthlessly pursuing reward.
This scenario is somewhat worse than what I expect in the median world. But in practice I expect that it’s at least systematically possible to change the training setup to achieve predictably better AI motivations and values. Beyond trying to influence AI motivations with crude tools, it seems even better to have humans retain control, use AIs to do a huge amount of R&D (or philosophy work), and then decide what should actually happen with access to more options.
Another way to put this is that I feel notably better about the decision making of current power structures in the Western world and in AI labs than I feel about going with the AI motivations which are likely to result from training.
More generally, if you are the sole person in control, it seems strictly better from your perspective to carefully reflect on who/what you want to defer to rather than doing this somewhat arbitrarily (this still leaves open the question of how bad arbitrarily deferring is).
From my perspective this is a bit like saying you’d prefer aliens to take over the universe rather than handing control over to our genetically engineered human descendants. I’d be very skeptical of that view too for some basic reasons.
I’m pretty happy with slow and steady genetic engineering as a handover process, but I would prefer something even slower and more deliberate than this. E.g., existing humans thinking carefully, for as long as that seems to yield returns, about what beings we should defer to, and then deferring to those slightly smarter beings, which think for a long time and defer to other beings, etc.
I guess I mostly think that’s a pretty bizarre view, with some obvious reasons for doubt, and I don’t know what would be driving it.
Part of my view on aliens or dogs is driven by the principle of “aliens/dogs are in a somewhat similar position to us, so we should be fine with swapping” (roughly speaking) and “the part of my values which seems most dependent on random empirical contingencies about evolved life I put less weight on”. These intuitions transfer somewhat less to the AI case.
Current AIs are trained on perhaps 10-100 trillion tokens, and if we treat 1 token as the equivalent of 1 second, then (100*10^12)/(60*60*24*365) ≈ 3 million years.
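Spelling that footnote’s arithmetic out, assuming the same 1-token-per-second equivalence used above:

```python
# Convert a token count into "subjective years" at 1 token ~= 1 second.

SECONDS_PER_YEAR = 60 * 60 * 24 * 365  # 31,536,000

def tokens_to_years(tokens: float) -> float:
    return tokens / SECONDS_PER_YEAR

print(f"{tokens_to_years(10e12):.2e} years")   # 10 trillion tokens:  ~3.2e5 years
print(f"{tokens_to_years(100e12):.2e} years")  # 100 trillion tokens: ~3.2e6 years (~3 million)
```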
Note that I’m conditioning on AIs successfully taking over which is strong evidence against human success at creating desirable AIs.
I don’t think it’s strong evidence, for what it’s worth. I’m also not sure what “AI takeover” means, and I think existing definitions are very ambiguous (would we say Europe took over the world during the age of imperialism? Are smart people currently in control of the world? Have politicians, as a class, taken over the world?). Depending on the definition, I tend to think that AI takeover is either ~inevitable and not inherently bad, or bad but not particularly likely.
This outcomes based feedback results in selecting the AIs for carefully tricking their human overseers in a variety of cases and generally ruthlessly pursuing reward.
Would aliens not also be incentivized to trick us or others? What about other humans? In my opinion, basically all the arguments about AI deception from gradient descent apply in some form to other methods of selecting minds, including evolution by natural selection, cultural learning, and in-lifetime learning. Humans frequently lie to or mislead each other about our motives. For example, if you ask a human what they’d do if they became world dictator, I suspect you’d often get a different answer than the one they’d actually choose if given that power. I think this is essentially the same epistemic position we might occupy with AI.
Also, for a bunch of reasons that I don’t currently feel like elaborating on, I expect humans to anticipate, test for, and circumvent the most egregious forms of AI deception in practice. The most important point here is that I’m not convinced that incentives for deception are much worse for AIs than for other actors in different training regimes (including humans, uplifted dogs, and aliens).
I don’t think it’s strong evidence, for what it’s worth. I’m also not sure what “AI takeover” means, and I think existing definitions are very ambiguous (would we say Europe took over the world during the age of imperialism? Are smart people currently in control of the world? Have politicians, as a class, taken over the world?). Depending on the definition, I tend to think that AI takeover is either ~inevitable and not inherently bad, or bad but not particularly likely.
By “AI takeover”, I mean autonomous AI coup/revolution. E.g., violating the law and/or subverting the normal mechanisms of power transfer. (Somewhat unclear exactly what should count tbc, but there are some central examples.) By this definition, it basically always involves subverting the intentions of the creators of the AI, though may not involve violent conflict.
I don’t think this is super likely, perhaps 25% chance.
Also, for a bunch of reasons that I don’t currently feel like elaborating on, I expect humans to anticipate, test for, and circumvent the most egregious forms of AI deception in practice. The most important point here is that I’m not convinced that incentives for deception are much worse for AIs than for other actors in different training regimes (including humans, uplifted dogs, and aliens).
I don’t strongly disagree with either of these claims, but this isn’t exactly where my crux lies.
The key thing is “generally ruthlessly pursuing reward”.

I’m checking out of this conversation though.
The key thing is “generally ruthlessly pursuing reward”.
It depends heavily on what you mean by this, but I’m kinda skeptical of the strong version of ruthless reward seekers, for similar reasons given in this post. I think AIs by default might be ruthless in some other senses—since we’ll be applying a lot of selection pressure to them to get good behavior—but I’m not really sure how much weight to put on the fact that AIs will be “ruthless” when evaluating how good they are at being our successors. It’s not clear how that affects my evaluation of how much I’d be OK handing the universe over to them, and my guess is the answer is “not much” (absent more details).
Humans seem pretty ruthless in certain respects too, e.g. about survival or increasing their social status. I’d expect aliens, and potentially uplifted dogs, to be ruthless along some axes too, depending on how we uplifted them.
Alright, that’s fine.