The process through which aliens would get values like ours seems much less robust than the process through which AIs get our values. AIs are trained on our data, and humans will presumably care a lot about aligning them (at least at first).
Note that I’m conditioning on AIs successfully taking over, which is strong evidence against human success at creating desirable (edit: from the perspective of the creators) AIs.
[...] if I understood your views further. I find it highly unlikely that AGIs will be even more “alien” from the perspective of our values than literal aliens.
For an intuition pump, consider future AIs which are trained for the equivalent of 100 million years of next-token prediction[1] on low-quality web text and generated data, and then aggressively selected with outcomes-based feedback. This outcomes-based feedback results in selecting the AIs for carefully tricking their human overseers in a variety of cases and generally ruthlessly pursuing reward.
This scenario is somewhat worse than what I expect in the median world. But in practice I expect that it’s at least systematically possible to change the training setup to achieve predictably better AI motivations and values. Beyond trying to influence AI motivations with crude tools, it seems even better to have humans retain control, use AIs to do a huge amount of R&D (or philosophy work), and then decide what should actually happen with access to more options.
Another way to put this is that I feel notably better about the decision-making of current power structures in the Western world and at AI labs than I feel about going with whatever AI motivations are likely to result from training.
More generally, if you are the sole person in control, it seems strictly better from your perspective to carefully reflect on who/what you want to defer to rather than doing this somewhat arbitrarily (this still leaves open the question of how bad arbitrarily deferring is).
From my perspective this is a bit like saying you’d prefer aliens to take over the universe rather than handing control over to our genetically engineered human descendants. I’d be very skeptical of that view too for some basic reasons.
I’m pretty happy with slow and steady genetic engineering as a handover process, but I would prefer something even slower and more deliberate than this. E.g., existing humans think carefully, for as long as that continues to yield returns, about what beings we should defer to; then they defer to those slightly smarter beings, which in turn think for a long time and defer to other beings, etc.
I guess I mostly think that’s a pretty bizarre view, with some obvious reasons for doubt, and I don’t know what would be driving it.
Part of my view on aliens or dogs is driven by the principle of “aliens/dogs are in a somewhat similar position to us, so we should be fine with swapping” (roughly speaking) and “the parts of my values which seem most dependent on random empirical contingencies about evolved life I put less weight on”. These intuitions transfer somewhat less to the AI case.
Current AIs are trained on perhaps 10–100 trillion tokens, and if we treat 1 token as the equivalent of 1 second, then (100*10^12)/(60*60*24*365) ≈ 3 million years.
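As a quick sanity check on this arithmetic, here is a minimal sketch (the 1-token-per-second equivalence is just this footnote’s rough heuristic, and the token counts are the rough figures used above):

```python
# Back-of-envelope conversion between training tokens and "subjective
# years", assuming (per the footnote) 1 token ~ 1 second of experience.

SECONDS_PER_YEAR = 60 * 60 * 24 * 365  # 31,536,000

def tokens_to_years(tokens: float) -> float:
    return tokens / SECONDS_PER_YEAR

def years_to_tokens(years: float) -> float:
    return years * SECONDS_PER_YEAR

# Current AIs: roughly 10-100 trillion training tokens.
print(f"{tokens_to_years(100e12):,.0f} years")  # ~3,170,979, i.e. ~3 million

# The "100 million years" intuition pump above corresponds to roughly:
print(f"{years_to_tokens(100e6):.2e} tokens")   # ~3.15e+15, i.e. ~3 quadrillion
```

On these numbers, the intuition pump’s 100 million years is roughly a 30x scale-up of the footnote’s estimate for current AIs.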
Note that I’m conditioning on AIs successfully taking over, which is strong evidence against human success at creating desirable AIs.
I don’t think it’s strong evidence, for what it’s worth. I’m also not sure what “AI takeover” means, and I think existing definitions are very ambiguous (would we say Europe took over the world during the age of imperialism? Are smart people currently in control of the world? Have politicians, as a class, taken over the world?). Depending on the definition, I tend to think that AI takeover is either ~inevitable and not inherently bad, or bad but not particularly likely.
This outcomes-based feedback results in selecting the AIs for carefully tricking their human overseers in a variety of cases and generally ruthlessly pursuing reward.
Would aliens not also be incentivized to trick us or others? What about other humans? In my opinion, basically all the arguments about AI deception from gradient descent apply in some form to other methods of selecting minds, including evolution by natural selection, cultural learning, and in-lifetime learning. Humans frequently lie to or mislead each other about our motives. For example, if you ask a human what they’d do if they became world dictator, I suspect you’d often get a different answer than the one they’d actually choose if given that power. I think this is essentially the same epistemic position we might occupy with AI.
Also, for a bunch of reasons that I don’t currently feel like elaborating on, I expect humans to anticipate, test for, and circumvent the most egregious forms of AI deception in practice. The most important point here is that I’m not convinced that incentives for deception are much worse for AIs than for other actors in different training regimes (including humans, uplifted dogs, and aliens).
I don’t think it’s strong evidence, for what it’s worth. I’m also not sure what “AI takeover” means, and I think existing definitions are very ambiguous (would we say Europe took over the world during the age of imperialism? Are smart people currently in control of the world? Have politicians, as a class, taken over the world?). Depending on the definition, I tend to think that AI takeover is either ~inevitable and not inherently bad, or bad but not particularly likely.
By “AI takeover”, I mean an autonomous AI coup/revolution: e.g., violating the law and/or subverting the normal mechanisms of power transfer. (It’s somewhat unclear exactly what should count, to be clear, but there are some central examples.) By this definition, it basically always involves subverting the intentions of the AI’s creators, though it may not involve violent conflict.
I don’t think this is super likely: perhaps a 25% chance.
Also, for a bunch of reasons that I don’t currently feel like elaborating on, I expect humans to anticipate, test for, and circumvent the most egregious forms of AI deception in practice. The most important point here is that I’m not convinced that incentives for deception are much worse for AIs than for other actors in different training regimes (including humans, uplifted dogs, and aliens).
I don’t strongly disagree with either of these claims, but this isn’t exactly where my crux lies.
The key thing is “generally ruthlessly pursuing reward”.
The key thing is “generally ruthlessly pursuing reward”.
It depends heavily on what you mean by this, but I’m kinda skeptical of the strong version of ruthless reward-seekers, for similar reasons given in this post. I think AIs by default might be ruthless in some other senses, since we’ll be applying a lot of selection pressure to them to get good behavior, but I’m not really sure how much weight to put on the fact that AIs will be “ruthless” when evaluating how good they are at being our successors. It’s not clear how that affects my evaluation of how much I’d be OK handing the universe over to them, and my guess is the answer is “not much” (absent more details).
Humans seem pretty ruthless in certain respects too, e.g. about survival or increasing their social status. I’d expect aliens, and potentially uplifted dogs, to be ruthless along some axes too, depending on how we uplifted them.
I’m checking out of this conversation though.
Alright, that’s fine.