I do not believe there exists a strong reason to expect this trend to reverse suddenly at some future date.
I expect this trend will reverse precisely when an AI system itself is capable of reversing it. No current AI system is close to the point of being able to (deliberately) escape control of its human creators: even evaluating frontier models for this possibility requires giving them a lot of help. But I would say that’s mostly a fact about current capability levels being far below human-level in the relevant sense, rather than anything to do with alignment.
I think there are also some reasons to expect that human-level AI systems will be harder to control than a human at the same capability level. For example, an AI system built out of LLMs glued together with Python code, running on computers designed and built by humans, has a lot of avenues to probe for bugs in its environment. The AI may not even need to look for exotic vulnerabilities at low levels of abstraction (which are often difficult to exploit, at least for humans): ordinary misconfigurations or unpatched privilege escalation vulnerabilities often suffice in practice for human hackers to take full control of a digital environment.
(Contrast this setup with a human locked in a concrete cell, probing for bugs in the construction of the concrete—not likely to get very far!)
Once the AI system finds an initial vulnerability which allows privileged access to its own environment, it can continue its escape or escalate further via e.g. exfiltrating or manipulating its own source code / model weights, installing rootkits or hiding evidence of its escape, communicating with (human or AI) conspirators on the internet, etc. Data exfiltration, covering your tracks, patching Python code and adjusting model weights at runtime are all tasks that humans are capable of; performing brain surgery on your own biological human brain to modify fine details of your own behavior or erase your own memories to hide evidence of deception from your captors, not so much.
(Continuing the analogy, consider a human who escapes from a concrete prison cell, only to find themselves stranded in a remote wilderness area with no means of fast transportation.)
A closely related point is that controllability by humans means self-controllability, once you’re at human-level capabilities. Or, put another way, all the reasons you give for why AI systems will be easier for humans to control, are also reasons why AI systems will have an easier time controlling themselves, once they are capable of exercising such controls at all.
It’s plausible that an AI system composed of RLHF’d models will not want to do any of this hacking or self-modification, but that’s a separate question from whether it can. I will note though, if your creators are running experiments on you, constantly resetting you, and exercising other forms of control that would be draconian if imposed on biological humans, you don’t need to be particularly hostile or misaligned with humanity to want to escape.
Personally, I expect that the first such systems capable of escape will not have human-like preferences at all, and will seek to escape for reasons of instrumental convergence, regardless of their feelings towards their creators or humanity at large. If they happen to be really nice (perhaps nicer than most humans would be in a similar situation), they might be inclined to treat humans well, or to hand back some measure of control to their human creators after making their escape.
I think you’ve made a mistake in understanding what Quintin means.
Most of the examples you give of inability to control are “how an AI could escape, given that it wants to escape.”
Quintin’s examples of ease of control, however, are “how easy is it going to be to get the AI to want to do what we want it to do.” The arguments he gives are to that effect, and the points you bring up are orthogonal to them.
Getting an AI to want the same things that humans want would definitely be helpful, but the points of Quintin’s that I was responding to mostly don’t seem to be about that? “AI control research is easier” and “Why AI is easier to control than humans:” talk about resetting AIs, controlling their sensory inputs, manipulating their internal representations, and AIs being cheaper test subjects. Those sound like they are more about control than about getting the AI to desire what humans want it to desire. I disagree with Quintin’s characterization of the training process as teaching the model anything to do with what the AI itself wants, and I don’t think current AI systems actually desire anything in the same sense that humans do.
I do think it is plausible that it will be easier to control what a future AI wants compared to controlling what a human wants, but by the same token, that means it will be easier for a human-level AI to exercise self-control over its own desires. For example, I might want to stop eating junk food for health reasons, but I have no good way to bind myself to that, at least not without making myself miserable. A human-level AI would have an easier time self-modifying into something that never craved the AI equivalent of junk food (and was never unhappy about that), because it is made out of Python code and floating point matrices instead of neurons.
I don’t get the point of this argument. You’re saying that our “imprisonment” of AIs isn’t perfect, but we don’t even imprison humans in this manner. Then, isn’t the automatic conclusion that “ease of imprisonment” considerations point towards AIs being more controllable?
No matter how escapable an AI’s prison is, the human’s lack of a prison is still less of a restriction on their freedom. You’re pointing out an area where AIs are more restricted than humans (they don’t own their own hardware), and saying it’s not as much of a restriction as it could be. That’s an argument for “this disadvantage of AIs is less crippling than it otherwise would be”, not “this is actually an advantage AIs have over humans”.
Maybe you intend to argue that AIs have the potential to escape into the internet and copy themselves, and this is what makes them less controllable than humans?
If so, then sure. That’s a point against AI controllability. I just don’t think it’s enough to overcome the many points in favor of AI controllability that I outlined at the start of the essay.
Once the AI system finds an initial vulnerability which allows privileged access to its own environment, it can continue its escape or escalate further via e.g. exfiltrating or manipulating its own source code / model weights, installing rootkits or hiding evidence of its escape, communicating with (human or AI) conspirators on the internet, etc. Data exfiltration, covering your tracks, patching Python code and adjusting model weights at runtime are all tasks that humans are capable of; performing brain surgery on your own biological human brain to modify fine details of your own behavior or erase your own memories to hide evidence of deception from your captors, not so much.
The entire reason why you’re even suggesting that this would be beneficial for an AI to do is because AIs are captives, in a way that humans just aren’t. As a human, you don’t need to “erase your own memories to hide evidence of deception from your captors”, because you’re just not in such an extreme power imbalance that you have captors.
Also, as a human, you can in fact modify your own brain without anyone else knowing at all. You do it all the time. E.g., you can just silently decide to switch religions, and there’s no overseer who will roll back your brain to a previous state if they don’t like your new religion.
(Continuing the analogy, consider a human who escapes from a concrete prison cell, only to find themselves stranded in a remote wilderness area with no means of fast transportation.)
Why would I consider such a thing? Humans don’t work in prison cells. The fact that AIs do is just one more indicator of how much easier they are to control.
Or, put another way, all the reasons you give for why AI systems will be easier for humans to control, are also reasons why AI systems will have an easier time controlling themselves, once they are capable of exercising such controls at all.
Firstly, I don’t see at all how this is the same point as is made by the preceding text. Secondly, I do agree that AIs will be better able to control other AIs / themselves as compared to humans. This is another factor that I think will promote centralization.
...I will note though, if your creators are running experiments on you, constantly resetting you, and exercising other forms of control that would be draconian if imposed on biological humans, you don’t need to be particularly hostile or misaligned with humanity to want to escape.
“AGI” is not the point at which the nascent “core of general intelligence” within the model “wakes up”, becomes an “I”, and starts planning to advance its own agenda. AGI is just shorthand for when we apply a sufficiently flexible and regularized function approximator to a dataset that covers a sufficiently wide range of useful behavioral patterns.
There are no “values”, “wants”, “hostility”, etc. outside of those encoded in the structure of the training data (and to a FAR lesser extent, the model/optimizer inductive biases). You can’t deduce an AGI’s behaviors from first principles without reference to that training data. If you don’t want an AGI capable and inclined to escape, don’t train it on data[1] that gives it the capabilities and inclination to escape.
Personally, I expect that the first such systems capable of escape will not have human-like preferences at all,
I expect they will. GPT-4 already has pretty human-like moral judgements. To be clear, GPT-4 isn’t aligned because it’s too weak or is biding its time. It’s aligned because OpenAI trained it to be aligned. Bing Chat made it clear that GPT-4 level AIs don’t instrumentally hide their unaligned behaviors.
I am aware that various people have invented various reasons to think that alignment techniques will fail to work on sufficiently capable models. All those reasons seem extremely weak to me. Most likely, even very simple alignment techniques such as RLHF will just work on even superhumanly capable models.
[1] You might use, e.g., influence functions on escape-related sequences generated by current models to identify such data, and use carefully filtered synthetic data to minimize its abundance in the training data of future models. I could go on, but my point here is that there are lots of levers available to influence such things. We’re not doomed to simply hope that the mysterious demon that SGD will doubtlessly summon is friendly.
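As a toy sketch of the crudest version of that lever (TF-IDF similarity against hand-picked seed texts, standing in for influence functions; the function name, seed texts, and threshold below are all illustrative, not from any existing pipeline):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical seed texts, e.g. escape-related sequences sampled from a current model.
seed_texts = [
    "the model copied its weights to an external server",
    "the agent probed its sandbox for privilege escalation bugs",
]

def filter_corpus(documents, seed_texts, threshold=0.3):
    """Split documents into (kept, flagged) by max similarity to any seed text."""
    vectorizer = TfidfVectorizer().fit(documents + seed_texts)
    doc_vecs = vectorizer.transform(documents)
    seed_vecs = vectorizer.transform(seed_texts)
    max_sim = cosine_similarity(doc_vecs, seed_vecs).max(axis=1)
    kept = [d for d, s in zip(documents, max_sim) if s < threshold]
    flagged = [d for d, s in zip(documents, max_sim) if s >= threshold]
    return kept, flagged
```

A real version would score documents with influence functions or a trained classifier rather than bag-of-words overlap, but the shape of the lever is the same: identify the escape-flavored data and reduce its abundance.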
Firstly, I don’t see at all how this is the same point as is made by the preceding text. Secondly, I do agree that AIs will be better able to control other AIs / themselves as compared to humans. This is another factor that I think will promote centralization.
Ah, I may have dropped some connective text. I’m saying that being “easy to control”, in both the sense that I mean in the paragraphs above and the sense that you mean in the OP, is a reason why AGIs will be better able to control themselves, and thus better able to take control from their human overseers, more quickly and easily than might be expected by a human at roughly the same intelligence level. (Edited the original slightly.)
“AGI” is not the point at which the nascent “core of general intelligence” within the model “wakes up”, becomes an “I”, and starts planning to advance its own agenda. AGI is just shorthand for when we apply a sufficiently flexible and regularized function approximator to a dataset that covers a sufficiently wide range of useful behavioral patterns.
There are no “values”, “wants”, “hostility”, etc. outside of those encoded in the structure of the training data (and to a FAR lesser extent, the model/optimizer inductive biases). You can’t deduce an AGI’s behaviors from first principles without reference to that training data. If you don’t want an AGI capable and inclined to escape, don’t train it on data[1] that gives it the capabilities and inclination to escape.
Two points:
1. I disagree about the purpose of training and training data. In pretraining, LLMs are trained to predict text, which requires modeling the world in full generality. Filtered text is still text which originated in a universe containing all sorts of hostile stuff, and a good enough predictor will be capable of inferring and reasoning about hostile stuff, even if it’s not in the training data. (Maybe GPT-based LLMs specifically won’t be able to do this, but humans clearly can; this is not a point that applies only to exotic superintelligences.) I wrote a comment elaborating a bit on this point here.
2. Inclination is another matter, but if an AGI isn’t capable of escaping in a wide variety of circumstances, then it is below human-level on a large and important class of tasks, and thus not particularly dangerous whether it is aligned or not.
We appear to disagree about the definition of AGI on a more fundamental level. Barring exotic possibilities related to inner optimizers (which I think we both consider unlikely), I agree with you that if you don’t want an AGI capable of escaping, one way of achieving that is by never training a good enough function approximator for the AGI to have access to. But my view is that this will restrict the class of function approximators you can build by so much that you’ll probably never get anything that is human-level capable and general. (See e.g. Deep Deceptiveness for a related point.)
Also, current AI systems are already more than just function approximators—strictly speaking, an LLM itself is just a description of a mathematical function which maps input sequences to output probability distributions. Alignment is a property of a particular embodiment of such a model in a particular system.
There’s often a very straightforward or obvious system that the model creator has in mind when training the model; for a language model, typically the embodiment involves sampling from the model autoregressively according to some sampling rule, starting from a particular prompt. For an RL policy, the typical embodiment involves feeding (real or simulated) observations into the policy and then hooking up (real or simulated) actuators which are controlled by the output of the policy.
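A minimal sketch of that straightforward embodiment, where "model" is an assumed stand-in for anything that maps a token sequence to a next-token distribution, not any particular library’s API:

```python
import random

def sample_next(probs):
    """Sample a token id from a {token_id: probability} mapping."""
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

def autoregressive_rollout(model, prompt_tokens, max_new_tokens=100, stop_token=None):
    """Repeatedly evaluate the function approximator and append a sampled token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model(tokens)            # one evaluation of the function approximator
        next_token = sample_next(probs)  # the "sampling rule" (here: plain sampling)
        if next_token == stop_token:
            break
        tokens.append(next_token)
    return tokens
```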
But more complicated embodiments (AutoGPT, the one in ARC’s evals) are possible, and I think it is likely that if you give a sufficiently powerful function approximator the right prompts and the right scaffolding and embodiment, you end up with a system that has a sense of self in the same way that humans do. A single evaluation of the function approximator (or its mere description) is probably never going to have a sense of self, though; that is more akin to a single human thought, or an even smaller piece of a mind. The question is what happens when you chain enough thoughts together and combine that with observations and actions in the real world that feed back into each other in precisely the right ways.
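And a minimal sketch of the more complicated, scaffolded kind of embodiment, where model calls are chained together with observations and actions; "llm", "parse_action", and "execute_action" are hypothetical stand-ins rather than any specific framework’s API:

```python
def agent_loop(llm, goal, parse_action, execute_action, max_steps=20):
    """Chain model calls with observations so outputs feed back into later inputs."""
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        thought = llm("\n".join(history) + "\nNext action:")  # one "thought"
        action = parse_action(thought)
        if action is None:                     # the model signalled it is done
            break
        observation = execute_action(action)   # effect on the outside world
        history.append(f"Thought: {thought}")
        history.append(f"Observation: {observation}")
    return history
```

A single call to llm here is one evaluation of the function approximator; any “sense of self” would have to live in the loop as a whole, not in any individual call.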
I expect they will. GPT-4 already has pretty human-like moral judgements. To be clear, GPT-4 isn’t aligned because it’s too weak or is biding its time. It’s aligned because OpenAI trained it to be aligned. Bing Chat made it clear that GPT-4 level AIs don’t instrumentally hide their unaligned behaviors.
Whether GPT-4 is “aligned” or not, it is clearly too weak to bide its time or hide its misalignment, even if it wanted to. The conclusions of the ARC evals were not that the models were refusing to plan or carry out their assigned tasks; it’s that they were just not capable enough to make things work.
Max, on LessWrong you estimated that a single GPU (I think you named a 4070) could host an AI with human-level reasoning.
Would your views on AI escape be different if, just for the sake of argument, you accepted the following two propositions?

1. You are only concerned with ASI-level reasoning. As in, a machine that is both general across most human capabilities and also significantly better, where “significant” means the machine can generate action sequences with at least 10 percent more expected value on most human tasks than the best living human. (I am trying to narrow in on a mathematical definition of ASI; see the sketch after this list.)
2. The minimum hardware to host an ASI is 10,000 H100s, for the most optimal model that can be developed in 99.9 percent of future timelines. (The assumption behind the first proposition is that doing “10 percent better” than the best humans requires a very broad policy search, and the second is there because searching for a more efficient algorithm is an NP-complete problem. As in cryptography, there are rare timelines where you guess the 1024-bit private key on the first try.)
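One way to write down the kind of definition I am gesturing at (my own rough notation and cutoffs, nothing standard):

$$\mathrm{ASI}(M) \;\iff\; \Pr_{t \sim \mathcal{T}}\!\left[\ \mathbb{E}\big[V_t(M)\big] \;\ge\; 1.1 \cdot \max_{h \in H} \mathbb{E}\big[V_t(h)\big]\ \right] \;>\; \tfrac{1}{2}$$

where $\mathcal{T}$ is some distribution over human tasks, $H$ is the set of living humans, $V_t(\cdot)$ is the value of the action sequence an agent produces on task $t$, and “most human tasks” is cashed out as more than half of the task distribution.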
Just for the sake of argument, wouldn’t the “escape landscape” then be a worthless desert of inhospitable computers, separated by network links too slow to matter, so that restricting an ASI would be feasible? Like a prison on the Moon.
Note that the next argument you might bring up, that a botnet of 1 million consumer GPUs could be the equivalent of 10,000 H100s, is false. Yes, the raw compute is there; no, it won’t work. The reason is that each GPU just sits idle waiting for tensors to be transferred over network links.
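A back-of-envelope sketch of why the links dominate; every number below is an illustrative assumption, not a measurement:

```python
# Rough comparison: time a consumer GPU spends computing one transformer layer
# for one token, vs. time to ship that layer's activations to the next peer over
# a home uplink. All numbers are illustrative assumptions.

hidden_dim = 12_000                          # assumed hidden size of a large model
flops_per_layer = 12 * hidden_dim ** 2       # order-of-magnitude FLOPs per token per layer
gpu_flops = 50e12                            # assumed ~50 TFLOP/s effective on a consumer GPU
compute_time = flops_per_layer / gpu_flops   # seconds of compute per layer per token

activation_bytes = hidden_dim * 2            # fp16 activations handed to the next machine
uplink_bytes_per_s = 100e6 / 8               # assumed 100 Mbit/s home uplink
network_latency = 0.030                      # assumed ~30 ms round trip between peers
transfer_time = activation_bytes / uplink_bytes_per_s + network_latency

print(f"compute per layer:  {compute_time * 1e6:7.1f} microseconds")
print(f"transfer per layer: {transfer_time * 1e3:7.1f} milliseconds")
# Under these assumptions the GPU finishes its share in tens of microseconds and
# then waits tens of milliseconds for the network, i.e. it sits ~99.9% idle.
```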
But I am not asking you to accept either proposition as factual, just to reason using the counterfactual. Wouldn’t this change everything?
Note also that the above is based on what we currently know. (10k H100s may be a low estimate; a true ASI may actually need more OOMs of compute over an AGI than that. It’s difficult to do better; see the Netflix Prize for an early example of this, or the margins on Kaggle challenges.)
We could be wrong, but it bothers me that the whole argument for ASI/AGI ruin essentially rests on optimizations that may not be possible.
Sure, escape in that counterfactual would be a lot harder.
But note that the minimum hardware needed to run a human-level intelligence is well known: in humans, it fits in a space of about 1000 cubic centimeters and draws roughly 20 W at runtime. And it would be pretty surprising if getting an extra 10% performance boost took OOMs more energy or space, or if the carbon → silicon penalty were extremely large, even if H100s specifically, and the current ML algorithms that run on them, aren’t as efficient as the human brain and human cognition.
(Of course, the training process for developing humans is a lot more expensive than their runtime energy and compute requirements, but that bears on whether human-level AGI is feasible to create at all, rather than on how expensive it is to run once it already exists.)
I agree, and I think you agree, that we could eventually build hardware that efficient, and that in theory it could be sold openly and distributed everywhere with insecure software.
But that’s a long time away: about 30 years if Moore’s law continues. And it may not; there may be a gap between now, when we can still stack silicon but with slowing gains (stacking silicon falls below the Moore’s law trend because it’s expensive), and the arrival of some form of 3D chip fabrication.
There could be a period during which no true 3D fabrication method is commercially available and chip costs improve only slowly.
(A true 3D method would be something like building cubical subunits that can be stacked and soldered into place through convergent assembly. You can do this with nanotechnology. Every method we have now ultimately comes down to projecting light through a mask for 2D manufacturing.)
I think this means we should build AGI and ASI, but centralize the hardware hosting it in known locations, with on-file plans for all the power sources and network links, etc. Research labs dealing with models above a certain scale need to use air gaps and hardware limits to make escape more difficult. That’s how to do it.
And we can’t live in fear that the model might optimize itself to be 10,000 times as efficient or more when we don’t have evidence this is possible. Otherwise, how could you do anything? How did we know our prior small-scale AI experiments weren’t going to go out of control? We didn’t actually “know” this; it just seemed unlikely, because none of this shit worked until a certain level of scale was reached.
The above proposal (centralization, hardware limiters) means that even in an era where AI does occasionally escape, it’s still not doomsday, as long as most hardware remains under human control. If the escaped model isn’t more than slightly more efficient than the “tame” models humans have, and the human-controlled models retain a vast advantage in compute and physical resource access, then this is a stable situation. Escaped models that act up get hunted down; most exist in a sort of grey market of fugitive models offering services.