I take this post to argue that, just as an AGI's alignment property won't generalise well out-of-distribution, its ability to actually do things, i.e. achieve its goals, also won't generalise well out-of-distribution. Does that seem like a fair (if brief) summary?
As an aside, I feel like it's more fruitful to talk about specific classes of defects rather than all of them together. You use the word 'bug' to mean everything from divide-by-zero crashes to wrong beliefs, which leads you to write things like 'the inherent bugginess of AI is a very good thing for AI safety', whereas the entire field of AI safety seems to exist precisely because AIs will have bugs (i.e. deviations from desired/correct behaviour); so if anything an inherent lack of bugs in AI would be better for AI safety.
Yes, that's a fair summary. I think that perfect alignment is pretty much impossible, as is perfectly rational/bug-free AI. I think the latter fact may give us enough breathing room to get alignment at least good enough to avert extinction.
I feel like it's more fruitful to talk about specific classes of defects rather than all of them together. You use the word 'bug' to mean everything from divide-by-zero crashes to wrong beliefs
That's fair; I think if people were to further explore this topic it would make sense to separate them out. And good point about the bugginess passage, I've edited it to be more accurate.
The belief is that as soon as we create an AI with at least human-level general intelligence, it will be relatively easy for it to use its superior reasoning, extensive knowledge, and superhuman thinking speed to take over the world.
This depends on what 'human-level' means. There is some threshold such that an AI past that threshold could quickly take over the world, and it doesn't really matter whether we call that 'human-level' or not.
overall it seems like 'make AI stupid' is a far easier task than 'make the AI's goals perfectly aligned'.
Sure. But the relevant task isn't 'make something that won't kill you'. It's more like 'make something that will stop any AI from killing you', or maybe 'find a way to do alignment without much cost and without sacrificing much usefulness'. If you and I make stupid AI, great, but some lab will realize that non-stupid AI could be more useful, and will make it by default.
This is very true. However, the OP's point still helps us, as an AI that is simultaneously smart enough to be useful in a narrow domain, misaligned, but also too stupid to take over the world could help us reduce x-risk. In particular, if it is superhumanly good at alignment research, then it could output good alignment research as part of its deception phase. This would help reduce the risk from future AIs significantly without causing x-risk since, ex hypothesi, the AI is too stupid to take over. The main question here is whether an AI could be smart enough to do very good alignment research and also too stupid to take over the world if it tried. I am skeptical but pretty uncertain, so I would give it at least a 10% chance of being true, and maybe higher.
This depends on what 'human-level' means. There is some threshold such that an AI past that threshold could quickly take over the world, and it doesn't really matter whether we call that 'human-level' or not.
Indeed, this post is not an attempt to argue that AGI could never be a threat, merely that the 'threshold for subjugation' is much higher than 'any AGI', as many people imply. Human-level is just a marker for a level of intelligence that most people will agree counts as AGI, but which (due to mental flaws) is most likely not capable of world domination. For example, I do not believe an AI brain upload of Bobby Fischer could take over the world.
This makes a difference, because it means that the world in which the actual x-risk AGI comes into being is one in which a lot of earlier, non-deadly AGIs already exist and can be studied, or used against the rogue.
Sure. But the relevant task isn't 'make something that won't kill you'. It's more like 'make something that will stop any AI from killing you', or maybe 'find a way to do alignment without much cost and without sacrificing much usefulness'. If you and I make stupid AI, great, but some lab will realize that non-stupid AI could be more useful, and will make it by default.
Current narrow machine learning AI is extraordinarily stupid at things it isn't trained for, and yet it is still massively funded and incredibly powerful. Nobody is hankering to put a detailed understanding of quantum mechanics into DALL-E. A 'stupidity about world domination' module, focused on a few key dangerous areas like biochemistry, could potentially be implemented in most AIs without affecting performance at all. It wouldn't solve the problem entirely, but it would help mitigate risk.
Alternatively, if you want to 'make something that will stop AI from killing us' (presumably an AGI), you need to make sure that it can't kill us instead, and that could also be helped by deliberate flaws and ignorance. So make it an idiot savant at terminating AIs, but not at other things.
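To make the 'stupidity about world domination' module a bit more concrete, here is a minimal sketch of one crude way such deliberate ignorance could be approximated: screening flagged domains out of a training corpus before the model ever sees them. This is purely illustrative; HAZARD_KEYWORDS, is_hazardous, and filter_corpus are hypothetical placeholder names rather than any real library's API, and a real filter would need something far more robust than keyword matching.

```python
# Hypothetical sketch: drop documents touching flagged "dangerous domains"
# from a training corpus, so the resulting model stays deliberately weak there.

HAZARD_KEYWORDS = {
    "biochemistry": ["pathogen synthesis", "toxin production"],
    "cyberoffense": ["zero-day exploit", "privilege escalation"],
}

def is_hazardous(text: str) -> bool:
    """Crude keyword screen; a real system would use a trained classifier."""
    lowered = text.lower()
    return any(
        keyword in lowered
        for keywords in HAZARD_KEYWORDS.values()
        for keyword in keywords
    )

def filter_corpus(documents: list[str]) -> list[str]:
    """Keep only documents that avoid the flagged domains."""
    return [doc for doc in documents if not is_hazardous(doc)]

# Usage: train only on the filtered corpus, leaving the model ignorant
# in the flagged areas while its performance elsewhere is untouched.
corpus = [
    "Lab notes on toxin production in bacteria...",
    "A recipe for sourdough bread...",
]
safe_corpus = filter_corpus(corpus)  # keeps only the sourdough document
```

Whether anything like this could actually stop a capable general system from re-deriving the filtered knowledge is, of course, exactly the point under dispute in this thread.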
Buy the argument or don't, but this is a straw man.
Yeah, the first version will be a buggy mess, but the argument is that the first version that runs well enough to do anything will be debugged enough to be a threat. The mistake here is to claim that the 'first AGI' is going to be the final version; that's not what happens with software, and iteration, even if it's spread over a couple of years, is far faster than our realization of a potential problem. And the claim is that things will start going wrong only after enough bugs have been worked out, and by then it will be too late.
So, I think there is a threshold of intelligence and bug-free-ness (which I'll just call rationality) that will allow an AI to escape and attempt to attack humanity.
I also think there is a threshold of intelligence and rationality that could allow an AI to actually succeed in subjugating us all.
I believe that the second threshold is much, much higher than the first, and we would expect to see huge numbers of AI versions that pass the first threshold but not the second. If pre-alpha builds are intelligent enough to escape, they will be the first builds to attack.
Even if we're looking at released builds, though, those builds will only be debugged within specific domains. Nobody is going to debug the geopolitical abilities of an AI designed to build paperclips. So the fact that debugging occurs in one domain is no guarantee of success in any other.
Note: The below is all speculative. I'm much more interested in pushing back against your seeming confidence in your model than in saying I'm confident in the opposite. In fact, I think there are ways to avoid many of the failure modes, which safety researchers are pioneering now; I just don't think we should be at all confident they work, and we should be near-certain they won't happen by default.
That said, I don't agree that it's obvious that the two thresholds you mention are far apart on the relevant scale, though how exactly to construct the relevant scale is unclear. And even if they are far apart, there are reasons to worry.
The first point, that the window is likely narrow, is because near-human capability has turned out to be a very narrow band in many or most of the domains where ML has been successful. For example, moving from 'beat some good Go players' to 'unambiguously better than the best living players' took a few months.
The second point is that I think the jump from 'around human competence' to 'smarter than most/all humans' is plausibly closely related both to how much power we will end up giving systems and (partly as a consequence) to how likely they are to end up actually trying to attack in some non-trivial way. This point is based on my intuitive understanding of why very few humans attempt anything that will get them jailed: even psychopaths who don't actually care about the harm being caused wait until they are likely to get away with something. Lastly and relatedly, once humans reach a certain educational level, you don't need to explicitly train people to reason in specific domains; they find books and build inter-domain knowledge on their own. I don't see a clear reason to expect AGI to work differently once it is, in fact, generally capable at the level of smarter-than-almost-all-humans. And whether that gap is narrow or wide, and whether crossing it takes minutes or a decade, the critical concern is that we might not see misalignment of the most worrying kinds until after we are on the far end of the gap.
I think the OP's argument depends on the idea that 'Nobody is going to debug the geopolitical abilities of an AI designed to build paperclips. So the fact that debugging occurs in one domain is no guarantee of success in any other.' If AIs have human-level or above capacities in the domains relevant to forming an initial plan to take over the world and beginning that plan, but have subhuman capacities/bugs in the later stages of that plan, then, assuming that at least human-level capacities are needed in those later domains in order to succeed, the threshold could be pretty large: AIs could keep getting smarter at domains related to the initial stages of the plan, which are presumably closer to the distributions they have been trained on (e.g. social manipulation/text output to escape a box), while failing to make as much progress in the more out-of-distribution domains.
Part of my second point is that smart people figure out for themselves what they need to know in new domains, and by my definition of 'general intelligence' there is little reason to think an AGI will be different. The analogies to ANI with domain-specific knowledge that doesn't generalize well seem to ignore this, though I agree it's a reason to be slightly less worried that ANI systems could scale in ways that pose risks without developing generalized intelligence first.
I mostly agree with you that if we get AGI and not ANI, the AGI will be able to learn the skills relevant to taking over the world. However, I think that due to inductive biases and quasi-innate intuitions, different generally intelligent systems are differently able to learn different domains. For example, it is very difficult for autistic people (particularly severely autistic people) to learn social skills. Similarly, high-quality philosophical thinking seems to be basically impossible for most humans. Applying this to AGI, it might be very hard for an AGI to learn how to make long-term plans or to learn social skills.
Interesting perspective. Though, leaning on Cotra's recent post, if the first AGI is developed by iterations of reinforcement learning in different domains, it seems likely that it will develop a rather accurate view of the world, as that will give the highest rewards. This means the AGI will have high situational awareness; i.e., it will know that it's an AGI, and it will very likely know about human biases. I thus think it will also be aware that it contains mental bugs itself and may start actively trying to fix them (since that will be reinforced, as it gives higher rewards in the longer run). I thus think that we should expect it to contain a surprisingly low number of very general bugs, such as weird ways of thinking or false assumptions in its worldview. That's why I believe the first AGI will already be very capable, and smart enough to hide for a long time until it strikes and overthrows its owners.
Yeah, I guess another consequence of how bugs are distributed is that the methodology of AI development matters a lot. An AI that is trained and developed over huge numbers of different domains is far, far, far more likely to succeed at takeover than one trained for specific purposes such as solving math problems. So the HFDT from that post would definitely be of higher concern if it worked (although I'm skeptical that it would).
I do think that any method of training will still leave holes, however. For example, the scenario where HFDT is trained by watching how experts use a computer would leave out all the other, non-computer domains of expertise. So even if it were a perfect reasoner about all scientific, artistic, and political knowledge, you couldn't just shove it into a robot body and expect it to do a backflip on its first try, no matter how many backflipping manuals it had read. I think there will be sufficiently many out-of-domain problems to stymie world-domination attempts, at least initially.
I think a main difference of opinion I have with AI risk people is that I think subjugating all of humanity is an almost impossibly hard task, requiring a level of intelligence and perfection across a range of fields that is stupendously far above human level, and I don't think it's possible to reach that level without vast, vast amounts of empirical testing.
Agree that it depends a lot on the training procedure. However, I think that given high situational awareness, we should expect the AI to know its shortcomings very well.
So I agree that it won't be able to do a backflip on the first try. But it will know that it would likely fail, and thus it won't rely on plans that require backflips; or, if it does need backflips, it will find a way of learning them without being suspicious (e.g. by manipulating a human into training it to do backflips).
I think overthrowing humanity is certainly hard. But it still seems possible for a patient AGI that slowly accumulates wealth and power by exploiting human conflicts, getting involved in crucial economic processes, and potentially gaining control of military communication systems using deepfakes and the wealth and power it has accumulated. (And all this can be done by just interacting with a computer interface, as in Cotra's example.) It's also fairly likely that there are some exploits in the way humans work that we are not aware of, which the AGI would learn from being trained on tons of data, and which would make it even easier.
So overall, I agree the AGI will have bugs, but it will also know it likely has bugs and thus will be very careful with any attempts at overthrowing humanity.
So I think my most plausible scenario of AI success would be similar to yours: you build up wealth and power through some sucker corporation or small country that thinks it controls you, then use their R&D resources along with your intelligence to develop some form of world-destruction-level technology that can be deployed without resistance. I think this is orders of magnitude more likely to work than Yudkowsky's ridiculous 'make a nanofactory in a beaker from first principles' strategy.
I still think this plan is doomed to fail (for early AGI). It's multi-step, highly complicated, and requires interactions with a lot of humans, who are highly unpredictable. You really can't avoid 'backflip steps' in such a process. By that I mean there will be things it needs to do for which there is not enough data available to perfect them, so it just has to roll the dice. For example, there is no training set for 'running a secret globe-spanning conspiracy', so it will inevitably make mistakes there. If we discover it before it's ready to defeat us, it loses. Also, by the time it pulls the trigger on its plan, there will be other AGIs around, and other examples of failed attacks that put humanity on alert.
A key crux here seems to be your claim that AIs will attempt these plans before they have the relevant capacities because they are on short time scales. However, given enough time and patience, it seems clear to me that the AI could succeed simply by not taking risky actions that it knows it might mess up until it self-improves enough to be able to take them. The question then becomes how long the AI thinks it has until another AI that could dominate it is built, as well as how fast self-improvement is.
Narrow AIs have moved from buggy/mediocre to hyper-competent very quickly (months). If early AGIs are widely copied/escaped, the global resolve and coordination required to contain them would be unprecedented in breadth and speed.
I expect warning shots, and expect them to be helpful (vs no shots), but take very little comfort in that.
They've learned within months for certain problems where learning can be done at machine speeds, i.e. game-like problems where the AI can 'play against itself', or problems where huge amounts of data are available in machine-friendly format. But that isn't the case for every application. For example, developing self-driving cars up to perfection level has taken way, way longer than expected, partially because they have to deal with freak events that are outside the norm, so a lot more experience and data has to be built up, which takes human time. (Of course, humans are also not great at freak events, but remember we're aiming for perfection here.) I think most tasks involved in taking over the world will look a lot more like self-driving cars than playing Go, which inevitably means mistakes, and a lot of them.
I strongly agree with you on points one and two, though I'm not super confident on three. For me the biggest takeaway is that we should be putting more effort into attempts to instill 'false' beliefs which are safety-promoting and self-stable.
I could see this backfiring. What if instilling false beliefs just later led to the meta-belief that deception is useful for control?
That's a fair point; I'm reconsidering my original take.