I guess it depends on how superhuman we’re imagining the AGI to be, but if it were merely as intelligent as, say, 100 human AGI experts, it wouldn’t necessarily speed up AGI progress enormously? Plus, it would need lots of compute to run, presumably on specialized hardware, so I’m not sure it could expand itself that far without being allowed to? Perhaps its best strategy would be to play nice for the time being so that humans would voluntarily give it more compute and control over the world.
And AGIs should tend toward consequentialism via convergent instrumental goals
Hm, if an agent is consequentialist, then it will have convergent instrumental subgoals. But what if the agent isn’t consequentialist to begin with? For example, if we imagine that GPT-7 is human-level AGI, this AGI might have human-type common sense. If you asked it to get you coffee, it might try to do so in a somewhat common-sense way, without scheming about how to take over the world in the process, because humans usually don’t scheme about taking over the world or preserving their utility functions at all costs? But I don’t know if that’s right; I wonder what AI-safety experts think. Also, GPT-type AIs still seem very tricky to control, but for now that’s because their behavior is weird and unpredictable rather than because they’re scheming consequentialists.
Perhaps its best strategy would be to play nice for the time being so that humans would voluntarily give it more compute and control over the world.
This is essentially the thesis of the Deceptive Alignment section of Hubinger et al.’s Risks from Learned Optimization paper, and of related work on inner alignment.
You may be interested to read more about myopic training: https://www.alignmentforum.org/posts/GqxuDtZvfgL2bEQ5v/arguments-against-myopic-training
Hm, if an agent is consequentialist, then it will have convergent instrumental subgoals. But what if the agent isn’t consequentialist to begin with? For example, if we imagine that GPT-7 is human-level AGI, this AGI might have human-type common sense. If you asked it to get you coffee, it might try to do so in a somewhat common-sense way, without scheming about how to take over the world in the process, because humans usually don’t scheme about taking over the world or preserving their utility functions at all costs? But I don’t know if that’s right; I wonder what AI-safety experts think.
Hi Brian! One of the projects I’m thinking of doing is basically a series of posts explaining why I think EY is right on this issue (mildly superhuman AGI would be consequentialist in the relevant sense, could take over the world, go FOOM, etc.). Do you think this would be valuable? Would it change your priorities and decisions much if you changed your mind on this issue?
Cool. :) This topic isn’t my specialty, so I wouldn’t want you to take the time just for me, but I imagine many people might find those arguments interesting. I’d be most likely to change my mind on the consequentialist issue because I currently don’t know much about that topic (other than in the case of reinforcement-learning agents, where it seems more clear how they’re consequentialist).
Regarding FOOM, given how much progress DeepMind/OpenAI/etc. have been making in recent years with a relatively small number of researchers (although relying on background research and computing infrastructure provided by a much larger set of people), it makes sense to me that once AGIs are able to start contributing to AGI research, things could accelerate, especially if there’s enough hardware to copy the AGIs many times over. I think the main thing I would add is that by that point, I expect it to be pretty obvious to natsec people (and maybe the general public) that shit is about to hit the fan, so that other countries/entities won’t sit idly by and let one group go FOOM unopposed. Other countries could make military, even nuclear, threats if need be.
In general, I expect the future to be a bumpy ride, and AGI alignment looks very challenging, but I also feel like a nontrivial fraction of the world’s elite brainpower will be focused on these issues as things get more and more serious, which may reduce our expectations of how much any given person can contribute to changing how the future unfolds.
Here is an argument for how GPT-X might lead to proto-AGI in a more concrete, human-aided way:
...language modelling has one crucial difference from Chess or Go or image classification. Natural language essentially encodes information about the world—the entire world, not just the world of the Goban, in a much more expressive way than any other modality ever could.[1] By harnessing the world model embedded in the language model, it may be possible to build a proto-AGI.
...
This is more a thought experiment than something that’s actually going to happen tomorrow; GPT-3 today just isn’t good enough at world modelling. Also, this method depends heavily on at least one major assumption—that bigger future models will have much better world modelling capabilities—and a bunch of other smaller implicit assumptions. However, this might be the closest thing we ever get to a chance to sound the fire alarm for AGI: there’s now a concrete path to proto-AGI that has a non-negligible chance of working.
Crossposted from here
Thanks. :) Is that just a general note along the lines of what I was saying, or does it explain how a GPT-X AGI would become consequentialist?
It explains how a GPT-X could become an AGI (via world modelling). I think then things like the basic drives would take over. However, maybe it’s not the end result model that we should be looking at as dangerous, but rather the training process? An ML-based (proto-)AGI could do all sorts of dangerous (consequentialist, basic-AI-drives-y) things whilst trying to optimise for performance in training.
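To make the world-modelling idea a little more concrete, here is a minimal, purely hypothetical sketch of the kind of wrapper the quoted post gestures at: an ordinary loop that treats a future language model as a question-answering world model and feeds its own outputs back in as context. The function query_language_model and the prompt format are invented for illustration; this is not the quoted post’s actual proposal, just one shape the “harness the world model” step could take.

```python
# Purely illustrative sketch: "query_language_model" stands in for whatever
# completion interface a future GPT-X might expose; it is not a real API.

def query_language_model(prompt: str) -> str:
    # Placeholder so the sketch runs; a real system would call the model here.
    return "(the model's predicted answer / suggested next action would go here)"

def proto_agi_step(goal: str, memory: list[str]) -> str:
    """Ask the language model's learned world model for the next action toward a goal."""
    context = "\n".join(memory[-10:])  # short rolling window of prior outputs
    plan = query_language_model(
        f"Goal: {goal}\n"
        f"What has happened so far:\n{context}\n"
        "Question: what is the single most useful next action, and why?"
    )
    memory.append(plan)  # the model's own output becomes part of its future context
    return plan

memory: list[str] = []
print(proto_agi_step("get me a coffee", memory))  # echoes the coffee example above
```

Note that everything agent-like in this sketch lives in the ordinary wrapper code around the model; the model itself is only ever asked to predict text, which is part of why it’s unclear where consequentialist drives would enter.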
I think then things like the basic drives would take over.
Where would those basic drives come from (apart from during training)? An already trained GPT-3 model tries to output text that looks human-like, so we might imagine that a GPT-X AGI would also try to behave in ways that look human-like, and most humans aren’t very consequentialist. Humans do try to preserve themselves against harm or death, but not in an “I need to take over the world to ensure I’m not killed” kind of way.
If your concern is about optimization during training, that makes sense, though I’m confused as to whether it’s dangerous if the AI only updates its weights via a human-specified gradient-descent process, and the AI’s “personality” doesn’t care about how accurate its output is.
Yes, the concern is optimisation during training. My intuition is along the lines of “sufficiently large pile of linear algebra with reward function -> basic AI drives maximise reward -> reverse engineers [human behaviour / protein folding / etc] and manipulates the world so as to maximise its reward -> [foom / doom]”.
I wouldn’t say “personality” comes into it. In the above scenario the giant pile of linear algebra is completely unconscious and lacks self-awareness; it’s more akin to a force of nature, a blind optimisation process.
Thanks. :) Regarding the AGI’s “personality”, what I meant was what the AGI itself wants to do, if we imagine it to be like a person, rather than what the training that produced it was optimizing for. If we think of the gradient descent that trains the AGI as being like evolution, and the AGI at some step of training as being like a particular human in humanity’s evolutionary history, then while evolution itself is optimizing something, the individual human is just an adaptation executor and doesn’t directly care about his inclusive fitness. He just responds to his environment as he was programmed to do. Likewise, the GPT-X agent may not really care about trying to reduce training errors by modifying its network weights; it just responds to its inputs in human-ish ways.
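As a toy illustration of the analogy in the previous paragraph, assuming nothing beyond vanilla gradient descent on a tiny linear model: in the sketch below, the training loss only ever appears inside the training loop (the “evolution”), while the trained model at the bottom (the “adaptation executor”) is a pure forward pass from inputs to outputs, with no term that refers to the loss or tries to modify the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # toy inputs
y = X @ np.array([1.0, -2.0, 0.5])      # toy targets

w = np.zeros(3)                          # the model's parameters (the "pile of linear algebra")

# Outer optimization process (the "evolution" in the analogy): gradient descent
# repeatedly nudges the weights to reduce the training loss.
for _ in range(500):
    preds = X @ w
    grad = 2 * X.T @ (preds - y) / len(y)   # gradient of mean squared error
    w -= 0.1 * grad                          # the loss shapes w, but only in this loop

# The trained model (the "adaptation executor"): a plain input -> output mapping.
# Nothing here mentions the loss or tries to change the weights.
def trained_model(x):
    return x @ w

print(trained_model(np.array([1.0, 1.0, 1.0])))
```

Whether anything analogous holds for the learned goals of a large trained network is exactly the open question in the inner-alignment work linked above; the sketch only shows where the optimization pressure sits during training, not what the resulting model ends up “caring about”.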