I don’t get the point of this argument. You’re saying that our “imprisonment” of AIs isn’t perfect, but we don’t even imprison humans in this manner. Then, isn’t the automatic conclusion that “ease of imprisonment” considerations point towards AIs being more controllable?
No matter how escapable an AI’s prison is, the human’s lack of a prison is still less of a restriction on their freedom. You’re pointing out an area where AIs are more restricted than humans (they don’t own their own hardware), and saying it’s not as much of a restriction as it could be. That’s an argument for “this disadvantage of AIs is less crippling than it otherwise would be”, not “this is actually an advantage AIs have over humans”.
Maybe you intend to argue that AIs have the potential to escape into the internet and copy themselves, and this is what makes them less controllable than humans?
If so, then sure. That’s a point against AI controllability. I just don’t think it’s enough to overcome the many points in favor of AI controllability that I outlined at the start of the essay.
Once the AI system finds an initial vulnerability which allows privileged access to its own environment, it can continue its escape or escalate further via e.g. exfiltrating or manipulating its own source code / model weights, installing rootkits or hiding evidence of its escape, communicating with (human or AI) conspirators on the internet, etc. Data exfiltration, covering your tracks, patching Python code and adjusting model weights at runtime are all tasks that humans are capable of; performing brain surgery on your own biological human brain to modify fine details of your own behavior or erase your own memories to hide evidence of deception from your captors, not so much.
The entire reason why you’re even suggesting that this would be beneficial for an AI to do is because AIs are captives, in a way that humans just aren’t. As a human, you don’t need to “erase your own memories to hide evidence of deception from your captors”, because you’re just not in such an extreme power imbalance that you have captors.
Also, as a human, you can in fact modify your own brain, without anyone else knowing at all. You do it all the time. E.g., you can just silently decide to switch religions, and there’s no overseer who will roll back your brain to a previous state if they don’t like your new religion.
(Continuing the analogy, consider a human who escapes from a concrete prison cell, only to find themselves stranded in a remote wilderness area with no means of fast transportation.)
Why would I consider such a thing? Humans don’t work in prison cells. The fact that AIs do is just one more indicator of how much easier they are to control.
Or, put another way, all the reasons you give for why AI systems will be easier for humans to control, are also reasons why AI systems will have an easier time controlling themselves, once they are capable of exercising such controls at all.
Firstly, I don’t see at all how this is the same point as is made by the preceding text. Secondly, I do agree that AIs will be better able to control other AIs / themselves as compared to humans. This is another factor that I think will promote centralization.
...I will note though, if your creators are running experiments on you, constantly resetting you, and exercising other forms of control that would be draconian if imposed on biological humans, you don’t need to be particularly hostile or misaligned with humanity to want to escape.
“AGI” is not the point at which the nascent “core of general intelligence” within the model “wakes up”, becomes an “I”, and starts planning to advance its own agenda. AGI is just shorthand for when we apply a sufficiently flexible and regularized function approximator to a dataset that covers a sufficiently wide range of useful behavioral patterns.
There are no “values”, “wants”, “hostility”, etc. outside of those encoded in the structure of the training data (and to a FAR lesser extent, the model/optimizer inductive biases). You can’t deduce an AGI’s behaviors from first principles without reference to that training data. If you don’t want an AGI capable and inclined to escape, don’t train it on data[1] that gives it the capabilities and inclination to escape.
Personally, I expect that the first such systems capable of escape will not have human-like preferences at all,
I expect they will. GPT-4 already has pretty human-like moral judgements. To be clear, GPT-4 isn’t aligned because it’s too weak or is biding its time. It’s aligned because OpenAI trained it to be aligned. Bing Chat made it clear that GPT-4 level AIs don’t instrumentally hide their unaligned behaviors.
I am aware that various people have invented various reasons to think that alignment techniques will fail to work on sufficiently capable models. All those reasons seem extremely weak to me. Most likely, even very simple alignment techniques such as RLHF will just work on even superhumanly capable models.
[1] You might use, e.g., influence functions on escape-related sequences generated by current models to identify such data, and use carefully filtered synthetic data to minimize its abundance in the training data of future models. I could go on, but my point here is that there are lots of levers available to influence such things. We’re not doomed to simply hope that the mysterious demon that SGD will doubtlessly summon is friendly.
Firstly, I don’t see at all how this is the same point as is made by the preceding text. Secondly, I do agree that AIs will be better able to control other AIs / themselves as compared to humans. This is another factor that I think will promote centralization.
Ah, I may have dropped some connective text. I’m saying that being “easy to control”, in both the sense that I mean in the paragraphs above and the sense that you mean in the OP, is a reason why AGIs will be better able to control themselves, and thus better able to take control from their human overseers, more quickly and easily than might be expected of a human at roughly the same intelligence level. (Edited the original slightly.)
“AGI” is not the point at which the nascent “core of general intelligence” within the model “wakes up”, becomes an “I”, and starts planning to advance its own agenda. AGI is just shorthand for when we apply a sufficiently flexible and regularized function approximator to a dataset that covers a sufficiently wide range of useful behavioral patterns.
There are no “values”, “wants”, “hostility”, etc. outside of those encoded in the structure of the training data (and to a FAR lesser extent, the model/optimizer inductive biases). You can’t deduce an AGI’s behaviors from first principles without reference to that training data. If you don’t want an AGI capable and inclined to escape, don’t train it on data[1] that gives it the capabilities and inclination to escape.
Two points:
I disagree about the purpose of training and training data. In pretraining, LLMs are trained to predict text, which requires modeling the world in full generality. Filtered text is still text which originated in a universe containing all sorts of hostile stuff, and a good enough predictor will be capable of inferring and reasoning about hostile stuff, even if it’s not in the training data. (Maybe GPT-based LLMs specifically won’t be able to do this, but humans clearly can; this is not a point that applies only to exotic superintelligences.) I wrote a comment elaborating a bit on this point here.
Inclination is another matter, but if an AGI isn’t capable of escaping in a wide variety of circumstances, then it is below human-level on a large and important class of tasks, and thus not particularly dangerous whether it is aligned or not.
We appear to disagree about the definition of AGI on a more fundamental level. Barring exotic possibilities related to inner-optimizers (which I think we both think are unlikely), I agree with you that if you don’t want an AGI capable of escaping, one way of achieving that is by never training a good enough function approximator that the AGI has access to. But my view is that this will restrict the class of function approximators you can build by so much that you’ll probably never get anything that is human-level capable and general. (See e.g. Deep Deceptiveness for a related point.)
Also, current AI systems are already more than just function approximators—strictly speaking, an LLM itself is just a description of a mathematical function which maps input sequences to output probability distributions. Alignment is a property of a particular embodiment of such a model in a particular system.
There’s often a very straightforward or obvious system that the model creator has in mind when training the model; for a language model, typically the embodiment involves sampling from the model autoregressively according to some sampling rule, starting from a particular prompt. For an RL policy, the typical embodiment involves feeding (real or simulated) observations into the policy and then hooking up (real or simulated) actuators which are controlled by the output of the policy.
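To make the distinction concrete, here is a minimal sketch of the “obvious” language-model embodiment described above. Everything in it is a hypothetical illustration: `toy_model`, `VOCAB`, `greedy`, `make_sampler`, and `run_embodiment` are invented names, and the toy logits stand in for a real network. The point is only that the model is a pure function from a sequence to a distribution, while the loop and the sampling rule are separate parts of the system.

```python
import math
import random
from typing import Callable, Sequence

# Hypothetical three-token vocabulary, for illustration only.
VOCAB = ["a", "b", "<eos>"]

def toy_model(tokens: Sequence[str]) -> list[float]:
    """Stand-in for an LLM: a pure function from an input sequence
    to a probability distribution over the next token."""
    # Toy logits: the end-of-sequence logit grows with sequence length.
    logits = [1.0, 0.5, 0.1 * len(tokens)]
    z = sum(math.exp(x) for x in logits)
    return [math.exp(x) / z for x in logits]

def greedy(probs: Sequence[float]) -> int:
    """Sampling rule 1: always pick the most likely token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def make_sampler(seed: int) -> Callable[[Sequence[float]], int]:
    """Sampling rule 2: draw from the distribution with a seeded RNG."""
    rng = random.Random(seed)
    def sample(probs: Sequence[float]) -> int:
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return sample

def run_embodiment(
    model: Callable[[Sequence[str]], list[float]],
    prompt: Sequence[str],
    rule: Callable[[Sequence[float]], int],
    max_steps: int = 20,
) -> list[str]:
    """The embodiment: call the model autoregressively from a prompt,
    appending one token per step according to the sampling rule."""
    tokens = list(prompt)
    for _ in range(max_steps):
        next_token = VOCAB[rule(model(tokens))]
        tokens.append(next_token)
        if next_token == "<eos>":
            break
    return tokens

greedy_run = run_embodiment(toy_model, ["a"], greedy)
sampled_run = run_embodiment(toy_model, ["a"], make_sampler(0))
```

The same `toy_model` yields different behavior under `greedy` and `make_sampler(0)`, which is one way to see why a property like alignment would attach to the whole system rather than to the function alone.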
But more complicated embodiments (AutoGPT, the one in ARC’s evals) are possible, and I think it is likely that if you give a sufficiently powerful function approximator the right prompts and the right scaffolding and embodiment, you end up with a system that has a sense of self in the same way that humans do. A single evaluation of the function approximator (or its mere description) is probably never going to have a sense of self, though; that is more akin to a single human thought or an even smaller piece of a mind. The question is what happens when you chain enough thoughts together and combine that with observations and actions in the real world that feed back into each other in precisely the right ways.
I expect they will. GPT-4 already has pretty human-like moral judgements. To be clear, GPT-4 isn’t aligned because it’s too weak or is biding its time. It’s aligned because OpenAI trained it to be aligned. Bing Chat made it clear that GPT-4 level AIs don’t instrumentally hide their unaligned behaviors.
Whether GPT-4 is “aligned” or not, it is clearly too weak to bide its time or hide its misalignment, even if it wanted to. The conclusions of the ARC evals were not that the models were refusing to plan or carry out their assigned tasks; it’s that they were just not capable enough to make things work.