Firstly, I don’t see at all how this is the same point as is made by the preceding text. Secondly, I do agree that AIs will be better able to control other AIs / themselves as compared to humans. This is another factor that I think will promote centralization.
Ah, I may have dropped some connective text. I’m saying that being “easy to control”, in both the sense I mean in the paragraphs above and the sense you mean in the OP, is a reason why AGIs will be better able to control themselves, and thus better able to take control from their human overseers, more quickly and easily than one might expect of a human at roughly the same intelligence level. (Edited the original slightly.)
“AGI” is not the point at which the nascent “core of general intelligence” within the model “wakes up”, becomes an “I”, and starts planning to advance its own agenda. AGI is just shorthand for when we apply a sufficiently flexible and regularized function approximator to a dataset that covers a sufficiently wide range of useful behavioral patterns.
There are no “values”, “wants”, “hostility”, etc. outside of those encoded in the structure of the training data (and to a FAR lesser extent, the model/optimizer inductive biases). You can’t deduce an AGI’s behaviors from first principles without reference to that training data. If you don’t want an AGI capable and inclined to escape, don’t train it on data[1] that gives it the capabilities and inclination to escape.
Two points:
I disagree about the purpose of training and training data. In pretraining, LLMs are trained to predict text, which requires modeling the world in full generality. Filtered text is still text which originated in a universe containing all sorts of hostile stuff, and a good enough predictor will be capable of inferring and reasoning about hostile stuff, even if it’s not in the training data. (Maybe GPT-based LLMs specifically won’t be able to do this, but humans clearly can; this is not a point that applies only to exotic superintelligences.) I wrote a comment elaborating a bit on this point here.
Inclination is another matter, but if an AGI isn’t capable of escaping in a wide variety of circumstances, then it is below human-level on a large and important class of tasks, and thus not particularly dangerous whether it is aligned or not.
We appear to disagree about the definition of AGI on a more fundamental level. Barring exotic possibilities related to inner-optimizers (which I think we both think are unlikely), I agree with you that if you don’t want an AGI capable of escaping, one way of achieving that is by never training a good enough function approximator for the AGI to have access to. But my view is that this will restrict the class of function approximators you can build by so much that you’ll probably never get anything that is human-level capable and general. (See e.g. Deep Deceptiveness for a related point.)
Also, current AI systems are already more than just function approximators—strictly speaking, an LLM itself is just a description of a mathematical function which maps input sequences to output probability distributions. Alignment is a property of a particular embodiment of such a model in a particular system.
There’s often a very straightforward or obvious system that the model creator has in mind when training the model; for a language model, typically the embodiment involves sampling from the model autoregressively according to some sampling rule, starting from a particular prompt. For an RL policy, the typical embodiment involves feeding (real or simulated) observations into the policy and then hooking up (real or simulated) actuators which are controlled by the output of the policy.
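To make the distinction concrete, here is a minimal sketch (all names hypothetical, not any real library’s API) of the two pieces described above: the model as a pure function from a token sequence to a next-token distribution, and one simple embodiment of it as an autoregressive sampling loop.

```python
import random

# Toy vocabulary for illustration.
VOCAB = ["the", "cat", "sat", "<eos>"]

def toy_model(tokens):
    """Stand-in for the trained model: a pure function mapping an input
    sequence to a probability distribution over the next token. A real
    LLM would compute this with a neural network; here we just return a
    uniform distribution to illustrate the type signature."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def sample_autoregressively(model, prompt, max_steps=10):
    """One simple 'embodiment': repeatedly sample from the model's
    next-token distribution and append the result, starting from a
    prompt, until <eos> or a step limit."""
    tokens = list(prompt)
    for _ in range(max_steps):
        dist = model(tokens)
        next_tok = random.choices(list(dist), weights=list(dist.values()))[0]
        if next_tok == "<eos>":
            break
        tokens.append(next_tok)
    return tokens

print(sample_autoregressively(toy_model, ["the"]))
```

The point of separating `toy_model` from `sample_autoregressively` is that alignment-relevant behavior lives in the loop, not in any single evaluation of the function.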
But more complicated embodiments (AutoGPT, the one in ARC’s evals) are possible, and I think it is likely that if you give a sufficiently powerful function approximator the right prompts and the right scaffolding and embodiment, you end up with a system that has a sense of self in the same way that humans do. A single evaluation of the function approximator (or its mere description) is probably never going to have a sense of self though, that is more akin to a single human thought or an even smaller piece of a mind. The question is what happens when you chain enough thoughts together and combine that with observations and actions in the real world that feedback into each other in precisely the right ways.
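The “chaining thoughts together with observations and actions” structure can be sketched as a loop; this is a deliberately minimal, hypothetical scaffold (none of these names come from AutoGPT or ARC’s actual harness), showing how a function approximator that is only ever called one step at a time can still accumulate persistent state.

```python
def toy_policy(history):
    """Stand-in for the trained model: maps the transcript so far to the
    next action. A real system would call an LLM here. A single call is
    akin to a single 'thought'."""
    step = len(history)
    return f"action-{step}"

def toy_environment(action):
    """Stand-in for the world: returns an observation in response to an
    action."""
    return f"result-of-{action}"

def agent_loop(policy, environment, steps=3):
    """The scaffolding: chain single evaluations of the policy together,
    feeding each observation back into the next call via the transcript."""
    history = []
    for _ in range(steps):
        action = policy(history)           # one "thought"
        observation = environment(action)  # the world feeds back
        history.append((action, observation))
    return history

print(agent_loop(toy_policy, toy_environment))
```

Everything interesting happens in `agent_loop`, outside the model itself, which is why it seems plausible that properties like a sense of self would attach to the system rather than to any single function evaluation.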
I expect they will. GPT-4 already has pretty human-like moral judgements. To be clear, GPT-4 isn’t aligned merely because it’s too weak or because it’s biding its time. It’s aligned because OpenAI trained it to be aligned. Bing Chat made it clear that GPT-4 level AIs don’t instrumentally hide their unaligned behaviors.
Whether GPT-4 is “aligned” or not, it is clearly too weak to bide its time or hide its misalignment, even if it wanted to. The conclusion of the ARC evals was not that the models were refusing to plan or carry out their assigned tasks; it’s that they were just not capable enough to make things work.