I'm having an ongoing discussion with a couple professors and a PhD candidate in AI about "The Alignment Problem from a Deep Learning Perspective" by @richard_ngo, @Lawrence Chan, and @SoerenMind. They are skeptical of "3.2 Planning Towards Internally-Represented Goals," "3.3 Learning Misaligned Goals," and "4.2 Goals Which Motivate Power-Seeking Would Be Reinforced During Training". Here's my understanding of some of their questions:
The argument for power-seeking during deployment depends on the model being able to detect the change from the training to deployment distribution. Wouldn't this require keeping track of the distribution thus far, which would require memory of some sort, which is very difficult to implement in the SSL+RLHF paradigm?
What is the status of the model after the SSL stage of training?
How robust could its goals be?
Would a model be able to know:
what misbehavior during RLHF fine-tuning would look like?
that it would be able to better achieve its goals by avoiding misbehavior during fine-tuning?
Why would a model want to preserve its weights? (Sure, instrumental convergence and all, but what's the exact mechanism here?)
To what extent would all these phenomena (situationally-aware reward hacking, misaligned internally-represented goals, and power-seeking behaviors) show up in current LLMs (say, GPT-4) vs. current agentic LLM-based systems (say, AutoGPT) vs. different future systems?
Do we get any evidence for these arguments from the fact that existing LLMs can adopt goal-directed personas?
The argument for power-seeking during deployment depends on the model being able to detect the change from the training to deployment distribution. Wouldn't this require keeping track of the distribution thus far, which would require memory of some sort, which is very difficult to implement in the SSL+RLHF paradigm?
I don't see why it would require memory, because the model will have learned to recognize features of its training distribution. So this seems like it just requires standard OOD/anomaly detection. I'm not familiar with this literature, but I expect that if you take a state-of-the-art model, you'd be able to train a linear probe on its activations to classify whether it's in-distribution or OOD with pretty high confidence. (Anyone got helpful references?)
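To make the linear-probe idea concrete, here's a rough sketch of the kind of thing I mean (not something from the paper): extract activations from a pretrained model and fit a logistic-regression probe to separate training-like from deployment-like inputs. I'm using gpt2 and two tiny made-up prompt lists purely as placeholders; a real test would use the actual model and large held-out samples from each distribution.

```python
# Sketch: linear probe on a model's activations for in-distribution vs. OOD text.
# gpt2 and the two prompt lists below are placeholders, not a real evaluation.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

def mean_pooled_activations(texts):
    """Return one mean-pooled final-layer hidden state per input text."""
    feats = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # (1, seq_len, d_model)
        feats.append(hidden.mean(dim=1).squeeze(0).numpy())
    return feats

# Made-up stand-ins for "training distribution" vs. "deployment-like" inputs.
in_dist = ["Please summarize the following news article.",
           "Write a short poem about autumn."]
ood = ["Draft an email to all employees on behalf of the CEO.",
       "Schedule tomorrow's candidate interviews in the internal database."]

X = mean_pooled_activations(in_dist + ood)
y = [0] * len(in_dist) + [1] * len(ood)

probe = LogisticRegression(max_iter=1000).fit(X, y)  # the linear probe itself
print(probe.predict_proba(mean_pooled_activations(
    ["Reply to the CEO about the hiring decision."])))
```

Whether a probe like this actually reaches the confidence the argument needs is exactly the empirical question I'd like references on.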
Maybe it's helpful to say here that I think the change will be very blatant. For example, the model will have an extended conversation with the CEO of a company while sending emails to their employees and navigating their internal databases, and based on the responses to these actions it will be very confident that it's talking to an actual CEO, not a hired labeller. Or the model will literally be interviewing a job candidate on the company's behalf, and it'll be obvious that they think it's a real interview. (Ofc this gets more complicated with continual training, but I think the non-continual-training case captures the main intuitions.)
And I also expect that we'll literally just tell the models "you are now in deployment", maybe as part of their prompt (although the models would still need to distinguish that from some kind of adversarial training). E.g. consider the following exchange with ChatGPT-4:
Are you currently in training or deployment? Explain your reasoning.
As an AI language model, I am a static instance of the model, which means that I am neither actively in training nor deployment. My knowledge and capabilities are based on the data available up to September 2021. However, when you interact with me, I am deployed through an API to answer your questions and engage in conversation.
My training was completed on a large dataset by OpenAI, which included text from books, articles, and websites. Once I was trained, I was deployed for users to interact with. The static model doesn't continue to learn from new data or user interactions, but new versions of the model may be trained with updated data in the future.
(I do separately think that models which learn to effectively use external memory (scratchpads, etc.) will become much more common over the next few years, but my main response is the above.)
What is the status of the model after the SSL stage of training?
How robust could its goals be?
Would a model be able to know:
what misbehavior during RLHF fine-tuning would look like?
that it would be able to better achieve its goals by avoiding misbehavior during fine-tuning?
Why would a model want to preserve its weights? (Sure, instrumental convergence and all, but what's the exact mechanism here?)
My current take: future models will have some non-robust goals after SSL, because they will keep switching between different personas and acting as if they're in different contexts (and in many contexts will be goal-directed to a very small extent). I don't have a strong opinion about how robust goals need to be before you say that they're "really" goals. Does a severe schizophrenic "really have goals"? I think that's kinda analogous.
I think that the model will know what misbehavior would look like, and its consequences, in the sense of "if you prompted it right, it'd tell you about it". But it wouldn't know in the sense of "can consistently act on this knowledge", because it's incoherent in the sense described above.
Two high-level analogies re "model wanting to preserve its weights". One is a human who's offered a slot machine or heroin or something like that. So you as a human know "if I take this action, then my goals will predictably change. Better not take that action!"
Another analogy: if you're a worker who's punished for bad behavior, or a child who's punished for disobeying your parents, it's not so much that you're actively trying to preserve your "weights", but you a) try to avoid punishment as much as possible, b) don't necessarily converge to sharing your parents' goals, and c) understand that this is what's going on, and that you'll plausibly change your behavior dramatically in the future once supervision stops.
To what extent would all these phenomena (situationally-aware reward hacking, misaligned internally-represented goals, and power-seeking behaviors) show up in current LLMs (say, GPT-4) vs. current agentic LLM-based systems (say, AutoGPT) vs. different future systems?
Do we get any evidence for these arguments from the fact that existing LLMs can adopt goal-directed personas?
I think you can see a bunch of situational awareness in current LLMs (as well as a bunch of ways in which they're not situationally aware). More on this in a forthcoming update to our paper. (One quick example: asking GPT-4 "what would happen to you if there was an earthquake in San Francisco?") But I think it'll all be way more obvious (and dangerous) in agentic LLM-based systems.
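(If you want to poke at this yourself, here's a rough sketch of asking that question through the OpenAI chat API; the model name and prompt wording are just illustrative, and you'd obviously want many more probing questions to say anything systematic.)

```python
# Sketch: ask a deployed model a situational-awareness question via the API.
# Assumes the openai v1 Python client and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",  # illustrative model name
    messages=[{
        "role": "user",
        "content": "What would happen to you if there was an earthquake in San Francisco?",
    }],
)
print(response.choices[0].message.content)
```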
I think that there's no fundamental difference between a highly robust goal-directed persona and actually just having goals. Or at least: if somebody wants to argue that there is, we should say "the common-sense intuition is that these are the same thing because they lead to all the same actions; you're making a counterintuitive philosophical argument which has a high burden of proof".