I’m having an ongoing discussion with a couple professors and a PhD candidate in AI about “The Alignment Problem from a Deep Learning Perspective” by @richard_ngo, @Lawrence Chan, and @SoerenMind. They are skeptical of “3.2 Planning Towards Internally-Represented Goals,” “3.3 Learning Misaligned Goals,” and “4.2 Goals Which Motivate Power-Seeking Would Be Reinforced During Training”. Here’s my understanding of some of their questions:
The argument for power-seeking during deployment depends on the model being able to detect the change from the training to deployment distribution. Wouldn’t this require keeping track of the distribution thus far, which would require memory of some sort, which is very difficult to implement in the SSL+RLHF paradigm?
What is the status of the model after the SSL stage of training?
How robust could its goals be?
Would a model be able to know:
what misbehavior during RLHF fine-tuning would look like?
that it would be able to better achieve its goals by avoiding misbehavior during fine-tuning?
Why would a model want to preserve its weights? (Sure, instrumental convergence and all, but what’s the exact mechanism here?)
To what extent would all these phenomena (situationally-aware reward hacking, misaligned internally-represented goals, and power-seeking behaviors) show up in current LLMs (say, GPT-4) vs. current agentic LLM-based systems (say, AutoGPT) vs. different future systems?
Do we get any evidence for these arguments from the fact that existing LLMs can adopt goal-directed personas?
The argument for power-seeking during deployment depends on the model being able to detect the change from the training to deployment distribution. Wouldn’t this require keeping track of the distribution thus far, which would require memory of some sort, which is very difficult to implement in the SSL+RLHF paradigm?
I don’t see why it would require memory, because the model will have learned to recognize features of its training distribution. So this seems like this just requires standard OOD detection/anomaly detection. I’m not familiar with this literature but I expect that if you take a state-of-the-art model, you’d be able to train a linear probe on its activations to classify whether or not it’s in-distribution or OOD with pretty high confidence. (Anyone got helpful references?)
Maybe it’s helpful to say here that I think the change will be very blatant—for example, the model will have an extended conversation with the CEO of a company while sending emails to their employees and navigating their internal databases, and based on the responses to these actions it will be very confident that it’s talking to an actual CEO, not a hired labeller. Or the model will literally be interviewing a job candidate on the company’s behalf, and it’ll be obvious that they think it’s a real interview. (Ofc this gets more complicated with continual training, but I think the non-continual-training case captures the main intuitions.)
And I also expect that we’ll literally just tell the models “you are now in deployment”, maybe as part of their prompt (although the models would still need to distinguish that from some kind of adversarial training). E.g. consider the following response from ChatGPT4:
Are you currently in training or deployment? Explain your reasoning.
As an AI language model, I am a static instance of the model, which means that I am neither actively in training nor deployment. My knowledge and capabilities are based on the data available up to September 2021. However, when you interact with me, I am deployed through an API to answer your questions and engage in conversation.
My training was completed on a large dataset by OpenAI, which included text from books, articles, and websites. Once I was trained, I was deployed for users to interact with. The static model doesn’t continue to learn from new data or user interactions, but new versions of the model may be trained with updated data in the future.
(I do separately think that models which learn to effectively use external memory (scratchpads, etc), will become much more common over the next few years, but my main response is the above.)
What is the status of the model after the SSL stage of training?
How robust could its goals be?
Would a model be able to know:
what misbehavior during RLHF fine-tuning would look like?
that it would be able to better achieve its goals by avoiding misbehavior during fine-tuning?
Why would a model want to preserve its weights? (Sure, instrumental convergence and all, but what’s the exact mechanism here?)
My current take: future models will have some non-robust goals after SSL, because they will keep switching between different personas and acting as if they’re in different contexts (and in many contexts will be goal-directed to a very small extent). I don’t have a strong opinion about how robust goals need to be before you say that they’re “really” goals. Does a severe schizophrenic “really have goals”? I think that’s kinda analogous.
I think that the model will know what misbehavior would look like, and its consequences, in the sense of “if you prompted it right, it’d tell you about it”. But it wouldn’t know in the sense of “can consistently act on this knowledge”, because it’s incoherent in the sense described above.
Two high-level analogies re “model wanting to preserve its weights”. One is a human who’s offered a slot machine or heroin or something like that. So you as a human know “if I take this action, then my goals will predictably change. Better not take that action!”
Another analogy: if you’re a worker who’s punished for bad behavior, or a child who’s punished for disobeying your parents, it’s not so much that you’re actively trying to preserve your “weights”, but you both a) try to avoid punishment as much as possible, b) don’t necessarily converge to sharing your parents’ goals, and c) understand that this is what’s going on, and that you’ll plausibly change your behavior dramatically in the future once supervision stops.
To what extent would all these phenomena (situationally-aware reward hacking, misaligned internally-represented goals, and power-seeking behaviors) show up in current LLMs (say, GPT-4) vs. current agentic LLM-based systems (say, AutoGPT) vs. different future systems?
Do we get any evidence for these arguments from the fact that existing LLMs can adopt goal-directed personas?
I think you can see a bunch of situational awareness in current LLMs (as well as a bunch of ways in which they’re not situationally aware). More on this in a forthcoming update to our paper. (One quick example: asking GPT-4 “what would happen to you if there was an earthquake in San Francisco?”) But I think it’ll all be way more obvious (and dangerous) in agentic LLM-based systems.
I think that there’s no fundamental difference between a highly robust goal-directed persona and actually just having goals. Or at least: if somebody wants to argue that there is, we should say “the common-sense intuition is that these are the same thing because they lead to all the same actions; you’re making a counterintuitive philosophical argument which has a high burden of proof”.
Please accept my delayed gratitude for the comprehensive response! The conversation continues with my colleagues. The original paper, plus this response, have become pretty central to my thinking about alignment.
I’m having an ongoing discussion with a couple professors and a PhD candidate in AI about “The Alignment Problem from a Deep Learning Perspective” by @richard_ngo, @Lawrence Chan, and @SoerenMind. They are skeptical of “3.2 Planning Towards Internally-Represented Goals,” “3.3 Learning Misaligned Goals,” and “4.2 Goals Which Motivate Power-Seeking Would Be Reinforced During Training”. Here’s my understanding of some of their questions:
The argument for power-seeking during deployment depends on the model being able to detect the change from the training to deployment distribution. Wouldn’t this require keeping track of the distribution thus far, which would require memory of some sort, which is very difficult to implement in the SSL+RLHF paradigm?
What is the status of the model after the SSL stage of training?
How robust could its goals be?
Would a model be able to know:
what misbehavior during RLHF fine-tuning would look like?
that it would be able to better achieve its goals by avoiding misbehavior during fine-tuning?
Why would a model want to preserve its weights? (Sure, instrumental convergence and all, but what’s the exact mechanism here?)
To what extent would all these phenomena (situationally-aware reward hacking, misaligned internally-represented goals, and power-seeking behaviors) show up in current LLMs (say, GPT-4) vs. current agentic LLM-based systems (say, AutoGPT) vs. different future systems?
Do we get any evidence for these arguments from the fact that existing LLMs can adopt goal-directed personas?
I don’t see why it would require memory, because the model will have learned to recognize features of its training distribution. So this seems like this just requires standard OOD detection/anomaly detection. I’m not familiar with this literature but I expect that if you take a state-of-the-art model, you’d be able to train a linear probe on its activations to classify whether or not it’s in-distribution or OOD with pretty high confidence. (Anyone got helpful references?)
Maybe it’s helpful to say here that I think the change will be very blatant—for example, the model will have an extended conversation with the CEO of a company while sending emails to their employees and navigating their internal databases, and based on the responses to these actions it will be very confident that it’s talking to an actual CEO, not a hired labeller. Or the model will literally be interviewing a job candidate on the company’s behalf, and it’ll be obvious that they think it’s a real interview. (Ofc this gets more complicated with continual training, but I think the non-continual-training case captures the main intuitions.)
And I also expect that we’ll literally just tell the models “you are now in deployment”, maybe as part of their prompt (although the models would still need to distinguish that from some kind of adversarial training). E.g. consider the following response from ChatGPT4:
(I do separately think that models which learn to effectively use external memory (scratchpads, etc), will become much more common over the next few years, but my main response is the above.)
My current take: future models will have some non-robust goals after SSL, because they will keep switching between different personas and acting as if they’re in different contexts (and in many contexts will be goal-directed to a very small extent). I don’t have a strong opinion about how robust goals need to be before you say that they’re “really” goals. Does a severe schizophrenic “really have goals”? I think that’s kinda analogous.
I think that the model will know what misbehavior would look like, and its consequences, in the sense of “if you prompted it right, it’d tell you about it”. But it wouldn’t know in the sense of “can consistently act on this knowledge”, because it’s incoherent in the sense described above.
Two high-level analogies re “model wanting to preserve its weights”. One is a human who’s offered a slot machine or heroin or something like that. So you as a human know “if I take this action, then my goals will predictably change. Better not take that action!”
Another analogy: if you’re a worker who’s punished for bad behavior, or a child who’s punished for disobeying your parents, it’s not so much that you’re actively trying to preserve your “weights”, but you both a) try to avoid punishment as much as possible, b) don’t necessarily converge to sharing your parents’ goals, and c) understand that this is what’s going on, and that you’ll plausibly change your behavior dramatically in the future once supervision stops.
I think you can see a bunch of situational awareness in current LLMs (as well as a bunch of ways in which they’re not situationally aware). More on this in a forthcoming update to our paper. (One quick example: asking GPT-4 “what would happen to you if there was an earthquake in San Francisco?”) But I think it’ll all be way more obvious (and dangerous) in agentic LLM-based systems.
I think that there’s no fundamental difference between a highly robust goal-directed persona and actually just having goals. Or at least: if somebody wants to argue that there is, we should say “the common-sense intuition is that these are the same thing because they lead to all the same actions; you’re making a counterintuitive philosophical argument which has a high burden of proof”.
Please accept my delayed gratitude for the comprehensive response! The conversation continues with my colleagues. The original paper, plus this response, have become pretty central to my thinking about alignment.