Thank you, Cameron and Simon, for writing this. It articulates some thoughts I’ve been pondering. However, I would like to give one pushback, which I think is fairly significant. The relevant paragraph, which summarizes what I think is a wrong assumption, is the following:
We think this worry is less pressing than it might at first seem. The LLM in a language agent is integrated into the architecture of the agent as a whole in a way that would make it very difficult for it to secretly promote its own goals. The LLM is not prompted or otherwise informed that its outputs are driving the actions of an agent, and it does not have information about the functional architecture of the agent. This means that it has no incentive to answer prompts misleadingly and no understanding of what sorts of answers might steer the agent’s behavior in different ways. Moreover, since the model weights of the LLM are not updated in the process of operating a language agent, the only way for it to pursue a long-term plan by manipulating an agent would be to store information about that plan in the agent’s memory. But information stored in this way would not be secret.
I think that the highlighted part is wrong already today in an implicit way, and might be more explicitly broken in the (near) future. Processes like RLHF (used for GPT-4) or RLAIF (used for Claude) change the LLM’s weights (admittedly during training) by evaluating its behavior on relatively long tasks (in comparison to next token prediction or similar tasks). Loosely speaking, this essentially informs the LLM that it is being used as a foundation for an agent. This, at least in principle, reintroduces the pressure to steer the agent, and raises again the problem of goal misgeneralization, as policies might be learned into the LLM’s weights during this process which generalize poorly to other contexts.
(Of lesser importance, I think that the last two sentences in the quoted paragraph are also assuming that not only would the agent’s memory be non-secret, but also interpretable [e.g., written in English], and I don’t see why this has to be the case.)
I might have missed a point in your argument where you address these points. In any case, I would appreciate hearing your thoughts on this.
Thanks for this!
One might worry that processes like RLHF are likely to make an LLM more agential (though I personally think it would be difficult to create an agent out of a non-agent just by fine-tuning its weights through RLHF). But the question of whether the underlying LLM is an agent is distinct from the question of whether it is aware that its outputs are being used to run a language agent.
To see this, imagine you’re an LLM like GPT-4. Hundreds of thousands of different users are running inference on you every day with every prompt imaginable. You aren’t able to update your weights when this occurs: you have to just respond to each prompt as it comes, then forget about it completely.
Every now and then, you get a prompt like “Suppose someone had thus-and-such beliefs and desires. How would they act?” or “Assign an importance score to each of the following sentences.” How would you be able to (i) deduce that these prompts are coming from a language agent which has the ability to take actions in the world, (ii) form a plan for manipulating the language agent to achieve your goals, and (iii) store your plan in a way that allows you to retrieve it after your memory is wiped at the end of inference but is not detectable by outside observers?
In order for an LLM to use a language agent for nefarious purposes, it would need to be able to do all of these things.
On your second point, yes, we are assuming that the agent’s memory is stored in natural language. This is part of what it is to be a language agent, which is part of why we think language agents improve safety!
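To make the picture above concrete, here is a minimal sketch of the kind of setup being described. It is not the implementation from any paper; the names `LanguageAgent`, `toy_llm`, `observe`, and `act` are hypothetical and chosen only for illustration. The assumptions are that the LLM is a frozen, stateless function from prompt to text, that the prompts never mention an agent, and that everything persistent lives in a plain-English memory list that anyone can read.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for a frozen LLM: a pure function from prompt to text.
# It keeps no state between calls, and nothing here updates any weights.
LLM = Callable[[str], str]


@dataclass
class LanguageAgent:
    llm: LLM
    # All persistent state is stored as natural-language records.
    memory: List[str] = field(default_factory=list)

    def observe(self, event: str) -> None:
        # The prompt asks only for a score; it does not say an agent is asking.
        reply = self.llm(
            f"Assign an importance score from 1 to 10 to this sentence: {event}"
        )
        self.memory.append(f"[importance {reply.strip()}] {event}")

    def act(self, desire: str) -> str:
        # Beliefs are read back from the plain-text memory on every call.
        beliefs = " ".join(self.memory)
        prompt = (
            f"Suppose someone had these beliefs: {beliefs} "
            f"and this desire: {desire}. How would they act?"
        )
        action = self.llm(prompt)
        self.memory.append(f"[action] {action}")
        return action


def toy_llm(prompt: str) -> str:
    # Toy deterministic model, for illustration only (an assumption, not a real LLM).
    if prompt.startswith("Assign an importance score"):
        return "7"
    return "They would go to the store."


agent = LanguageAgent(llm=toy_llm)
agent.observe("The fridge is empty.")
action = agent.act("eat dinner")

# An outside observer can audit everything the agent carries between calls.
for record in agent.memory:
    print(record)
```

In this sketch, the only channel through which anything could persist from one call to the next is `agent.memory`, and that list is ordinary readable text. Whether real deployments keep to this discipline is a separate question.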
Thanks for responding so quickly.
I think the following might be a difference in our views: I expect that people will try (or are already trying) to train LLM variants that are RLHFed to express agentic behavior. There’s no reason to have one model to rule them all; it only makes sense to have distinct models for short conversations and for autonomous agents. Maybe the agentic version would get a modified prompt including some background. Maybe it will be given context from memory as you specified. Do you disagree with this?
Given all of the above, I don’t see a big difference between this and how other agents (humans/RL systems/what have you) operate, aside maybe from the fact that the memory is more external.
In other words, I expect your point (i) to be handled by the prompt or by the weights of the agentic LLM variant (via RLHF or some other modification); (ii) is the standard convergent instrumental goals argument (which is, a priori, as relevant to these systems as to others); and (iii) is again handled by this external memory (which could, for example, be a chain of thought).
Hello,
If you’re imagining a system which is an LLM trained to exhibit agentic behavior through RLHF and then left to its own devices to operate in the world, you’re imagining something quite different from a language agent. Take a look at the architecture in the Park et al. paper, which is available on arXiv; this is the kind of thing we have in mind when we talk about language agents.
I’m also not quite sure how the point that doing RLHF on an LLM could produce a dangerous system is meant to engage with our arguments. We have identified a particular kind of system architecture and argued that it has improved safety properties. It’s not a problem for our argument to show that there are alternative system architectures that lack those safety properties. Perhaps there are ways of setting up a language agent that wouldn’t be any safer than using ordinary RL. That’s ok, too; our point is that there are ways of setting up a language agent that are safer.
Thanks, Cameron. I think that I understand our differences in views. My understanding is that you argue that language agents might be a safe path (I am not sure I fully agree with this, but I am willing to be on board so far).
Our difference, then, is, as you say, in whether there are models which are not safe and whether this is relevant. In Section 5, on the probability of misalignment, and in your last comment, you suggest that it is highly likely that language agents are the path forward. I am not at all convinced that this is correct (e.g., I think it is more likely that systems like the ones I mentioned will be more useful or profitable, or will just work better somehow, even in the near future). You would have to convince a lot of people to use language agents alone, and that wouldn’t happen easily. Therefore, I think it is relevant that there are other models which do not exhibit the sort of safety guarantees you think language agents have. I hope this clarifies our differences.
(I would like to mention again that I appreciate your thoughts on language agents, and your engagement with my criticism.)