If you’re imagining a system which is an LLM trained to exhibit agentic behavior through RLHF and then left to its own devices to operate in the world, you’re imagining something quite different from a language agent. Take a look at the architecture in the Park et al. paper, which is available on ArXiv — this is the kind of thing we have in mind when we talk about language agents.
I’m also not quite sure how the point about how doing RLHF on an LLM could make a dangerous system is meant to engage with our arguments. We have identified a particular kind of system architecture and argued that it has improved safety properties. It’s not a problem for our argument to show that there are alternative system architectures that lack those safety properties. Perhaps there are ways of setting up a language agent that wouldn’t be any safer than using ordinary RL. That’s ok, too — our point is that there are ways of setting up a language agent that are safer.
Thanks Cameron. I think that I understand our differences in views. My understanding is that you argue that language agents might be a safe path (I am not sure I fully agree with this, but I am willing to be on board so far).
Our difference then is, as you say, in whether there are models which are not safe and whether this is relevant. In Section 5, on the probability of misalignment, and in your last comment, you suggest that it is highly likely that language agents are the path forward. I am not at all convinced that this is correct (e.g., I think that it is more likely that systems like I mentioned will be more useful/profitable or just work better somehow, even in the near future) - you would have to convince a lot of people to use language agents alone, and that wouldn’t happen easily. Therefore, I think that it is relevant that there are other models which do not exhibit the sort of safety guarantees you think language agents have. Hope this clears our differences.
(I would like to mention again that I appreciate your thoughts on language agents, and your engagement with my criticism.)
Hello,
If you’re imagining a system which is an LLM trained to exhibit agentic behavior through RLHF and then left to its own devices to operate in the world, you’re imagining something quite different from a language agent. Take a look at the architecture in the Park et al. paper, which is available on ArXiv — this is the kind of thing we have in mind when we talk about language agents.
I’m also not quite sure how the point about how doing RLHF on an LLM could make a dangerous system is meant to engage with our arguments. We have identified a particular kind of system architecture and argued that it has improved safety properties. It’s not a problem for our argument to show that there are alternative system architectures that lack those safety properties. Perhaps there are ways of setting up a language agent that wouldn’t be any safer than using ordinary RL. That’s ok, too — our point is that there are ways of setting up a language agent that are safer.
Thanks Cameron. I think that I understand our differences in views. My understanding is that you argue that language agents might be a safe path (I am not sure I fully agree with this, but I am willing to be on board so far).
Our difference then is, as you say, in whether there are models which are not safe and whether this is relevant. In Section 5, on the probability of misalignment, and in your last comment, you suggest that it is highly likely that language agents are the path forward. I am not at all convinced that this is correct (e.g., I think that it is more likely that systems like I mentioned will be more useful/profitable or just work better somehow, even in the near future) - you would have to convince a lot of people to use language agents alone, and that wouldn’t happen easily. Therefore, I think that it is relevant that there are other models which do not exhibit the sort of safety guarantees you think language agents have. Hope this clears our differences.
(I would like to mention again that I appreciate your thoughts on language agents, and your engagement with my criticism.)