I have the impression (coming from simulator theory (https://generative.ink/)) that Decision Transformers (DT) have some chance (~45%) of being a much safer trial-and-error technique than RL. The core reason is that a DT learns to simulate a distribution of outcomes (e.g., it learns to simulate the kind of actions that lead to a reward of 10 just as much as those that lead to a reward of 100), and it is only at inference time that you systematically condition it on a reward of 100. So in some sense, the agent that has become very good via trial and error remains a simulated agent activated by the LLM, but the LLM is not the agent itself. Thus, the LLM remains a simulator and has no preferences over its output, except that it should correspond to the kind of output the agent it has been trained to simulate would produce.
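To make the training-vs-inference split concrete, here is a minimal sketch of return-conditioning. A small MLP stands in for the transformer, and all names, shapes, and hyperparameters are illustrative assumptions rather than anything from an actual DT codebase:

```python
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2

# The "simulator": given a state and a return-to-go, predict the action an
# agent achieving that return would take.
policy = nn.Sequential(
    nn.Linear(state_dim + 1, 64),  # input = state concatenated with return-to-go
    nn.ReLU(),
    nn.Linear(64, action_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def train_step(states, actions, returns_to_go):
    """Fit the whole return distribution: trajectories that scored 10 are
    modelled just as faithfully as trajectories that scored 100."""
    pred = policy(torch.cat([states, returns_to_go], dim=-1))
    loss = nn.functional.mse_loss(pred, actions)  # pure behaviour modelling
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def act(states, target_return=100.0):
    """Only at inference time do we condition on the high return we want."""
    rtg = torch.full((states.shape[0], 1), target_return)
    return policy(torch.cat([states, rtg], dim=-1))
```

The point of the sketch is that the weights are only ever pushed toward reproducing the data distribution; the "reward of 100" enters purely as a prompt at inference time.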
By contrast, when you train an LLM with RL, you optimize the entire LLM towards producing the kind of output that an agent with a reward of 100 would produce. The network thus becomes that kind of agent and is no longer a simulator, because when you optimize for a single point (one given reward level), it is easier to just be the agent than to simulate it. It now has preferences that are not about faithfully reproducing the distribution it was trained on, but about maximizing some reward function it has internalized. I’d expect this to correlate with a greater likelihood of taking over the world, because preferences incentivize taking out-of-distribution actions, doing long-term planning to reach specific goals, etc.
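For contrast with the sketch above, here is an equally rough REINFORCE-style update on the same kind of network (again with purely illustrative names and shapes): every gradient step pushes the whole network toward whatever actions scored highly, rather than toward reproducing the data.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2

rl_policy = nn.Sequential(
    nn.Linear(state_dim, 64),  # no return-to-go input: there is only one target
    nn.ReLU(),
    nn.Linear(64, action_dim),
)
rl_optimizer = torch.optim.Adam(rl_policy.parameters(), lr=1e-3)

def rl_step(states, actions, rewards):
    """Push the weights toward high-reward actions (illustrative Gaussian policy)."""
    mean = rl_policy(states)
    log_probs = torch.distributions.Normal(mean, 1.0).log_prob(actions).sum(-1)
    loss = -(log_probs * rewards).mean()  # the objective is reward, not likelihood
    rl_optimizer.zero_grad()
    loss.backward()
    rl_optimizer.step()
    return loss.item()
```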
In a nutshell:
At one end, there are pure simulators (no preferences beyond sampling accurately from the distribution); at the other end, there are agents (preferences, with all outputs filtered and shaped by those preferences).
DTs seem to sit further toward the pure-simulator end than RL-trained models do.
Pure simulators are safer than agents because they don’t have preferences, which leads them to stay more in-distribution; thus, for a given level of accuracy, I’d expect DTs to be safer.
If it’s true that DTs are a safer trial-and-error method than RL, then it could be worth differentially increasing the capabilities of DTs over pure RL.
I know that the notions I’m using (“being the agent vs. simulating the agent”, “preferences vs. simulation”) are fuzzy, but I feel like the argument still makes sense.
What do you think about this?
My take would be: Okay, so you have achieved that, instead of the whole LLM being an agent, it merely simulates an agent. Has this gained us much? I feel like this is (almost exactly) as problematic. The simulated agent can treat the whole LLM as its environment (together with the outside world) and try to game it like any sufficiently agentic misaligned AI would: it can act deceptively so as to keep being simulated inside the LLM, try to gain power in the outside world, which (if it has a good enough understanding of loss minimization) it knows is the most useful world model (so that it will express its goals as variables in that world model), etc. That is, you have just pushed the problem one step back: instead of the LLM-real-world frontier, you must now worry about the agent-LLM frontier.
Of course, we can talk more empirically about how likely these dynamics are and when they would arise. It might well be that the agent, being enclosed in the LLM and facing one further frontier between itself and real-world variables, is less likely to reach real-world variables. But I wouldn’t count on it, since the relationship between the LLM and the real world seems far more complex than the relationship between the agent and the LLM, so most of the work lies in gaming the former barrier, not the latter.