There are limits, however: scaling alone would not allow Gato to exceed expert performance on diverse tasks, since it is trained to imitate the experts rather than to explore new behaviors and perform in novel ways.
Imitation can exceed experts or demonstrations: note that Gato reaches ≥100%† expert performance on something like a third of tasks (Figure 5), and does look like it exceeds the 2 robot experts in Figure 10 & some in Figure 17. The belief that imitation cannot exceed its demonstrations is a common misconception about imitation learning in general, and about prompt engineering and Decision Transformer/Trajectory Transformer specifically.
An imitation-learning agent can surpass experts in a number of ways. First, experts (especially humans) may simply have ‘trembling hands’ and occasionally make errors at random; a trained agent which has mastered their policy can execute that policy perfectly, never having a brain fart. Second, demonstrations can come from experts with different strengths and weaknesses, like a player who is good in the opening but fails in the endgame and vice versa; by ‘stitching together’ experts, an agent can have the best of both worlds: why imitate the low-reward behaviors when you observe better high-reward ones? Likewise for episodes: keep the good, throw out the bad, and distill for a superior product. Self-distillation and self-ensembling are also relevant to note.
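The ‘keep the good, throw out the bad’ step is just return-based filtering of the demonstration dataset before training; a minimal sketch (the episode format and the percentile cutoff here are made up for illustration, not anything from the Gato paper):

```python
import numpy as np

def filter_episodes(episodes, keep_fraction=0.2):
    """Keep only the top `keep_fraction` of episodes by total return,
    so the model imitates the best observed behavior rather than the average."""
    returns = np.array([sum(ep["rewards"]) for ep in episodes])
    cutoff = np.quantile(returns, 1.0 - keep_fraction)
    return [ep for ep, r in zip(episodes, returns) if r >= cutoff]

# Toy demonstration data: three mediocre episodes, one good one.
episodes = [
    {"rewards": [0, 1]},
    {"rewards": [0, 0]},
    {"rewards": [5, 5]},
    {"rewards": [1, 0]},
]
best = filter_episodes(episodes, keep_fraction=0.25)
```

This is (apparently) what the Gato authors do when they filter the expert data to the highest-reward fraction, as noted in the footnote below.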
More broadly, if we aren’t super-picky about it being exactly Gato*, a Decision Transformer is a generative model of the environment, and so can be used straightforwardly for exploration or planning, exploiting the knowledge from all observed states & rewards, even demonstrations from randomized agents, to obtain better results up to the limit of its model of the environment (eg a chess-playing agent can plan for arbitrarily long to improve its next move, but if it hasn’t yet observed a castling or promotion, there’s going to be limits to how high its Elo strength can go). And it can then retrain on the planning, like MuZero, or self-distillation in DRL and for GPT-3.
More specifically, a Decision Transformer is used with a prompt: just as you can get better or worse code completions out of GPT-3 by prompting it with “an expert wrote this thoroughly-tested and documented code:” or “A amteur wrote sum codez and its liek this ok”, or just as you can prompt a CLIP or DALL-E model with “trending on artstation | ultra high-res | most beautiful image possible” to make it try to extrapolate in its latent space to images never in the training dataset, you can ‘just ask it for performance’ by prompting it with a high ‘reward’ to sample its estimate of the most optimal trajectory, or even ask it to get ‘more than’ X reward. It will generalize over the states and observed rewards and implicitly infer pessimal or optimal performance as best it can, and the smarter (bigger) it is, the better it will do this. Obvious implications for transfer or finetuning as the model gets bigger and can bring to bear more powerful priors and abilities like meta-learning (which we don’t see here, because Gato is so small and they don’t test it in ways which would expose such capabilities dramatically, but we know from larger models how surprising they can be and how they can perform in novel ways...).
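The ‘just ask it for performance’ trick boils down to conditioning the rollout on a desired return-to-go and decrementing it as rewards come in. A minimal sketch of that loop, with a toy stand-in environment and model (the `predict_action` interface and everything else here is hypothetical, not Gato’s or any paper’s actual API):

```python
class ToyEnv:
    """One-dimensional chain: reward 1 per step right, episode ends at position 3."""
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):
        self.pos += action
        return self.pos, (1 if action == 1 else 0), self.pos >= 3

class GreedyStub:
    """Stands in for a trained DT: if the remaining desired return is
    positive, it 'infers' that a high-reward agent would step right."""
    def predict_action(self, returns_to_go, states, actions):
        return 1 if returns_to_go[-1] > 0 else 0

def sample_trajectory(model, env, target_return, horizon):
    """Decision-Transformer-style rollout: condition on the return we *want*,
    then let the model infer what an agent achieving it would do."""
    returns_to_go = [target_return]   # the 'prompt': desired reward
    states, actions = [env.reset()], []
    for _ in range(horizon):
        # Model sees (return-to-go, state, action, ...) and predicts the next action.
        a = model.predict_action(returns_to_go, states, actions)
        s, r, done = env.step(a)
        actions.append(a)
        states.append(s)
        # Decrement the remaining desired return by the reward just received.
        returns_to_go.append(returns_to_go[-1] - r)
        if done:
            break
    return states, actions

traj_states, traj_actions = sample_trajectory(GreedyStub(), ToyEnv(),
                                              target_return=3, horizon=5)
```

Asking for `target_return=0` from the same stub would produce the pessimal ‘do nothing’ trajectory: the same model, prompted to be worse.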
DL scaling sure is interesting.
* I am not quite sure if Gato is a DT or not, because if I understood the description, they explicitly train only on expert actions with observation context—but usually you’d train a causal Transformer packed so it also predicts all of the tokens of state/action/state/action.../state in the context window, the prefixes 1:n, because this is a huge performance win, and this is common enough that it usually isn’t mentioned, so even if they don’t explicitly say so, I think it’d wind up being a DT anyway. Unless they didn’t include the reward at all? (Rereading, I notice they filter the expert data to the highest-reward %. This is something that ought to be necessary only if the model is either very undersized so it’s too stupid to learn both good & bad behavior, or if it is not conditioning on the reward so you need to force it to implicitly condition on ‘an expert wrote this’, as it were, by deleting all the bad demonstrations.) Which would be a waste, but also easily changed for future agents.
† Regrettably, not broken out as a table or specific numbers provided anywhere so I’m not sure how much was >100%.
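The footnoted distinction, predicting only the action tokens versus predicting every token of the packed state/action window, amounts to nothing more than a loss mask over the sequence. A sketch (the per-token losses and token layout are invented for illustration):

```python
import numpy as np

def sequence_loss(nll, token_types, predict_all=True):
    """Average negative log-likelihood over a packed (s, a, s, a, ..., s) window.
    predict_all=True trains on every token (the usual DT/causal-LM setup, the
    'huge performance win'); predict_all=False trains only on action tokens
    (what Gato's description suggests)."""
    nll = np.asarray(nll, dtype=float)
    if predict_all:
        mask = np.ones_like(nll)
    else:
        # 1.0 for action tokens, 0.0 for state tokens.
        mask = np.array([t == "a" for t in token_types], dtype=float)
    return (nll * mask).sum() / mask.sum()

# Toy per-token losses for a packed window s a s a s:
nll = [1.0, 2.0, 3.0, 4.0, 6.0]
types = ["s", "a", "s", "a", "s"]
full = sequence_loss(nll, types, predict_all=True)          # gradient from all 5 tokens
actions_only = sequence_loss(nll, types, predict_all=False)  # only the 2 action tokens
```

Switching a future agent from the second objective to the first is, as noted above, an easy change.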
Sounds like Decision Transformers (DTs) could quickly become powerful decision-making agents. Some questions about them for anybody who’s interested:
DT Progress and Predictions
Outside Gato, where have decision transformers been deployed? Gwern gives several good reasons to expect that performance could quickly scale up (self-training, meta-learning, mixture of experts, etc.). Do you expect the advantages of DTs to improve state-of-the-art performance on key RL benchmark tasks, or are the long-term implications of DTs more difficult to measure? Focusing on the compute costs of training and deployment, will DTs be performance-competitive with other RL systems at current and future levels of compute?
Key Domains for DTs
Transformers have succeeded in data-rich domains such as language and vision. Domains with lots of data allow the models to take advantage of growing compute budgets and keep up with high-growth scaling trends. RL has similarly benefitted from self-play for nearly infinite training data. In what domains do you expect DTs to succeed? Would you call out any specific critical capabilities that could lead to catastrophic harm from DTs? Where do you expect DTs to fail?
My current answer would focus on risks from language models, though I’d be interested to hear about specific threats from multimodal models. Previous work has shown threats from misinformation and persuasion. You could also consider threats from offensive cyberweapons assisted by LMs and potential paths to using weapons of mass destruction.
These risks exist with current transformers, but DTs / RL + LMs open a whole new can of worms. You get all of the standard concerns about agents: power seeking, reward hacking, inner optimizers. If you wrote Gwern’s realistic tale of doom for Decision Transformers, what would change?
DT Safety Techniques
What current AI safety techniques would you like to see applied to decision transformers? Will Anthropic’s RLHF methods help decision transformers learn more nuanced reward models for human preferences? Or will the signal be too easily Goodharted, improving capabilities without asymmetrically improving AI safety? What about Redwood’s high-reliability rejection sampling—does it look promising for monitoring the decisions made by DTs?
Generally speaking, are you concerned about capabilities externalities? DeepMind and OpenAI seem to have released several of the most groundbreaking models of the last five years, a strategic choice made by safety-minded people. Would you have preferred slower progress towards AGI at the expense of not conducting safety research on cutting-edge systems?