Sounds like Decision Transformers (DTs) could quickly become powerful decision-making agents. Some questions about them for anybody who’s interested:
DT Progress and Predictions
Outside Gato, where have decision transformers been deployed? Gwern gives several good reasons to expect that performance could scale up quickly (self-training, meta-learning, mixture of experts, etc.). Do you expect the advantages of DTs to improve state-of-the-art performance on key RL benchmark tasks, or are the long-term implications of DTs harder to measure? Focusing on the compute costs of training and deployment, will DTs be performance-competitive with other RL systems at current and future levels of compute?
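(For readers less familiar with the setup: a decision transformer reframes RL as sequence modeling, conditioning action prediction on a desired return rather than maximizing a learned value function. Here is a minimal sketch of the interleaved (return-to-go, state, action) input format; the function names are illustrative, not from any particular implementation.)

```python
def returns_to_go(rewards):
    """Suffix sums of rewards: R_t = r_t + r_{t+1} + ... + r_T.
    The DT conditions each action prediction on this quantity."""
    rtg, total = [], 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return list(reversed(rtg))

def interleave(states, actions, rewards):
    """Build the (R_1, s_1, a_1, R_2, s_2, a_2, ...) token sequence
    that a decision transformer is trained on."""
    tokens = []
    for R, s, a in zip(returns_to_go(rewards), states, actions):
        tokens.extend([("rtg", R), ("state", s), ("action", a)])
    return tokens

# At evaluation time you prompt with a high target return
# (e.g. an expert-level score) and autoregressively sample actions.
```

The key point for the capability questions above: because conditioning on higher target returns can elicit better behavior than the average of the training data, scaling the model and dataset may improve performance without any explicit RL optimization loop.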
Key Domains for DTs
Transformers have succeeded in data-rich domains such as language and vision, where abundant data lets models take advantage of growing compute budgets and keep pace with scaling trends. RL has similarly benefited from self-play as a source of nearly unlimited training data. In what domains do you expect DTs to succeed? Would you call out any specific critical capabilities that could lead to catastrophic harm from DTs? Where do you expect DTs to fail?
My current answer would focus on risks from language models, though I’d be interested to hear about specific threats from multimodal models. Previous work has shown threats from misinformation and persuasion. You could also consider threats from offensive cyberweapons assisted by LMs and potential paths to using weapons of mass destruction.
These risks exist with current transformers, but DTs / RL + LMs open a whole new can of worms. You get all of the standard concerns about agents: power seeking, reward hacking, inner optimizers. If you wrote Gwern’s realistic tale of doom for Decision Transformers, what would change?
DT Safety Techniques
What current AI safety techniques would you like to see applied to decision transformers? Will Anthropic's RLHF methods help decision transformers learn more nuanced reward models for human preferences? Or will the signal be too easily Goodharted, improving capabilities without asymmetrically improving AI safety? What about Redwood's high-reliability rejection sampling—does it look promising for monitoring the decisions made by DTs?
Generally speaking, are you concerned about capabilities externalities? DeepMind and OpenAI have released several of the most groundbreaking models of the last five years, a strategic choice made by safety-minded people. Would you have preferred slower progress towards AGI at the expense of not conducting safety research on cutting-edge systems?