Imitation can exceed experts or demonstrations: note that Gato reaches >=100%† expert performance on something like a third of tasks (Figure 5), and does look like it exceeds the 2 robot experts in Figure 10 & some in Figure 17. This is a common mistake about imitation learning and prompt engineering or Decision Transformer/Trajectory Transformer specifically.
An imitation-learning agent can surpass experts in a number of ways: first, experts (especially humans) may simply have ‘trembling hands’ and make errors occasionally at random; a trained agent which has mastered their policy can simply execute that policy perfectly, never having a brain fart. Second, demonstrations can come from experts with different strengths and weaknesses, like a player who is good at the opening but fails in the endgame and vice versa, and by ‘stitching together’ experts, an agent can have the best of both worlds—why imitate the low-reward behaviors when you observe better high-reward ones? Likewise for episodes: keep the good, throw out the bad, distill for a superior product. Self-distillation and self-ensembling are also relevant to note.
More broadly, if we aren’t super-picky about it being exactly Gato*, a Decision Transformer is a generative model of the environment, and so can be used straightforwardly for exploration or planning, exploiting the knowledge from all observed states & rewards, even demonstrations from randomized agents, to obtain better results up to the limit of its model of the environment (eg a chess-playing agent can plan for arbitrarily long to improve its next move, but if it hasn’t yet observed a castling or promotion, there’s going to be limits to how high its Elo strength can go). And it can then retrain on the planning, like MuZero, or self-distillation in DRL and for GPT-3.
More specifically, a Decision Transformer is used with a prompt: just as you can get better or worse code completions out of GPT-3 by prompting it with “an expert wrote this thoroughly-tested and documented code:” or “A amteur wrote sum codez and its liek this ok”, or just as you can prompt a CLIP or DALL-E model with “trending on artstation | ultra high-res | most beautiful image possible” to make it try to extrapolate in its latent space to images never in the training dataset, you can ‘just ask it for performance’ by prompting it with a high ‘reward’ to sample its estimate of the most optimal trajectory, or even ask it to get ‘more than’ X reward. It will generalize over the states and observed rewards and implicitly infer pessimal or optimal performance as best it can, and the smarter (bigger) it is, the better it will do this. Obvious implications for transfer or finetuning as the model gets bigger and can bring to bear more powerful priors and abilities like meta-learning (which we don’t see here because Gato is so small, and they don’t test it in ways which would expose such capabilities in dramatic ways—but we know from larger models how surprising they can be and how they can perform in novel ways...).
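The ‘just ask it for performance’ idea can be made concrete with a toy, self-contained sketch (this is illustrative only, not Gato’s or the Decision Transformer paper’s actual code): demonstrations of mixed quality are logged as (return, action) pairs, and conditioning the learned model on a high return recovers the good expert’s behavior even though half the training data is bad.

```python
# Toy return-conditioned imitation on a 2-armed bandit (illustrative, not Gato).
# Demonstrations mix a bad expert (arm 0, reward 0) and a good expert (arm 1, reward 1).
# "Prompting with a high reward" = conditioning on return=1 at decision time.
from collections import Counter, defaultdict

demos = [(0, 0)] * 50 + [(1, 1)] * 50   # (return, action) pairs from mixed experts

# "Model": the empirical conditional distribution P(action | return)
by_return = defaultdict(Counter)
for ret, action in demos:
    by_return[ret][action] += 1

def act(target_return):
    """Pick the most likely action conditioned on the desired return."""
    return by_return[target_return].most_common(1)[0][0]

print(act(1))  # conditioning on high return -> the good expert's arm, i.e. 1
```

A real Decision Transformer does the same thing in function approximation: the return-to-go token at the front of the context plays the role of `target_return`, and scale lets the model interpolate to returns it never saw verbatim.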
DL scaling sure is interesting.
* I am not quite sure if Gato is a DT or not, because if I understood the description, they explicitly train only on expert actions with observation context—but usually you’d train a causal Transformer packed so it also predicts all of the tokens of state/action/state/action.../state in the context window, the prefixes 1:n, because this is a huge performance win, and this is common enough that it usually isn’t mentioned, so even if they don’t explicitly say so, I think it’d wind up being a DT anyway. Unless they didn’t include the reward at all? (Rereading, I notice they filter the expert data to the highest-reward %. This is something that ought to be necessary only if the model is either very undersized so it’s too stupid to learn both good & bad behavior, or if it is not conditioning on the reward so you need to force it to implicitly condition on ‘an expert wrote this’, as it were, by deleting all the bad demonstrations.) Which would be a waste, but also easily changed for future agents.
† Regrettably, not broken out as a table or specific numbers provided anywhere so I’m not sure how much was >100%.
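On the first footnote’s prefix-training point: a causal Transformer with a loss on every token trains on all prefixes 1:n of the context window at once, whereas masking the loss down to action tokens alone discards the state/reward prediction signal. A tiny sketch with dummy per-token loss values (illustrative only, not real training code):

```python
# Illustrative: loss over every token vs. loss over action tokens only.
# An interleaved trajectory s1 a1 s2 a2 s3 a3 with made-up per-token losses.
tokens = ["s1", "a1", "s2", "a2", "s3", "a3"]
per_token_loss = [0.9, 0.5, 0.8, 0.4, 0.7, 0.3]

# Training on everything: every prefix 1..n contributes a prediction target.
full_loss = sum(per_token_loss) / len(per_token_loss)

# Gato-style actions-only masking: state tokens contribute no training signal.
action_losses = [l for t, l in zip(tokens, per_token_loss) if t.startswith("a")]
action_only_loss = sum(action_losses) / len(action_losses)

print(full_loss, action_only_loss)
```

The masked variant optimizes a strictly smaller set of targets per sequence, which is why all-token prediction is the usual “huge performance win” mentioned above.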
Sounds like Decision Transformers (DTs) could quickly become powerful decision-making agents. Some questions about them for anybody who’s interested:
DT Progress and Predictions
Outside Gato, where have decision transformers been deployed? Gwern shows several good reasons to expect that performance could quickly scale up (self-training, meta-learning, mixture of experts, etc.). Do you expect the advantages of DTs to improve state of the art performance on key RL benchmark tasks, or are the long-term implications of DTs more difficult to measure? Focusing on the compute costs of training and deployment, will DTs be performance competitive with other RL systems at current and future levels of compute?
Key Domains for DTs
Transformers have succeeded in data-rich domains such as language and vision. Domains with lots of data allow the models to take advantage of growing compute budgets and keep up with high-growth scaling trends. RL has similarly benefitted from self-play for nearly infinite training data. In what domains do you expect DTs to succeed? Would you call out any specific critical capabilities that could lead to catastrophic harm from DTs? Where do you expect DTs to fail?
My current answer would focus on risks from language models, though I’d be interested to hear about specific threats from multimodal models. Previous work has shown threats from misinformation and persuasion. You could also consider threats from offensive cyberweapons assisted by LMs and potential paths to using weapons of mass destruction.
These risks exist with current transformers, but DTs / RL + LMs open a whole new can of worms. You get all of the standard concerns about agents: power seeking, reward hacking, inner optimizers. If you wrote Gwern’s realistic tale of doom for Decision Transformers, what would change?
DT Safety Techniques
What current AI safety techniques would you like to see applied to decision transformers? Will Anthropic’s RLHF methods help decision transformers learn more nuanced reward models for human preferences? Or will the signal be too easily Goodharted, improving capabilities without asymmetrically improving AI safety? What about Redwood’s high-reliability rejection sampling—does it look promising for monitoring the decisions made by DTs?
Generally speaking, are you concerned about capabilities externalities? DeepMind and OpenAI seem to have released several of the most groundbreaking models of the last five years, a strategic choice made by safety-minded people. Would you have preferred slower progress towards AGI at the expense of not conducting safety research on cutting-edge systems?
As a non-technical person struggling to wrap my head around AI developments, I really appreciated this post! I thought it was a good length and level of technicality, and would love to read more things like it!
For what it’s worth, as a layperson, I found it pretty hard to follow properly. I also think there’s a selection effect where people who found it easy will post but people who found it hard won’t.
this is really good to know, thank you!! I’m thinking we hit more of a ‘familiar with some technical concepts/lingo’ accessibility level rather than being accessible to people who truly have no/little familiarity with the field/concepts.
Curious if that seems right or not (maybe some aspects of this post are just broadly confusing). I was hoping this could be accessible to anyone so will have to try and hit that mark better in the future.
Ah, I made an error here, I misread what was in which thread and thought Amber was talking about Gwern’s comment rather than your original post. The post itself is fine! Sorry!
Oh that’s totally okay, thanks for clarifying!! And good to get more feedback because I was/am still trying to collect info on how accessible this is
Ditto here :)
This is a terrific distillation, thanks for sharing! I really like the final three sections with implications for short-term, long-term, and policy risks.
These are some great examples of US executive agencies that make policy decisions about AI systems. You could also include financial regulators (SEC, CFPB, Treasury) and national defense (DOD, NSA, CIA, FBI). Not many people in these agencies work on AI, but 80,000 Hours argues that those who do could make impactful decisions while building career capital.
Thank you for writing this! I found it very helpful as I only saw headlines about Gato before and am not watching developments in AI closely. I liked the length and style of writing very much and would appreciate similar posts in future.
Note : I haven’t studied any of this in detail!!!
This review is nice, but it is a bit too vague to be useful, to be honest. What new capabilities that would actually have economic value are enabled here? It seems this is very relevant to robotics and transfer between robotic tasks. So maybe that?
Looking at figure 9 in the paper the “accelerated learning” from training on multiple tasks seems small.
Note that the generalist agent, I believe, has to be trained on all tasks combined at once; it can’t be trained on them serially (this would lead to catastrophic forgetting). Note this is very different from how humans learn and is a limitation of ML/DL. When you want the agent to learn a new task, I believe you have to retrain the whole thing from scratch on all tasks, which could be quite expensive.
It seems the ‘generalist agent’ is generally not better than the specialized agents in terms of performance. Interestingly, the generalist agent can’t use text-based tasks to help with image-based tasks. Glancing at figure 17, it seems training on all tasks hurt the performance on the robotics task (if I’m understanding it right). This is different from a human: a human who has read a manual on how to operate a forklift, for instance, would learn faster than one who hasn’t read the manual. Are transformers like that? I don’t think we know, but my guess is probably not, and the results of this paper support that.
So I can see an argument here that this points towards a future that is more like comprehensive AI services, rather than a future where research is focused on building monolithic “AGIs”, which would lower x-risk concerns, I think. To be clear, I think the monolithic-AGI future is much more likely, personally, but this paper makes me update slightly away from that, if anything.
It’s unclear that this is true: “Effect of scale on catastrophic forgetting in neural networks”. (The response on Twitter from catastrophic forgetting researchers to the news that their field might be a fake field of research, as easily solved by scale as, say, text style transfer, and that continual learning may just be another blessing of scale, was along the lines of “but using large models is cheating!” That is the sort of response which makes me more, not less, confident in a new research direction. New AI forecasting drinking game: whenever a noted researcher dismisses the prospect of scaling creating AGI as “boring”, drop your Metaculus forecast by 1 week.)
No, you can finetune the model as-is. You can also stave off catastrophic forgetting by simply mixing in the old data. After all, it’s an off-policy approach using logged/offline data, so you can have as much of the old data available as you want—hard drive space is cheap.
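The “mix in the old data” recipe is straightforward to sketch. The following is an illustrative replay-mixing sketch, not anything from the Gato paper; names like `old_fraction` are made up for the example:

```python
# Illustrative sketch: finetuning batches that replay logged old-task data
# alongside new-task data, the simple recipe for staving off catastrophic
# forgetting when the old offline data is still on disk.
import random

def mixed_batches(old_data, new_data, batch_size=8, old_fraction=0.5, seed=0):
    """Yield batches interleaving replayed old-task and fresh new-task examples."""
    rng = random.Random(seed)
    n_old = int(batch_size * old_fraction)
    while True:
        batch = rng.sample(old_data, n_old) + rng.sample(new_data, batch_size - n_old)
        rng.shuffle(batch)
        yield batch

# Dummy data: old-task examples are ids < 100, new-task examples are >= 100.
batches = mixed_batches(list(range(100)), list(range(100, 200)))
first = next(batches)
print(len(first))  # 8 examples per batch, half of them replayed old-task data
```

Because the approach is off-policy over logged data, the only cost of keeping the old distribution in the mix is storage, which is cheap.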
An “aside from that, Mrs Lincoln, how was the play” sort of observation. GPT-1 was SOTA using zero-shot at pretty much nothing, and GPT-2 often wasn’t better than specialized approaches either. The question is not whether the current, exact, small incarnation is SOTA at everything and is an all-singing-all-dancing silver bullet which will bring about the Singularity tomorrow, and if it doesn’t, we should go all “Gato: A Disappointing Paper” and kick it to the curb. The question is whether it scales and has easily-overcome problems. That’s the beauty of scaling laws: they drag us out of the myopic muck of “yeah, but it doesn’t set SOTA on everything right this second, so I can’t be bothered to care or have an opinion” by giving us lines on charts to extrapolate out to the (perhaps not very distant at all) future where they will become SOTA and enjoy broad transfer and sample-efficient learning and all that jazz, just as their unimodal forebears did.
I think this is strong evidence for monolithic AGIs, that at such a small scale, the problems of transfer and the past failures at multi-task learning have already largely vanished and we are already debating whether the glass is half-empty while it looks like it has good scaling using a simple super-general and efficiently-implementable Decision Transformer-esque architecture. I mean, do you think Adept is looking at Gato and going “oh no, our plans to train very large Transformers on every kind of software interaction in the world to create single general agents which can learn useful tasks almost instantly, for all niches, including the vast majority which would never be worth handcrafting specialized agents for—they’re doomed, Gato proves it. Look, this tiny model a hundredth the magnitude of what we intend to use, trained on thousands of time less and less diverse data, it is so puny that it trains perfectly stably but is not better than the specialized agents and has ambiguous transfer! What a devastating blow! Guess we’ll return all that VC money, this is an obvious dead end.” That seems… unlikely.
Thanks, yeah I agree overall. Large pre-trained models will be the future, because of the few shot learning if nothing else.
I think the point I was trying to make, though, is that this paper raises a question, at least to me, as to how well these models can share knowledge between tasks. But I want to stress again I haven’t read it in detail.
In theory, we expect that multi-task models should do better than single task because they can share knowledge between tasks. Of course, the model has to be big enough to handle both tasks. (In medical imaging, a lot of studies don’t show multi-task models to be better, but I suspect this is because they don’t make the multi-task models big enough.) It seemed what they were saying was it was only in the robotics tasks where they saw a lot of clear benefits to making it multi-task, but now that I read it again it seems they found benefits for some of the other tasks too. They do mention later that transfer across Atari games is challenging.
Another thing I want to point out is that, at least right now, training large models and parallelizing the training over many GPUs/TPUs is really technically challenging. They even ran into hardware problems here which limited the context window they were able to use. I expect this to change, though, with better GPU/TPU hardware and software infrastructure.