This review is nice but it is a bit too vague to be useful, to be honest. What new capabilities, that would actually have economic value, are enabled here? It seems this is very relevant to robotics and transfer between robotic tasks. So maybe that?
Looking at figure 9 in the paper, the “accelerated learning” from training on multiple tasks seems small.
Note that the generalist agent, I believe, has to be trained on all tasks combined at once; it can’t be trained on them serially (this would lead to catastrophic forgetting). Note this is very different than how humans learn and is a limitation of ML/DL. When you want the agent to learn a new task, I believe you have to retrain the whole thing from scratch on all tasks, which could be quite expensive.
It seems the ‘generalist agent’ is not better than the specialized agents in terms of performance, generally. Interestingly, the generalist agent can’t use text-based tasks to help with image-based tasks. Glancing at figure 17, it seems training on all tasks hurt the performance on the robotics task (if I’m understanding it right). This is different than a human: a human who has read a manual on how to operate a forklift, for instance, would learn faster than a human who hasn’t read the manual. Are transformers like that? I don’t think we know, but my guess is probably not, and the results of this paper support that.
So I can see an argument here that this points towards a future that is more like comprehensive AI services rather than a future where research is focused on building monolithic “AGIs”, which would lower x-risk concerns, I think. To be clear, I think the monolithic AGI future is much more likely, personally, but this paper makes me update slightly away from that, if anything.
It’s unclear that this is true: “Effect of scale on catastrophic forgetting in neural networks”. (The response on Twitter from catastrophic forgetting researchers to the news that their field might be a fake field of research, as easily solved by scale as, say, text style transfer, and that continual learning may just be another blessing of scale, was along the lines of “but using large models is cheating!” That is the sort of response which makes me more, not less, confident in a new research direction. New AI forecasting drinking game: whenever a noted researcher dismisses the prospect of scaling creating AGI as “boring”, drop your Metaculus forecast by 1 week.)
When you want the agent to learn a new task, I believe you have to retrain the whole thing from scratch on all tasks, which could be quite expensive.
No, you can finetune the model as-is. You can also stave off catastrophic forgetting by simply mixing in the old data. After all, it’s an off-policy approach using logged/offline data, so you can have as much of the old data available as you want—hard drive space is cheap.
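A minimal sketch of the “mixing in the old data” idea, assuming the logged data is just a list of examples per task. The function name `mixed_batches`, the `old_fraction` parameter, and the toy data are all hypothetical and only for illustration; this is not the paper’s actual training pipeline.

```python
import random


def mixed_batches(new_data, old_data, batch_size=8, old_fraction=0.5,
                  steps=100, seed=0):
    """Yield finetuning batches that mix new-task examples with replayed
    old-task examples, so the old tasks keep contributing to the loss."""
    rng = random.Random(seed)
    n_old = int(round(batch_size * old_fraction))  # replayed old examples per batch
    n_new = batch_size - n_old                     # new-task examples per batch
    for _ in range(steps):
        # Sample with replacement from each logged dataset, then shuffle
        # so old and new examples are interleaved within the batch.
        batch = rng.choices(new_data, k=n_new) + rng.choices(old_data, k=n_old)
        rng.shuffle(batch)
        yield batch


# Toy stand-ins for logged (task, example_id) records (hypothetical data).
new_task = [("new_task", i) for i in range(10)]
old_tasks = [("old_task", i) for i in range(100)]

batches = list(mixed_batches(new_task, old_tasks, batch_size=8,
                             old_fraction=0.25, steps=5))
```

Each batch here would then be passed to whatever finetuning step is in use; the only design choice shown is the fixed ratio of old to new examples, which is a knob to tune rather than a known-good value.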
It seems the ‘generalist agent’ is not better than the specialized agents in terms of performance, generally.
An “aside from that, Mrs. Lincoln, how was the play” sort of observation. GPT-1 was SOTA using zero-shot at pretty much nothing, and GPT-2 often wasn’t better than specialized approaches either. The question is not whether the current, exact, small incarnation is SOTA at everything and is an all-singing-all-dancing silver bullet which will bring about the Singularity tomorrow and, if it doesn’t, we should go all “Gato: A Disappointing Paper” and kick it to the curb. The question is whether it scales and has easily-overcome problems. That’s the beauty of scaling laws: they drag us out of the myopic muck of “yeah but it doesn’t set SOTA on everything right this second, so I can’t be bothered to care or have an opinion” by giving us lines on charts to extrapolate out to the (perhaps not very distant at all) future where they will become SOTA and enjoy broad transfer and sample-efficient learning and all that jazz, just as their unimodal forebears did.
So I can see an argument here that this points towards a future that is more like comprehensive AI services rather than a future where research is focused on building monolithic “AGIs”
I think this is strong evidence for monolithic AGIs: at such a small scale, the problems of transfer and the past failures at multi-task learning have already largely vanished, and we are already debating whether the glass is half-empty while it looks like it has good scaling using a simple, super-general, and efficiently-implementable Decision Transformer-esque architecture. I mean, do you think Adept is looking at Gato and going “oh no, our plans to train very large Transformers on every kind of software interaction in the world to create single general agents which can learn useful tasks almost instantly, for all niches, including the vast majority which would never be worth handcrafting specialized agents for: they’re doomed, Gato proves it. Look, this tiny model a hundredth the magnitude of what we intend to use, trained on thousands of times less, and less diverse, data, it is so puny that it trains perfectly stably but is not better than the specialized agents and has ambiguous transfer! What a devastating blow! Guess we’ll return all that VC money, this is an obvious dead end.” That seems… unlikely.
Thanks, yeah, I agree overall. Large pre-trained models will be the future, because of the few-shot learning if nothing else.
I think the point I was trying to make, though, is that this paper raises a question, at least to me, as to how well these models can share knowledge between tasks. But I want to stress again I haven’t read it in detail.
In theory, we expect that multi-task models should do better than single task because they can share knowledge between tasks. Of course, the model has to be big enough to handle both tasks. (In medical imaging, a lot of studies don’t show multi-task models to be better, but I suspect this is because they don’t make the multi-task models big enough.) It seemed what they were saying was it was only in the robotics tasks where they saw a lot of clear benefits to making it multi-task, but now that I read it again it seems they found benefits for some of the other tasks too. They do mention later that transfer across Atari games is challenging.
Another thing I want to point out is that, at least right now, training large models and parallelizing the training over many GPUs/TPUs is really technically challenging. They even ran into hardware problems here which limited the context window they were able to use. I expect this to change, though, with better GPU/TPU hardware and software infrastructure.
Note: I haven’t studied any of this in detail!!!