Does generality pay? GPT-3 can provide preliminary evidence.
One open question in predicting the future development of AI is whether individual AI systems will tend to remain highly specialized or become more general over time. In the most recent episode of the 80,000 Hours Podcast, Ben Garfinkel says:
Another argument for [AI systems staying specialized] is that it seems like specialized systems often outperform general ones, or it's often easier to make, let's say two specialized systems, one which performs task A and one that performs task B pretty well, rather than a single system that does both well. And it seems like this is often the case to some extent in AI research today. It's easier to create individual systems that can play a single Atari game quite well than it is to create one system that plays all Atari games quite well. And it also seems like it's a general, maybe economic or biological principle or something like that. In lots of current cases, there are benefits from specialization as you get more systems that are interacting. So biologically, when you have a larger organism, cells tend to become more specialized, or economically, as you have a more sophisticated complex economy that does more stuff, it tends to be the case that you have greater specialization in terms of workers' skills.
The question here is whether it's more cost-effective to develop and use more general or more narrow AI systems. That is, we can develop either a suite of narrow AI systems that each perform some service at some level of competence, or a single, more general AI system that performs all of those services at the same levels of competence. Whichever collection of AI systems is cheaper to develop and use is the one society will more likely adopt.
We now have some examples of relatively general AI: for example, GPT-3 is a language model that performs decently well on a range of natural language processing (NLP) tasks, even though it's only expressly trained to predict the next token in a piece of text. The GPT-3 paper tells us how GPT-3 compares to state-of-the-art (SOTA) systems on the tasks it was tested on. In theory, we should be able to find out how much those SOTA systems cost to produce in total, how much GPT-3 cost to produce, and compare the two.
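As a toy illustration of that comparison, here is a minimal sketch of the total-cost-of-ownership calculation; every task name and figure below is a made-up placeholder, not an estimate of anything real:

# Toy comparison of total cost of ownership: a suite of specialized NLP
# systems vs. one general system covering the same basket of tasks.
# All figures are made-up placeholders, not real cost estimates.

specialized = {
    "translation":        {"develop": 2.0, "use_per_year": 0.5},
    "summarization":      {"develop": 1.5, "use_per_year": 0.4},
    "question_answering": {"develop": 2.5, "use_per_year": 0.6},
}
general = {"develop": 5.0, "use_per_year": 1.2}

years = 3  # assumed deployment horizon

def total_cost(system, years):
    """Development cost plus usage cost over the deployment horizon."""
    return system["develop"] + system["use_per_year"] * years

suite_total = sum(total_cost(s, years) for s in specialized.values())
general_total = total_cost(general, years)

print(f"Suite of specialized systems: {suite_total:.1f}")
print(f"Single general system:        {general_total:.1f}")
print("general wins" if general_total < suite_total else "specialized suite wins")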
But there's another element to cost-effectiveness: how easy the AI system's interface is to use. It might be that even though GPT-3 costs slightly more to produce than a suite of specialized AI systems, GPT-3 provides more value because its service is easier to consume. For example, the interfaces to GPT-3 and a specialized machine translation system might look like this:
# GPT-3
do_language_task("Translate English to French: The quick brown fox jumps over the lazy dog.")

# Narrow machine translation service
translate("The quick brown fox jumps over the lazy dog.", source="en", target="fr")
The GPT-3 interface is more elegant, since it lets you specify tasks in natural language rather than learn a separate API for each type of task. It also lets you specify tasks that no one has trained a narrow AI system for, by providing an instruction along with a relatively small number of examples (what the authors call few-shot learning). This tips the balance in favor of more general AI systems like GPT-3.
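For instance, reusing the hypothetical do_language_task interface from above, a few-shot request for a task no one has built a dedicated system for might look like this, with the task described in plain English and demonstrated by a couple of examples:

# Few-shot prompting via the hypothetical general-purpose interface:
# the task is described in natural language and shown with two examples,
# and the model is expected to complete the final, unfinished one.
do_language_task("""Convert each sentence to the past tense.

Sentence: The fox jumps over the dog.
Past tense: The fox jumped over the dog.

Sentence: She writes a letter.
Past tense: She wrote a letter.

Sentence: They build a house.
Past tense:""")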
If GPT-3 is cheaper to develop and use across a representative basket of NLP tasks than an equivalent suite of specialized NLP systems, then the market will likely favor more general AI systems for NLP tasks. This evidence will provide insight into whether society will develop more general AI systems or continue to produce narrow ones in the future.
Ben here: Great post!
Something I didn't really touch on in the interview is factors that might push in the direction of generality. I've never considered user-friendliness as a factor that might be important, but I think you're right, at least in the case of GPT-3. I also agree that empirical work investigating the value of generality will probably be increasingly useful.
Some other potential factors that might count in favor of generality:
*It seems like limited data availability can push in the direction of generality. For example: If we wanted to create a system capable of producing Shakespearean sonnets, and we had a trillion examples of Shakespearean sonnets, I imagine that the best and most efficient way to create this system would be to train it only on Shakespearean sonnets. But, since we don't have that many Shakespearean sonnets, it of course ends up being useful to first train the system on a more inclusive corpus of English-language text (as in the case of GPT-3) and then fine-tune it on the smaller Shakespeare dataset. In this way, creating general systems can end up being useful (or even necessary) for creating systems that can perform specific tasks. (Although this argument is consistent with more general systems being used in training but narrower systems ultimately being deployed.) A rough sketch of this pretrain-then-fine-tune pattern follows this list.
*If you’re pretty unsure what tasks you’ll want AI systems to perform in some context—and it’s slow or costly to create new narrow AI systems, to figure out what existing narrow AI system would be appropriate for the tasks that come up, or to switch to using new narrow AI systems—then it may simply be more efficient to use very general AI systems that can handle a wide range of tasks.
*If you use multiple distinct systems to get some job done, there's a cost to coordinating them, which might be avoided if you use a single more unified system. For example, as a human analogy, if three people want to cook a meal together, then some energy is going to need to go into deciding who does what, keeping track of each person's progress, etc. The costs of coordinating multiple specialized units can sometimes outweigh the benefits of specialization.
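As a concrete illustration of the first bullet, here is a minimal sketch of the pretrain-then-fine-tune pattern, assuming the Hugging Face transformers and datasets libraries, GPT-2 standing in as a broadly pretrained language model, and a hypothetical local file sonnets.txt as the small specialized dataset:

# Minimal sketch of pretrain-then-fine-tune: start from a model already
# trained on a broad English corpus (GPT-2 here), then fine-tune it on a
# small, specialized corpus. "sonnets.txt" is a hypothetical local file.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Load the small specialized corpus (one line of text per row).
dataset = load_dataset("text", data_files={"train": "sonnets.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-sonnets",
                           num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()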
I think the "CAIS response" to the latter two points would probably be that AI-driven R&D processes might eventually get really good at quickly spinning up new AI systems and coordinating the use of multiple systems as needed. I'm personally unsure whether I find that compelling in the long run.
Limited data availability and generality in practice now: this paper ( https://arxiv.org/abs/2006.16668 ) is about how improving translation performance for "low resource" languages, which have few training examples available, relies on "positive language transfer" from training on other languages.
It seems like the hyperlink to the arXiv page is broken (i.e., clicking on the arXiv link doesn't work).
Fixed! Whoops.