Up until the release of OPT-175B in May 2022, incremental research (i.e., research that makes a relatively small change to an existing method) was the prevailing diffusion mechanism for actors to gain direct access to the weights of a GPT-3-like model; nine GPT-3-like models were developed in that way prior to OPT-175B, of which none had their weights made widely accessible.[1] The wider accessibility of OPT-175B changed the prevailing mechanism to open publication, based on my estimate that more actors have direct access to the OPT-175B model weights than have developed GPT-3-like models themselves. (more)
I don’t think that multiple discovery (i.e., two actors independently coming up with the same idea or result) was significantly involved in the diffusion of GPT-3-like models. In particular, I think if the relevant papers weren’t published, it would’ve been 6 months (90% CI: 1 to 18 months) before any other actor would’ve discovered either a model with GPT-3’s capabilities or the scaling laws it was based on.
It’s plausible that the publication of the GPT-3 and scaling laws papers was unnecessarily early in terms of beating other actors to the punch, but I don’t have enough evidence to be confident in that claim. Regardless, I think that developers should apply more scrutiny to whether they are really in a race to publish, because the harm of accelerating AI capabilities could outweigh the benefit of being the first to publish with a more responsible strategy (in order to establish better publication norms).
Access to compute appears to have been the main factor hindering the diffusion of GPT-3-like models. The next biggest hindering factor appears to have been the difficulty of acquiring the necessary machine learning and engineering expertise. (more)
I estimate that compute makes up 87% (90% CI: 64% to 98%) of the combined compute cost and salary cost of GPT-3-like model development.[2]
The largest accelerating factors in the cases I studied (i.e., factors that aren’t necessary for developing GPT-3-like models but that seemed to make development easier or more likely) are, in order of apparent importance, (1) publicity about GPT-3’s capabilities, (2) the sponsorship of compute resources, and (3) the release of open-source tools for large-scale model training. (more)
My best guess is that the release of GPT-3 sped up both DeepMind and Google’s work on language model scaling by six months (90% CI: 1–18 months). My guess is based on (1) how unexpected GPT-3 was in terms of training compute, (2) the hype surrounding GPT-3 following its publication, and (3) comments from three people who have done research on large language models.[3] (more)
So far in the West, the sponsorship of compute has only helped academic groups, independent groups, and smaller AI companies catch up to where leading AI labs were in 2020. This can allow them to use models closer to the cutting edge than they would otherwise have access to, to do research on such models, and to increase the number of people with access to these models (e.g., as happened with BLOOM open-sourcing its weights). In the future, sponsorship could give an academic institution or AI startup most of the resources they need to play a significantly larger role in AI development and diffusion. It is also an easy way for governments to play such a role by offering resources, even if they lack key talent in-house. This seems true most of all in China, where there are strong ties between the government and the AI research community.[4] (more)
Diffusion of closed-source GPT-3-like models has been accelerated by incremental progress in, and open publication of, artifacts that are relevant to a given model. Relevant artifacts include datasets, smaller models, specialized software tools, and the accumulation of published method details (e.g., parallelism strategies). I call this process a diffusion cascade—diffusion of model-relevant artifacts begets diffusion of the model itself. Diffusion cascades can be limited by minimizing the spread of model-relevant artifacts (rather than only avoiding publishing model weights or algorithmic insights). (more)
In addition to never publishing, delaying publication can be, and has been, successfully used to limit diffusion. (more)
I estimate that if, after GPT-3 was trained, the GPT-3 project team had done the necessary work to publish the GPT-3 paper (Brown et al., 2020) as soon as possible, it could have been ready to publish four months sooner than it was.
Similarly, I estimate that the Gopher paper (Rae et al., 2021) could have been ready to publish nine months sooner than it was (holding constant the time at which it was actually trained).
Both of these publication delays seemed to be (partly) motivated by a desire to delay wide access to a powerful model, until the potential harms of that model were better understood or more easily mitigated. I also think it’s likely that both of those delays significantly slowed diffusion of GPT-3-like models given (a) how much of an impact GPT-3 itself had, and (b) the additional insights about training large language models that the Gopher paper presented.[5]
The prevailing diffusion mechanism for GPT-3-like model weights was initially incremental research, then open publication
Up until the release of OPT-175B in May 2022, incremental research had been the prevailing diffusion mechanism for gaining direct access to the weights of a GPT-3-like model. After the release of OPT-175B, the prevailing mechanism has been the combination of replication and open publication. What follows is my reasoning and further thoughts on the mechanisms of diffusion of GPT-3-like models:
Incremental research can be seen as a variant of replication, where actors are probably capable of replication but are more incentivized to surpass or otherwise differentiate themselves from the existing result.
While OPT-175B is an explicit replication attempt, all nine other GPT-3-like models I identified that came before it are cases of incremental research: for example, Gopher and PaLM are much larger models at 280 and 540 billion parameters respectively; Jurassic-1-Jumbo uses a different tokenizer to enhance the vocabulary of the model; Chinchilla uses a more compute-optimal training method.[6]
My guess is that the main reasons to do incremental research rather than replication are:
Improving performance on some of the same tasks that the original model performed, to make the new result more useful or impressive
Improving performance for different tasks or mediums, e.g., Chinese language rather than English language, to make the new result more useful or impressive
Making research results (seem) as novel as possible, so the work gets more attention and is more likely to be accepted to publication venues
Open publication of a GPT-3-like model required replication first, because the original developer (OpenAI) did not openly publish GPT-3’s model weights. This state of affairs persisted until the first (somewhat) open publication of weights, with OPT-175B in May 2022. My understanding from the request form is that OPT-175B is available to AI researchers with at least one publication that is broadly relevant to OPT-175B. So I expect that the number of people with access to OPT-175B is now greater than the number of people who have worked on producing a GPT-3-like model from scratch.[7] Open publication is therefore now the prevailing mechanism of diffusion for GPT-3-like models.
I am not aware of any cases of leak, e.g., someone being granted access to a closed-source model and then publishing that model themselves without permission. This is based on not having come across any such case in the course of my research.
I am not aware of any cases of theft or model stealing attacks on a GPT-3-like model. This is based on:
Not having come across any such case in the course of my research
Nova DasSarma (who works on security at Anthropic) not recalling any cases of ML model theft offhand during an interview (with the caveat that a fully successful case of theft would go undetected, so we can’t know for sure)[8]
Jeffrey Ladish—who works on security for Anthropic—also not thinking of any real-world cases of ML model theft in conversation[9]
The extent to which multiple discovery was involved in the diffusion of GPT-3-like models is more uncertain than the mechanisms above. However, after accounting for the evidence detailed below, I believe multiple discovery was not significantly involved in the diffusion of GPT-3-like models. In particular, I think if the relevant papers weren’t published, it would have taken six months (90% CI: 1 to 18 months) before any other actor would have discovered a model with the approximate capabilities of GPT-3.[10]
Below are reasons to think multiple discovery was involved in at least one case of GPT-3-like model diffusion:
Arguably the key insight behind GPT-3 was Scaling Laws for Neural Language Models (Kaplan et al., 2020)—the scaling laws implied that more-or-less direct scaling of compute, data, and parameter count from GPT-2 to GPT-3 would predictably achieve a lower loss, which is correlated with better performance on downstream tasks (the approximate form of these laws is sketched after this list). The scaling laws paper was published to arXiv on January 23, 2020, four months before GPT-3 was publicized in May 2020. This plausibly allows just enough time for an actor that has already developed GPT-2-scale models to notice this insight and scale up to GPT-3 (e.g., two months to prepare, one month to train, and one month to evaluate and publish).[11]
It seems that predecessors to Gopher (a GPT-3-like model from DeepMind) were already being developed before GPT-3 was publicized.[12]
There is some evidence that people at OpenAI were worried about other actors developing a GPT-3-like model first, though it’s unclear to me how justified the concern was.[13]
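To make the first bullet above more concrete: the headline result of Kaplan et al. (2020) is a set of empirical power laws relating test loss to model size, dataset size, and training compute. Roughly, using the approximate exponents reported in that paper (where N is the number of parameters, D the dataset size in tokens, C_min the optimally allocated training compute, and N_c, D_c, C_c fitted constants):

```latex
L(N) \approx \left(\tfrac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \approx 0.076
L(D) \approx \left(\tfrac{D_c}{D}\right)^{\alpha_D}, \quad \alpha_D \approx 0.095
L(C_{\min}) \approx \left(\tfrac{C_c}{C_{\min}}\right)^{\alpha_C}, \quad \alpha_C \approx 0.050
```

The practical implication was that scaling parameters, data, and compute together from GPT-2's level to GPT-3's level predicted a quantifiable reduction in loss without requiring any new algorithmic idea, which is the sense in which the scaling laws paper contained the key insight.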
Reasons to think multiple discovery was not involved in any cases of GPT-3-like model diffusion:
Geoffrey Irving (last author of Rae et al., 2021) told me that GPT-3 “did add an organizational push” for DeepMind to scale up language models to Gopher’s scale (which was 280 billion parameters and 5.8E+23 FLOPs of compute). This suggests that DeepMind would have produced a GPT-3-like model later, and certainly not earlier, if GPT-3 had not been published.
To my knowledge, nobody publicized a GPT-3-like model (according to my definition) until HyperClova, one year after GPT-3 was publicized (Naver, 2021). After that, many more GPT-3-like models were publicized. Based on my estimates of “Time from project start to final trained model”, one year is more than enough time to develop a GPT-3-like model. This suggests that these projects probably started after GPT-3 was publicized.
I estimate that GPT-3 arrived 11 months (90% CI: 5 to 17 months) earlier than expected, mostly based on trends in the amount of training compute used for ML systems at the time immediately before GPT-3 was publicized (see this appendix).[14]
Iulia Turc, a former software engineer at Google who worked on research using large language models, told me: “Somebody else would have inevitably reached the same scale [as GPT-3], but I really can’t make an educated guess about when. Research labs like Google clearly had the resources to do it even before OpenAI, but it’s unclear to me whether it would have been a priority.”[15] To me this suggests that Google had not already produced a model of the same or larger scale when GPT-3 was published, but on the other hand, there were actors (including Google) with the ability to do so. So the evidence from this quote seems roughly neutral on the question of multiple discovery overall, but informative nonetheless.
It’s possible that the publication of GPT-3 caused another developer to withhold publication of their own similar result, even though they were planning to publish at almost the same time—say, within two months. The reason the other developer might do this is to avoid losing recognition for their result, because it’s too similar to GPT-3. Instead, it might be better for them to do some further research and then publish a more novel result that gains more recognition. Doing this further research seems more likely than giving up on the project entirely, due to the sunk cost of training such a large model. However, I think it would most likely take less than six additional months for this further research to happen, and in fact no relevant publications came out within six months. So it seems less than 50% likely that this scenario happened.
My specific quantitative claim—that it would have taken 6 months (90% CI: 1 to 18 months) before any other actor would have discovered a model with the approximate capabilities of GPT-3—is based on (and the same as) an estimate I make in a later section about the impact of GPT-3’s publication.
It’s plausible that the publication of the GPT-3 and scaling laws papers was unnecessarily early in terms of beating other actors to the punch, but I don’t have enough evidence to be confident in that claim. Regardless, I think that developers should apply more scrutiny to whether they are really in a race to publish, because the harm of accelerating AI capabilities could outweigh the benefit of being the first to publish with a more responsible strategy (in order to establish better publication norms).
Caveat: I expect the above conclusions to change when it comes to the diffusion of future state-of-the-art language models, due to:
More closed publication practices: in the next post of this sequence I’ll argue that publication decisions by top language model developers will become more closed on average than they are now. I think this will make incremental research relatively more prevalent compared to replication and open-sourcing.
Greater incentive for theft: while I’m not aware of any cases of model theft so far, I expect that the incentive for theft will increase as the capabilities of models improve and state-of-the-art models continue to be closed-source. Improved capabilities will increase the payoff of theft. And I expect that, although leading AI labs will take measures to improve security around their models, there will be a point (if we are not there already) where the cost of attempting to steal the model may be lower than attempting to replicate the model—at least for the most capable hackers, among which are state actors.[16] These claims are uncertain, and I think at least one month of further research on risks from theft would be worthwhile for someone in the AGI governance community to do.
In contrast to the above changes, I expect the diffusion of models with similar performance to GPT-3 (rather than greater performance) will accelerate in the future.
Hardware costs will fall and algorithmic efficiency will continue to improve, enabling more and more actors to develop these models. I also expect there will be diffusion of better open-source tools that make it easier to train and run these models (similar to Megatron-LM). Presumably, many actors will then openly publish their models for the usual reasons, enabling even more actors to acquire and use the models.
I also expect that the incentive for replication will decrease in the future as different GPT-3-like models are trained to address various use cases and languages, and those models will also get open-sourced for the usual reasons.
Most important factors for GPT-3-like model diffusion
Below I discuss the most important factors for diffusion that I determined in the course of my research and that fell within my scope. Note that these are factors that made developing GPT-3-like models easier or more likely by the largest margin in various cases.[17] I don’t consider the core resources for developing GPT-3-like models as “factors” themselves—those resources (mainly compute and talent) are discussed in the previous post. Overall, I’m 80% confident that all of these factors are important enough for a longtermist researcher to spend at least one month full-time thinking about how to beneficially affect each of these factors.[18]
I think that the difficulty of accessing enough compute has been the largest hindering factor to the diffusion of GPT-3-like models. This was the case up until the release of OPT-175B in May 2022, after which GPT-3-like models became much more accessible.[19] My claim is based on the following evidence:
The actors that have succeeded in producing GPT-3-like models have all needed on the order of $1–10 million available to spend on compute.[20] This cost is much larger than the cost of labor or acquiring training data, according to my estimates (shown below). Furthermore, no major algorithmic insights needed to be figured out once the GPT-3 paper was published.
If we compare compute with the cost for talent—measured just in terms of the total salary of the project team for the duration of the project—compute seems to be a much larger hindering factor in this domain.[21] This Guesstimate model compares labor to compute cost for the average project to develop a GPT-3-like model. It suggests that the total compute cost is 16x (90% CI: 3x to 81x) higher than the labor cost. However, this model does not account for the difficulty of acquiring specific talent in large language model training. I explore the barriers to acquiring talent (with no clear conclusion) in this appendix. Talent and compute cost are also partially exchangeable, as I discuss in this section. (A minimal sketch illustrating the structure of this compute-versus-labor comparison appears after this list.)
In this appendix, I estimate that the cost of producing the unprocessed GPT-3 training dataset (including human labor) is one to two orders of magnitude lower than the compute cost for the final training run of GPT-3. Based on this, I am 90% confident that, for all other GPT-3-like models I investigated, producing or acquiring the dataset cost at least one order of magnitude less than the compute cost for training that model, given that all of these models seemed to use similar raw data or similar data-collection processes to GPT-3.
EleutherAI has so far failed to replicate GPT-3 because of limited access to GPUs (due both to CoreWeave’s budget and to chip supply shortages).
The PanGu-alpha model apparently failed to reach its full potential (given the parameter count of 200 billion and the dataset size of 1.1TB[22]) due to being undertrained—i.e., not enough compute was spent to train the model on an adequate number of tokens. I think this is most likely due to one or more of the following possibilities: (a) the authors ran out of time to complete the project, (b) the authors did not have the financial budget to train further, and/or (c) there was a technical problem during training that the authors did not know how to fix (before a deadline). I don’t have further evidence to distinguish these possibilities, but I put roughly equal weight on them, which means that (b) is a significant possibility.
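To illustrate the structure of the compute-versus-labor comparison referenced above (the actual inputs and outputs are in the linked Guesstimate model), here is a minimal Monte Carlo sketch in Python. All input ranges are hypothetical placeholders chosen for illustration rather than the values used in my Guesstimate model, so the printed numbers should not be read as reproducing my headline estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def lognormal_from_90ci(low, high, size):
    """Sample a lognormal whose 5th/95th percentiles are roughly (low, high)."""
    mu = (np.log(low) + np.log(high)) / 2
    sigma = (np.log(high) - np.log(low)) / (2 * 1.645)
    return rng.lognormal(mu, sigma, size)

# Hypothetical placeholder 90% CIs -- NOT the inputs to the actual Guesstimate model.
compute_cost = lognormal_from_90ci(1e6, 10e6, n)    # dollars spent on training compute
team_size    = lognormal_from_90ci(5, 30, n)        # core contributors
salary       = lognormal_from_90ci(1e5, 4e5, n)     # fully loaded dollars per person-year
duration_yr  = lognormal_from_90ci(0.25, 1.0, n)    # project duration in years

labor_cost    = team_size * salary * duration_yr
ratio         = compute_cost / labor_cost                    # compute cost as a multiple of labor cost
compute_share = compute_cost / (compute_cost + labor_cost)   # compute share of combined cost

for name, x in [("compute/labor cost ratio", ratio), ("compute share of total", compute_share)]:
    p5, p50, p95 = np.percentile(x, [5, 50, 95])
    print(f"{name}: median {p50:.2f} (90% interval {p5:.2f} to {p95:.2f})")
```

The point is only to show how a ratio estimate (like the 16x figure) and a cost-share estimate (like the 87% figure) can be derived from uncertain inputs; the figures quoted in the text come from the Guesstimate model itself.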
Difficulty acquiring the necessary machine learning and engineering expertise to execute the project (hindering factor)
I think that the difficulty of acquiring the necessary machine learning and engineering expertise was the second largest hindering factor to the diffusion of GPT-3-like models. To clarify, this claim is specifically about having the expertise to overcome the challenges of training large language models. This claim is not about the expertise to independently discover algorithmic insights, though I believe that is a lesser hindering factor. The claim is based on the following evidence:
Several experts I consulted with emphasized the importance of machine learning expertise, and in particular, engineering expertise, in developing large language models.
Iulia Turc, former Software Engineer at Google Research who worked with large language models such as BERT: “[I]ndustry labs have the upper hand [compared to universities] of good engineering talent. Since designing a model is very different from scaling it (i.e., training it in a distributed manner over a fleet of machines), it’s very important that scientists and engineers come together…”
A researcher training large language models at an AI safety lab:
“In academia, I think the main bottleneck [for replicating a large language model like GPT-3] is the ability to hire engineers to build out a codebase. The distributed training complexity is a rather different area of expertise than a Ph.D. student and doesn’t fit into the incentive structure very cleanly.”
“[To scale up machine learning at today’s cutting edge] I’d say you need all of: (1) Enough compute to train the model a number of times since the first try will probably not work, (2) Experts on distributed training of LMs [...] (3) Experts on ML. This doesn’t require as much creativity as most people might believe, but you do need to be able to understand what’s going on and debug your training process.”
Only AI labs with teams of 10 or more people have succeeded at producing GPT-3-like models.
BLOOM is a possible exception, depending on how one counts “core contributors”. The majority of the development contribution may not have been from HuggingFace employees, but rather from various academic and independent collaborators in the BigScience collective. I think the BLOOM project succeeded, talent-wise, through the crowd-sourcing of talent. The project was very open to contributors, and the best contributors naturally came to the fore. Also, by the time people worked on BLOOM, there was more accumulated knowledge from research (e.g., BigScience’s previous work on the T0-XXL model (Sanh et al., 2021)) and from open-source tools like Megatron-LM.
Another potential counterpoint: Although the GPT-NeoX team didn’t succeed at creating a GPT-3-like model, according to the lead contributor Sid Black (in personal correspondence[23]), there was around a 40–50% chance that they had the requisite talent and just didn’t get access to enough compute.[24] I would guess Black’s claim is overconfident, given that the team didn’t get to actually attempt GPT-3 replication with enough compute—if they had done so, I expect they would have encountered unforeseen challenges that would have stretched the duration of the project significantly. But the claim that “the GPT-NeoX team would have succeeded at creating a GPT-3-like model by February 2022 if they had access to enough compute from the beginning of the project” seems more than 20% likely to me.
No government labs were directly involved in the cases I studied. I would guess this is partly because they don’t have the relevant talent and would have a hard time acquiring it (e.g., because AI industry labs offer higher prestige, salaries, research freedom, and less bureaucracy). Governments were only involved via funding, in the BLOOM and PanGu-alpha cases.[25]
As an aside, I take the lack of direct government involvement as evidence that governments are generally more willing to fund teams that already have the requisite talent than to acquire the requisite talent directly.
Sponsorship of compute resources by separate parties (accelerating factor)
So far, I think the most important factor for lower-resourced actors to approach GPT-3-like capabilities has been the sponsorship of compute by separate parties. This accelerating factor is the flip side of the challenge of acquiring compute as a hindering factor—sponsorship allows these actors to leap over the obstacle of acquiring compute.
The first key example is that CoreWeave provided compute to EleutherAI for free to develop and train GPT-NeoX-20B. According to Sid Black, one of the main contributors to developing GPT-NeoX-20B, EleutherAI spent nothing out of pocket on compute for the GPT-NeoX project. Prior to this, EleutherAI was using a TensorFlow Research Cloud (TFRC) scheme that provided free access to TPUs, but this was not sufficient to train GPT-3.[26] The incentive for CoreWeave was to have their hardware tested as they were starting up their cloud computing operation, and to gain insight on what is required to use their hardware for training large language models.[27] The incentive for TFRC prior to this seemed to be testing their TPU hardware and advertising the advantages of that hardware.[28]
The second key example of compute sponsorship from my case studies is that BigScience was provided €3M from French research agencies CNRS and GENCI to train the BLOOM model on the Jean Zay supercomputer (BigScience, 2022).
Sponsorship can enable actors to use models closer to the cutting edge than they would otherwise have access to, to do research on such models, and to increase the number of people with access to these models (e.g., as happened with BLOOM open-sourcing its weights). But does the sponsorship of resources like compute ultimately matter for who develops transformative AI (TAI)? I think the sponsorship of resources is less likely to matter than diffusion among AI developers who can already afford to pay for the resources themselves, because the actors receiving sponsorship will tend to be lower-resourced to begin with, and therefore less likely to keep up with or surpass the state of the art. However, I think sponsorship is a factor worth bearing in mind when thinking about which actors could plausibly become contenders to develop TAI in the future, and when thinking about how to beneficially shape diffusion.[29]
To see this, consider that the sponsorship of compute could give smaller actors the necessary momentum to become more significant actors. As with the BigScience case, there could also be a big role for governments and associated funding agencies to play in sponsoring massive amounts of resources for AI developers. This is already the case in China. The Beijing Academy of Artificial Intelligence, Zhejiang Lab, and Peng Cheng Lab are Chinese government-sponsored entities that have provided support for funding and compute to recent AI research projects in China (Ding & Xiao, forthcoming). For instance, Peng Cheng Lab was involved in PanGu-alpha.
Open-source tooling for large-scale model training (accelerating factor)
Open-source tools that are specifically designed for large-scale model training were a notable accelerating factor in the cases I studied. There are two things to clarify about this:
If these tools were proprietary (but available to the public as commercial software), I don’t think the cost of the tools would be prohibitive. But the open-source nature of the tools is still important, because open-source tools are easier to use in the ML domain. Based on my own experience with ML code development, it’s important to be able to integrate open-source code with other code, and often to customize the code extensively, in order to suit a given machine learning project.
I am not referring to tools that are as essential to ML as PyTorch. Tools like PyTorch provide a foundation for any modern ML project, having become ubiquitous in ML research and development. Rather, I am referring to newer, more specific tools such as Megatron-LM. Megatron-LM makes it easier to train large-scale models that use the Transformer architecture (which all the GPT-3-like models in the diffusion database do).
The Megatron-LM codebase was first published in September 2019. It started as the code implementing NVIDIA’s 8-billion parameter language model, Megatron, which was introduced in Shoeybi et al. (2019).[30] Megatron was heavily based on the 1.5-billion-parameter GPT-2, the predecessor of GPT-3.[31] The Megatron-LM codebase was later used in Narayanan et al. (2021),[32] which as the title suggests, offers useful insights on efficient large-scale language model training.
Shevlane (2022) claims that the Megatron code release “made it very easy for anyone to train GPT-2-like models if they had access to enough GPUs; Aaron [a Brown University graduate student who replicated GPT-2] told [the author] that with the Megatron code and enough money, a high school student could do it.”[33] By the same logic, I make a similar claim for the current Megatron-LM codebase (after the “efficient large-scale training” paper was published) with respect to GPT-3. The Megatron-LM codebase has formed a significant part of the overall code base for OPT-175B, Jurassic-1-Jumbo, GPT-NeoX-20B, BLOOM, and Megatron-Turing NLG—though the latter is not really relevant to diffusion, since NVIDIA was directly involved.[34] The fact that Meta AI and AI21 Labs both used Megatron-LM code suggests that they benefit from open-source tools released by other actors. So the benefit is not limited just to small actors that tend to have less engineering talent, such as academic labs or independent collectives.
It’s difficult to quantify how much the Megatron-LM code helps, and it certainly does not remove most of the compute cost. The code merely helps with implementation. But given the prevalence of the Megatron-LM code in my case studies, I expect that it significantly reduces the talent barrier to start a GPT-3-like model development project. It probably also saves time and money by improving efficiency. Sid Black of EleutherAI told me that Megatron-LM and another tool called DeepSpeed were frustrating and time-consuming to use and extend. Despite that, he said that Megatron-LM is “really fast” and he was glad to have these tools available when developing GPT-NeoX-20B.
A similar tool which is often used alongside Megatron-LM is Microsoft’s DeepSpeed. According to the GitHub repo, “DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.” DeepSpeed, or a “forked” version of it on GitHub, was used in all the case studies where Megatron-LM was used except OPT-175B (as far as I could tell).
Similar specialized open-source software is used by other AI developers. In the Chinese sphere, there is MindSpore, which was used to train PanGu-alpha. Google’s PaLM used T5X and JAX, while DeepMind’s Gopher and Chinchilla used JAX and Haiku—though these are less specialized for language model training than Megatron-LM is.
Publicity that draws attention to an existing model’s capabilities (accelerating factor)
Although it is difficult to measure and track the effects of the hype surrounding an AI research result, I believe that hype is an important accelerating factor in the diffusion of GPT-3-like models, and will probably play a key role in the diffusion of future state-of-the-art machine learning models. What I mean by hype is a combination of (a) the amount of attention that something gets, and (b) the belief that the thing is promising in some way, e.g., that it is worth replicating, or reveals a research direction worth pursuing. My point about the importance of hype here is related to my previous takeaway about the importance of attention to information.
First of all, GPT-3 was surprising in some sense. I estimate that GPT-3 was published 11 months earlier than expected based on training compute trends at the time (90% CI: 5 to 17 months).[35] Second, the insight which GPT-3 demonstrated was significant. Shevlane (2020, pp. 15-16) explains this point: “The idea [of the release strategy of GPT-2 and GPT-3] was that the models themselves were the hardest thing for bad actors to recreate, given the high compute costs required to produce the models. This was assuming that the papers, in contrast, did not contain truly novel insights. However, this focus on models has been questioned, with some risk-conscious AI researchers arguing that the GPT-3 paper was actually the risky thing. The paper, alongside other papers that OpenAI published in 2020, demonstrated to many onlookers the benefits of scale: if you throw a large amount of compute and data at a model with a very high number of parameters, you can get very impressive capabilities. Some people viewed this as dangerous in that it accelerates the field’s progress towards advanced AI, thus giving the world less time to prepare” (my emphasis).
A massive increase in hype around GPT-3 occurred not when the GPT-3 paper (Brown et al., 2020) was first published, but after people started demonstrating capabilities with the OpenAI API on Twitter.
The paper was made deliberately boring, and published without the blog post that normally accompanies milestone results from OpenAI.[36]
As pointed out in Shevlane (2022), the GPT-3 Google search trend in 2020 indicates that interest in GPT-3 only rose to a significant level about seven to eight weeks after the paper was published on May 28, 2020. The relative search interest sat around 1–2% between May 28 and July 11, then exploded from 1–2% to 36% after July 11, and then peaked at 100% between July 19 and 25.[37] This trend correlated with Twitter activity involving GPT-3. Shevlane (2022) writes: “I downloaded around 63,000 tweets mentioning ‘GPT-3’ from Twitter’s API, from the period 12th-22nd July 2020. The number of tweets mentioning GPT-3 climbed from close to zero at the start of this period to a spike of about 900 (per 3 hour interval) around July 20th. [...] The tweets I found with the most engagement (in terms of retweets and likes) were early users of GPT-3 who were demonstrating GPT-3’s ability to write functioning software code. This was a much more accessible demonstration of GPT-3’s capabilities than the paper had given.”[38]
I’m very uncertain whether this hype strongly influenced the subsequent R&D decisions of specific leading AI developers. My best guess is that the knowledge of GPT-3’s existence sped up both DeepMind and Google’s work scaling up language models by six months (90% CI: 1–18 months). But I have not been able to distinguish whether this acceleration was driven by insider knowledge, or the publication of GPT-3, or the hype generated after publication, or some combination of those factors. In addition to the surprisingness and hype of GPT-3 argued above, I have the following evidence for this claim:
A researcher who has trained large language models at an AI safety lab told me: “I think GPT-3 probably pushed other labs in this direction about a year earlier than they otherwise would have. It’s a bit hard to know for sure. There were certainly other groups training larger and larger LMs each few months and they were doing better and better, but it wasn’t obviously clear to everyone that scale was the main ingredient there.” (Note that this claim of “a year earlier” had a small weighting in my estimate of when the equivalent of GPT-3 was expected to be published, stated in a point below.)
Geoffrey Irving (last author of Rae et al., 2021) told me that “GPT-3 did add an organizational push” for DeepMind to scale up their language models.[39]
I also have one piece of countering evidence, but I don’t think this outweighs the favoring evidence. I asked Iulia Turc—a former Software Engineer at Google Research who worked with language models such as BERT: “Do you think that GPT-3’s increased model size, and the resulting improvements in task performance, generality, and reduced need for fine-tuning, was surprising to researchers at the cutting edge of natural language processing?” Turc responded: “I don’t think it was surprising, I think it was impressive from an engineering point of view.”
I estimate that GPT-3 arrived 11 months (90% CI: 5 to 17 months) earlier than expected, mostly based on trends in the amount of training compute used for ML systems at the time immediately before GPT-3 was publicized (see this appendix).
I used the estimate of “when a GPT-3 equivalent was expected” above as a strong prior for “how much GPT-3 sped up DeepMind and Google’s work scaling up language models”. But after intuitively accounting for the evidence in the above quotes from experts, I made the following updates to reach my final estimate of six months (90% CI: 1 to 18 months):
The median estimate of the speed-up should be earlier, because (a) Iulia Turc didn’t think GPT-3 was very surprising in terms of scale or performance, (b) the estimate of “when a GPT-3 equivalent was expected” doesn’t fully account for the growing interest in pretrained large language models among top AI developers since around 2018 (when OpenAI’s original GPT (Radford and Narasimhan, 2018) and Google’s BERT (Devlin et al., 2018) were published).
The confidence interval should be wider, given that I have almost no knowledge of what DeepMind and Google’s plans around language model scaling actually were around the time that GPT-3 was published.
Diffusion cascades: the publication of progress accelerates the diffusion of the final product
Here I introduce the concept of a diffusion cascade: the acceleration of diffusion that results from diffusion of artifacts that are relevant to producing a given closed-source model. The concept of a diffusion cascade applies when initially there is a given closed-source model that is only accessible to one actor, and no other actor fully understands how to produce that model and/or has all the resources needed to produce that model.[40] The incremental progress and open-sourcing done by other actors in the meantime fill in the gaps in knowledge and resources, and thereby accelerate diffusion. Even if the latest capability advance is only reachable by leading AI developers initially, those leading developers can make diffusion to other actors happen more easily and sooner than otherwise.
Tools, datasets, smaller models, and the accumulation of published details speed up the cascade
Below I list some specific drivers of diffusion cascades, and empirical examples of those drivers being involved in diffusion cascades. I also indicate the current relative importance of each driver on a subjective 1–5 scale (5 is most important) according to my judgment, which is based on a combination of independent reasoning and the empirical examples. Importance means how much this driver has accelerated diffusion empirically.[41]
Open-source software tools. (Importance: 5) While there are long-standing open-source tools for machine learning such as PyTorch, more specific open-source tools specialized for large language model training can emerge, which embed a lot of knowledge of how to train large language models. Megatron-LM and DeepSpeed are open-source tools for training large language models, and were used extensively to train GPT-NeoX-20B, OPT 175B, Jurassic-1-Jumbo, and BLOOM. Sid Black told me that while he had qualms with Megatron-LM and DeepSpeed (namely, they were frustrating and time-consuming to use and extend), Megatron-LM is “really fast” and he was glad to have these tools available when developing GPT-NeoX-20B.
Accumulation of insights and implementation details from different research articles. (Importance: 4) Even if there is a long series of closed-source language models developed by different actors, the current tendency is for many of those actors to publish research articles with information about their methods (more on this in the post on publication norms and release strategies). Due to the various independent decisions about what information is included in these research articles, more and more information on how to reproduce a given model can gradually be accumulated.
Example: Narayanan et al. (2021). The paper accompanying the release of the Megatron-LM tool includes information on different types of parallelism methods and how they can be composed to scale to “thousands of GPUs and models with trillions of parameters,” and “intuition as to how to configure distributed training of a large model.” This paper does not itself present new models, it just provides insight on how to scale and train them efficiently.
Open-source smaller models. (Importance: 3) Many pretrained language models that are smaller but similar in design to GPT-3 are open-source—for example, GPT-2, and the OPT family (except OPT-175B, which isn’t smaller than GPT-3). Having these models (and the code to instantiate the models) available makes the precise implementation of those models clearly and completely known, beyond just specifying the model architecture and its hyperparameters in a research paper. However, if the smaller model falls significantly short of the full model in performance, the full model normally needs to be trained from scratch,[42] so my impression is that having smaller models available does not necessarily reduce the challenge of scaling up. Empirically, the publication of smaller models is only of moderate importance, because the current norm is to publish model architecture details in research papers (including for the larger models, even when the model weights aren’t published), and that saves most of the work in figuring out how to implement a model.[43]
Open-source datasets. (Importance: 3) For example, The Pile was used to train GPT-NeoX-20B and (partially) OPT 175B (Gao et al., 2020). Although such datasets for language models usually just consist of text data scraped from public internet sources, scraping the data and storing it in an appropriate format is a significant effort.
Coordinating on greater secrecy, even just delayed publication, can slow down diffusion
The obvious way to slow down a diffusion cascade, and diffusion in general, is to have greater secrecy. In the absence of coordination, the best that one actor can do on this front is to try to keep knowledge of a project or model completely secret, not even revealing the model’s existence.
My impression is that it is not uncommon to keep models secret temporarily (i.e., delaying publication past the minimum time needed to produce a publication).
For example, the GPT-3 175B model was not announced for “months” after it was trained, and this seemed partly motivated by a desire to delay progress toward artificial general intelligence.[44] My low-confidence best guess is that the paper was published seven months after training finished and could have been ready to publish four months sooner than it was if the work towards publishing the paper was done as soon as possible.[45]
The publication of Gopher was delayed even longer than my estimate for GPT-3. Based on the Gopher model card, the paper was published 12 months after the model finished training.[46] So by similar logic, I think the Gopher paper could have been published nine months sooner than it was. I speculate that the delay in publication about Gopher was for the same reason as not releasing the training code, dataset, and model weights for Gopher. Geoffrey Irving told me that the reason for the latter was to “[reduce] diffusion of objects that can cause harm if not aligned further.”
A staff member at an industry AI lab, who has worked with large language models, told me off-hand that publication of Google’s PaLM model was probably delayed by a couple of months, but this is weaker evidence and I did not find out the rationale for the delay.
One thing to note here is that while a model may remain secret to the general public until it is published, I suspect that information does sometimes leak, especially among peers in AI development at different labs.[47] Rumors can also circulate, even to the public, though it’s unclear when this is intentional and when it is unintentional. For example, Hao (2020) seems to refer to the text-to-image model DALL-E (or similar preliminary work) 11 months before DALL-E was announced (Ramesh et al., 2021).[48]
Besides just delaying publication, actors could limit diffusion cascades (if that is their goal) through more comprehensive secrecy around information and resources—even if the existence of the model and research results about the model are publicized. Given the various information sources and artifacts that can drive a diffusion cascade, it would be more effective to keep secure not just the model, but also, e.g., the specialized software tools that were used to train it, the datasets, and the details of training infrastructure and parallelism strategies. For example, the developers of GPT-3 did not explain or open-source the software tooling that was used to train the GPT-3 model. This seems to have left a gap that Narayanan et al. (2021) had to spend time filling (i.e., with the Megatron-LM codebase).
Appendix: GPT-3 came 5–17 months earlier than expected, due to OpenAI’s willingness to spend on the compute and to solve the engineering challenges
I used three methods to estimate when experts would have expected GPT-3 (or the rough equivalent) to be released, immediately before GPT-3 was actually publicized. Estimating this provides evidence about the extent to which multiple discovery was involved in the diffusion of GPT-3-like models, and about the counterfactual impact of publicizing GPT-3. The estimates are detailed in the following subsections.
Expected timing based on the average training compute trend
First I analyze how unexpected GPT-3 was in terms of the average trend in training compute for models over time. My analysis is based on this interactive plot of compute trends by Epoch. Below are the initial steps I took and the results I obtained from different plots:
Click the three-bar menu in the top right of the plot to open the settings
Check “Separate by category” so that the Language domain data has its own trends
Uncheck “Split trendlines in Large Scale Era”
Set “Large scale” to “ignore” so the red “Large Scale” trend disappears
Set the x-axis maximum to just before April 2020 using the slider at the bottom, such that all language models up until GPT-3 175B are included, but GPT-3 175B itself is excluded.
At the time of writing, there is a typo in the database used for this data which sets the publication date of GPT-3 to April 28, 2020 rather than May 28, 2020. I don’t think this affects my conclusions significantly.
You may have to zoom into the plot with the scroll wheel to verify this.
Alternatively, set the “endDate=” part of the page URL to an exact value, e.g. “endDate=2020-3-31”
The resulting Language domain trend in the Deep Learning era is 0.8 OOMs/year
Using a straight edge to visually extrapolate the Language trend, I find that the trend predicts 3E+23 FLOPs of compute would be reached by about October 2021—17 months after the actual publication date of GPT-3 in May 2020.
Weight on this estimate: 0.4. Higher than average because I think the domain-specific trend is more reliable. The greater number of samples from the full Deep Learning Era also makes it more reliable.
Now check “Split trendlines in Large Scale Era”. The “Large Scale Era” Language trend should now be 1.1 OOM/year. Link to these settings is here.
Prediction (using the same extrapolation method as above): about February 2021, nine months after actual
Weight: 0.2. This is a more “inside view” trend which I think is plausible. It takes better account of the large scale models that were released more recently. But the sample is slightly smaller so the prediction is not as reliable.
Use the one trend in the “Large Scale Era”—0.4 OOMs/year
Prediction: October 2026, which is 6 * 12 + 3 = 75 months after actual
Weight: 0.1. The data looks very noisy, spans a short time period, and doesn’t account for domain-specific trends. But it is still an informative “outside view” estimate.
Now Uncheck “Split trendlines in Large Scale Era” (link)
Use the one “Deep Learning Era” trend
Prediction: February 2023, which is 36 − 3 = 33 months after actual
Weight: 0.2. To me this is a stronger “outside view” prediction than the previous, because there are more samples.
Now set the “Large scale” dropdown setting to “label” and use the “Large Scale” trend of 0.3 OOMs/year (link)
March 2022—24 − 2 = 22 months after actual
Weight: 0.1. Small sample size, but still an informative estimate based on the belief that the “Large Scale” trend is more relevant.
Filtered standard deviation of estimates (i.e. excluding the 75 month estimate): 10 months
I used the weighted average as the central estimate, and the filtered standard deviation to get 90% confidence bounds. Thus my first estimate for the expected arrival time of GPT-3 is June 2022 (90% CI: August 2021 to April 2023). A major limitation of this estimate is that I am using a prediction of the average milestone system rather than a prediction of the most expensive system. Including the “Large Scale” trends in my aggregate prediction compensates for this somewhat (because the “Large Scale” data has the most expensive systems), but the above average predictions are probably still later than experts actually expected. Due to this limitation, I only put 30% weight on this estimate.
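As a check on the aggregation arithmetic above, the following Python sketch recomputes the central estimate and bounds from the five month-offsets and weights listed in the bullets (17, 9, 75, 33, and 22 months after GPT-3's actual publication in May 2020, with weights 0.4, 0.2, 0.1, 0.2, and 0.1), excluding the 75-month estimate from the standard deviation.

```python
import numpy as np

# Months after GPT-3's actual publication (May 2020) at which each extrapolated
# trend reaches GPT-3's training compute, and the weight placed on each estimate.
months_after_actual = np.array([17, 9, 75, 33, 22])
weights             = np.array([0.4, 0.2, 0.1, 0.2, 0.1])

central = np.average(months_after_actual, weights=weights)   # ~25 months, i.e., ~June 2022

# Filtered standard deviation: exclude the 75-month outlier.
filtered_sd = np.std(months_after_actual[months_after_actual != 75], ddof=1)  # ~10 months

print(f"central estimate: {central:.1f} months after May 2020")
print(f"90% bounds: {central - filtered_sd:.1f} to {central + filtered_sd:.1f} months after May 2020")
```

These values (roughly 25 months after May 2020, with bounds of about 15 and 35 months) correspond to the June 2022 central estimate and the August 2021 to April 2023 interval stated above.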
Expected timing based on the upper range of the compute trend
One way to improve on the first estimate is to look at when the trend predicts GPT-3’s training compute minus some amount of deviation based on the variance in the data. Due to time constraints I have not computed a confidence interval for the trendline. However, visually inspecting the Language category data over the whole “Deep Learning era” in this plot, we can see that data points that are about 1 order of magnitude above the trend line are common. For example, Meena on Jan 28, 2020 has 1.1E+23 FLOP while the trend is at about 1E+22 FLOP, and Seq2Seq LSTM on Sep 10, 2014 has 7.3E+18 FLOP while the trend is at about 4E+17 FLOP. The biggest outlier is GNMT (Sep 26, 2016) at 6.9E+21 FLOP when the trend is only at about 2E+19 FLOP; however, I think this is too large an outlier to significantly influence people’s best-guess expectations about when GPT-3’s amount of training compute would be used.
Based on this rough inspection, I will just look at when the trendline predicts one order of magnitude lower than the true value, i.e., when it predicts 3E+22 FLOP rather than 3E+23 FLOP. This appears to occur in late July 2020, only 2 months after GPT-3 was actually published.
Based on this, I chose 2 months as my central estimate for the time that GPT-3 was expected (in terms of training compute), relative to when it was actually published. Like the first estimate, I used the filtered standard deviation of 10 months to get confidence bounds. Thus my second estimate for the expected arrival time of GPT-3 is July 2020 (90% CI: December 2019 to May 2021). Although this estimate is less rigorous than the first estimate, I think it is closer to the quantity I’m actually trying to estimate, so I put 50% weight on it.
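Both this estimate and the previous one rest on the same extrapolation step: asking how long a trend growing at a fixed number of orders of magnitude (OOMs) per year takes to climb from its current level to a target amount of training compute. A minimal sketch of that arithmetic is below. The starting level and date are placeholder values roughly read off the Epoch plot (the Language trend sitting near 1E+22 FLOP around late January 2020), so the resulting dates only approximately match the dates I read directly off the plot.

```python
import datetime
import math

def months_until_trend_reaches(target_flop, trend_flop_now, ooms_per_year):
    """Months a log-linear compute trend needs to grow from trend_flop_now to target_flop."""
    ooms_to_go = math.log10(target_flop / trend_flop_now)
    return 12 * ooms_to_go / ooms_per_year

# Placeholder reference point, roughly read off the plot: the Language-domain trend
# sits near 1e22 FLOP around late January 2020, growing at about 0.8 OOMs/year.
reference_date = datetime.date(2020, 1, 28)
trend_flop_now = 1e22
ooms_per_year  = 0.8

for label, target in [("GPT-3's training compute (3e23 FLOP)", 3e23),
                      ("one OOM below GPT-3 (3e22 FLOP)", 3e22)]:
    months = months_until_trend_reaches(target, trend_flop_now, ooms_per_year)
    date = reference_date + datetime.timedelta(days=months * 30.4)
    print(f"{label}: ~{months:.0f} months after {reference_date:%B %Y} (~{date:%B %Y})")
```

With these placeholder inputs the dates come out a month or two later than my plot-based readings, which is within the noise of reading values off a log-scale chart.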
One expert opinion
Finally, I have some evidence about the expected timing of GPT-3 from one researcher who has trained large language models at an AI safety lab. They told me: “I think GPT-3 probably pushed other labs in this direction about a year earlier than they otherwise would have. It’s a bit hard to know for sure. There were certainly other groups training larger and larger LMs each few months and they were doing better and better, but it wasn’t obviously clear to everyone that scale was the main ingredient there.” This isn’t a direct claim about when GPT-3 was expected to arrive, but their statement suggests that if GPT-3 had been published one year later, that would have been more in line with the expectations of the field. As with the other estimates, I put a confidence interval of +/-10 months around this 12-month estimate. So my third estimate is May 2021 (90% CI: July 2020–March 2022). Since this is based on an off-hand comment from one expert, I only put 20% weight on it.
Overall estimate: 11 months (90% CI: 5 to 17 months) sooner than expected
I put my three estimates together in a weighted average using this Guesstimate model and obtained an overall estimated delay of 11 months (90% CI: 5 to 17 months), or an estimated date of April 2021 (90% CI: October 2020 to October 2022). Note that the confidence interval does not account for the correlation between the confidence intervals of the individual estimates, and the correlation between the first and second estimates (due to using the same data and trend), so it probably should be wider to reflect my true confidence.
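The Guesstimate model linked above is the authoritative version of this aggregation; the snippet below is only a rough reconstruction of it. It assumes each of the three estimates is a normal distribution centered on the values given above (roughly 25, 2, and 12 months after GPT-3's actual publication, each with a +/-10-month 90% interval) and takes a weighted average of samples with weights 0.3, 0.5, and 0.2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# (central months after actual publication, half-width of 90% CI in months, weight)
estimates = [(25, 10, 0.3),   # average training compute trend
             (2, 10, 0.5),    # upper range of the compute trend
             (12, 10, 0.2)]   # one expert's comment

# Weighted average of samples from the three normal distributions.
combined = sum(w * rng.normal(mu, half / 1.645, n) for mu, half, w in estimates)

p5, p50, p95 = np.percentile(combined, [5, 50, 95])
print(f"GPT-3 expected ~{p50:.0f} months after it actually arrived "
      f"(90% CI: {p5:.0f} to {p95:.0f} months)")
```

Under these assumptions the output is consistent with the headline figure of 11 months (90% CI: 5 to 17 months), though, as noted above, the interval is probably too narrow because it ignores correlations between the estimates.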
What this overall estimate implies is that GPT-3 arrived significantly earlier than expected. I think that the most likely reason for this unexpected event is OpenAI simply being willing and able to invest in a larger amount of compute. The “willing” part is probably the key factor in OpenAI getting to this amount of compute before other leading language model developers just prior to GPT-3’s release, especially Google.
Acknowledgements
This research is a project of Rethink Priorities. It was written by Ben Cottier. Thanks to Alexis Carlier, Amanda El-Dakhakhni, Ashwin Acharya, Ben Snodin, Bill Anderson-Samways, Erich Grunewald, Jack Clark, Jaime Sevilla, Jenny Xiao, Lennart Heim, Lewis Ho, Lucy Lim, Luke Muehlhauser, Markus Anderljung, Max Räuker, Micah Musser, Michael Aird, Miles Brundage, Oliver Guest, Onni Arne, Patrick Levermore, Peter Wildeford, Remco Zwetsloot, Renan Araújo, Shaun Ee, Tamay Besiroglu, and Toby Shevlane for helpful feedback. If you like our work, please consider subscribing to our newsletter. You can explore our completed public work here.
Note that this is not a fair comparison with talent holistically. Talent can be the key bottleneck even when salaries are only a small fraction of project costs, due to the time and financial cost of producing enough people with the requisite skills. Further analysis of the holistic talent cost seems worthwhile in future work.
Sponsorship of compute resources could involve an actor doing any of the following things: (a) giving another actor ownership of compute hardware, (b) giving another actor access to compute hardware, (c) giving another actor money that can only be used on compute, or (d) giving another actor money with the intention that it is used for compute. Only cases (b) and (c) occurred in my case studies.
E.g., Beijing Academy of Artificial Intelligence (BAAI) and Peng Cheng Laboratory (PCL) were involved in the GLM-130B and ERNIE 3.0 Titan models respectively. See my survey of models covered previously for details.
I won’t make the effort to detail all these insights, but note that the Gopher paper (Rae et al., 2021) is titled “Scaling Language Models: Methods, Analysis & Insights from Training Gopher”.
I assessed which models are GPT-3-like in a previous post. The nine GPT-3-like models are Gopher, Hyperclova, Jurassic-1-Jumbo, Megatron-Turing NLG, LaMDA-PT, Yuan 1.0, ERNIE 3.0 Titan, Chinchilla, and PaLM.
In a previous post, I estimated that 1000 (90% CI: 200–3000) people could be eligible to access the model weights of OPT-175B, and all of these people could be granted access in the first year following release of OPT-175B. I don’t know what number of people are actually permitted to access OPT-175B so far (i.e., who’ve requested and been granted permission) and it’s very likely lower than the number of people that could be eligible, but as of November 2022 I think that number is more than 80% likely to be higher than 73, which is the total of “core team size” for the models that I estimated “core team size” for (see this cell of the diffusion database).
See Wiblin and Harris (2022): Rob Wiblin: “Are there any historical case studies of information leaks in ML? Are there any cases where an ML model has been stolen in the past?”. Nova DasSarma: “That’s a great question. I don’t think I can think of one offhand actually. If they have been stolen, then it’s one of those things where they’ve kept hush-hush about it.”
Paraphrasing from personal correspondence: Ben Cottier: “Do you know any examples of hackers accessing ML-related artifacts like datasets, trained models, etc.?” Jeffrey Ladish: “Ram Shankar Siva Kumar from AI Red Team at Microsoft—they used phishing to steal a model etc. That’s the only example I know of.” I found Field (2022) related to what Jeffrey Ladish was referring to. This isn’t a “real world case of ML model theft” in that it was a red-teaming exercise and didn’t actually result in diffusion to unauthorized parties.
I think doing this in four months would probably be feasible, based on my estimates of training wall-clock time and total project duration (i.e., time until having the trained model; this excludes time for writing and publishing a paper) in the diffusion database. The case with the most confident estimates is OPT-175B, with a total project duration of 78 days, including 33 days of training time. However, there were four months from OPT-175B completing training to the paper being published in May 2022. So my estimate of one month to evaluate the model and publish is probably too short.
Geoffrey Irving (Safety Researcher at DeepMind) told me that “[People who worked on Gopher] had already started LLM scaleup for the purpose of using them for communication and recursion-based alignment schemes soon after I joined [DeepMind, from OpenAI, in October 2019], but GPT-3 did add an organizational push.”
See Shevlane (2022). A senior member of OpenAI (who is specified on p.27 of the PDF) told the author: “GPT-3 existed for a long time before the paper came out. We delayed the paper. [...] But it’s months, it doesn’t really count. And you’re sitting there, fucking white-knuckling it, because it’s really costly if someone releases their paper, and you have fucked this up somehow. So you’re under pressure” (p.66 of the PDF).
This is just a rough estimate, and expecting a result to be published by a certain date does not guarantee that no other equivalent model would have been published otherwise. Nonetheless, it is evidence in the direction of “multiple discovery was not involved in any cases of GPT-3-like model diffusion”.
I focus on development rather than access to GPT-3-like models here because I think development is more important. See a previous post for my reasoning on this.
In my case studies there is a close relationship between the factors for diffusion and the resources that drive capabilities (i.e., money, compute, data, and talent). I think this is because replication and incremental research were the main mechanisms of diffusion for two years. The actors involved had to actually develop models independently in order for the models to diffuse, because there weren’t any open-source models for a while. But if the main diffusion mechanism had happened to be espionage, then an accelerating factor might have been poor information security at an organization. So, in principle, the factors for diffusion and the resources that drive capabilities can be quite separate.
This is because OPT-175B allows more people to get direct access to its model weights, and training a model to produce those weights in the first place seems to be the most compute-intensive aspect of AI development/deployment.
See the “Training cost (2022 USD)” column of the diffusion database, noting which models are classified as GPT-3-like in the “GPT-3-like model?” column. Some GPT-3-like models in the database do not have cost estimates, but seem very likely to fall within the $1–10M cost range given their training compute (see the “Training compute (FLOPs)” column).
Note that this is not a fair comparison with talent holistically. Talent can be the key bottleneck even when salaries are only a small fraction of project costs, due to the time and financial cost of producing enough people with the requisite skills. Further analysis of the holistic talent cost seems worthwhile in future work.
Black indicated this rough 40–50% confidence after seeing a draft of this text (which included my skepticism about Black’s claim). Black originally told me (paraphrasing from conversation) that “We did kinda become bottlenecked by compute—if CoreWeave had offered more GPUs, we probably could have [replicated GPT-3].” I interpreted the word “probably” to be more than 50% confidence.
See Shevlane (2022, p. 73): “The greatest bottleneck has been getting access to enough compute. Initially Eleuther was still using Google’s TFRC scheme. This was not sufficient…”
Shevlane (2022, p. 73): “[CoreWeave] planned to buy more NVIDIA GPUs and rent them out to people training large models. Connor told me: ’So, the deal was: we test the hardware, we figure out what do you need to train these kinds of models . . . because they don’t have in-house capacity ML engineering talent. And then they buy [the hardware]. We get to train our model on it and release it for free. And everyone’s happy.’”
Shevlane (2022, p. 40): “I asked Aaron [one of the Brown University graduate students that did a project replicating GPT-2] what value the Google’s TFRC team would have seen in the project: ‘To test the systems, and just like...They just want to get more papers out there on it that can only be done on TPUs, because if you’re a company and you want to iterate on that for your own personal thing then you have to pay them to use TPUs. That’s basically it—that’s basically the value in general.’”
Sponsorship may also be important in the sense that it increases the number of people working on larger-scale AI projects, which may increase the number and expertise of AI engineers and researchers, who may then be hired by the leading AI labs.
On p.2 of the paper it says “We open source our code along with the training and evaluation pipelines at https://github.com/megatron-lm”. That link is broken, but version 4 of the paper (Shoeybi, 2020) changes the link to https://github.com/nvidia/megatron-lm, so I assume that these links correspond to the same codebase which has been updated over time.
See Shevlane (2022, Ch. 2, p. 3, or p. 66): “In addition to delaying the paper, another strategy was to write the paper in a way that avoids attention-grabbing. The paper was written so as to avoid ‘hype’ and include discussion of the model’s weaknesses.”
Another interesting aspect of the search trend is the breakdown by region. China had the highest fraction of total searches; South Korea ranked 2nd at 34% of China’s level, and the US ranked 17th at 11% of China’s level. However, note that many small countries rank highly because the metric used is the fraction of total searches within the given region.
Full correspondence is available upon request. Irving did not make clear what exactly “GPT-3” refers to in that claim—whether it was insider knowledge of GPT-3 before it was published, the publication of the paper, the huge publicity after publication, or some combination of those events.
Note that I haven’t tried to predict how important each type of artifact will be in future diffusion cascades; I leave that to potential future research.
From my limited understanding of the Transformer architecture and how it tends to be scaled up, it is conceivable that learned weights from a smaller model could be copied into a larger model, with the extra weights left at their random initial values. But even if this is possible, I doubt it would be as effective as training the full-size model from scratch, since I have not heard of this method being used effectively.
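To make the idea concrete, here is a minimal PyTorch sketch of what such weight copying could look like for a single fully connected layer. The layer sizes are hypothetical, and this is only an illustration of the mechanism, not a method that any of the models discussed here are known to have used:

```python
import torch
import torch.nn as nn

# A layer from a hypothetical smaller trained model, and the corresponding
# (wider) layer in a scaled-up model.
small = nn.Linear(768, 768)
large = nn.Linear(1024, 1024)

with torch.no_grad():
    # Copy the learned weights into one corner of the larger weight matrix;
    # every other entry keeps its random initial value.
    large.weight[:768, :768] = small.weight
    large.bias[:768] = small.bias
```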
This claim is based on the fact that all nine of the large language models I studied in depth detail their model architecture and associated hyperparameters—see this column in the diffusion database.
Shevlane (2022, Ch. 2, p. 3, or p. 66): “Proponents of AGI risk will sometimes criticise OpenAI for contributing too much to advances in AI capabilities [...] It appears that these kinds of considerations did inform the way that GPT-3 was shared. [an OpenAI staff member] told me: ‘GPT-3 existed for a long time before the paper came out. We delayed the paper. That was one of the things we could do for AGI stuff. But it’s months, it doesn’t really count.’”
My best guess is that the GPT-3 175B model finished training in October 2019, seven months before publication in May 2020—my reasoning is in the note of this cell of the diffusion database. I guess that the evaluation and paper-writing process took about three months in total, based on my intuition of how long different steps take. I think this is longer than for most AI research papers, but the paper is long and seems to have required unusually high effort. That implies a four-month delay in publication.
The Model Card in Appendix B of the paper (p.49) states the “Model Date” is December 2020, and according to the paper that introduces Model Cards this means “When was the model developed?” I interpret “developed” as the date that the model finished training—this interpretation is partly based on another detail from the Gopher paper (Rae et al., 2021): “We trained Gopher for 920 hours in November and December 2020 in Google’s Georgia datacentre.” (Appendix F, p.103)
This is based on at least two AI developers at leading AI labs agreeing with me in informal conversation that this does sometimes occur, but I do not have any record of those conversations.
The article states “One of the biggest secrets is the project OpenAI is working on next. Sources described it to me as the culmination of its previous four years of research: an AI system trained on images, text, and other data using massive computational resources.”
Drivers of large language model diffusion: incremental research, publicity, and cascades
This post is one part of the sequence Understanding the diffusion of large language models. As context for this post, I strongly recommend reading at least the 5-minute summary of the sequence.
Key takeaways
Up until the release of OPT-175B in May 2022, incremental research (i.e., research that makes a relatively small change to an existing method) was the prevailing diffusion mechanism for actors to gain direct access to the weights of a GPT-3-like model; nine GPT-3-like models were developed in that way prior to OPT-175B, of which none had their weights made widely accessible.[1] The wider accessibility of OPT-175B changed the prevailing mechanism to open publication, based on my estimate that more actors have direct access to the OPT-175B model weights than actors that have developed GPT-3-like models themselves. (more)
I don’t think that multiple discovery (i.e., two actors independently coming up with the same idea or result) was significantly involved in the diffusion of GPT-3-like models. In particular, I think if the relevant papers weren’t published, it would’ve been 6 months (90% CI: 1 to 18 months) before any other actor would’ve discovered either a model with GPT-3’s capabilities or the scaling laws it was based on.
It’s plausible that the publication of the GPT-3 and scaling laws papers was unnecessarily early in terms of beating other actors to the punch, but I don’t have enough evidence to be confident in that claim. Regardless, I think that developers should have more scrutiny about whether they are really in a race to publish, because the harm of accelerating AI capabilities could outweigh the benefit of publishing first with a more responsible strategy (in order to establish better publication norms).
Access to compute appears to have been the main factor hindering the diffusion of GPT-3-like models. The next biggest hindering factors appear to have been acquiring the necessary machine learning and engineering expertise. (more)
I estimate that compute makes up 87% (90% CI: 64% to 98%) of the combined compute cost and salary cost of GPT-3-like model development.[2]
The largest accelerating factors in the cases I studied (i.e., factors that aren’t necessary for developing GPT-3-like models but that seemed to make development easier or more likely) are, in order of apparent importance, (1) publicity about GPT-3’s capabilities, (2) the sponsorship of compute resources, and (3) the release of open-source tools for large-scale model training. (more)
My best guess is that the release of GPT-3 sped up both DeepMind and Google’s work on language model scaling by six months (90% CI: 1–18 months). My guess is based on (1) how unexpected GPT-3 was in terms of training compute, (2) the hype surrounding GPT-3 following its publication, and (3) comments from three people who have done research on large language models.[3] (more)
So far in the West, the sponsorship of compute has only helped academic groups, independent groups, and smaller AI companies catch up to where leading AI labs were in 2020. This can allow them to use models closer to the cutting edge than they’d otherwise have, to do research on such models, and to increase the number of people with access to these models (e.g., as happened with BLOOM open-sourcing its weights). In the future, sponsorship could give an academic institution or AI startup most of the resources they need to play a significantly larger role in AI development and diffusion. It is also an easy way for governments to play such a role via offering resources, even if they lack key talent in-house. This seems true most of all in China, where there are strong ties between the government and the AI research community.[4] (more)
Diffusion of closed-source GPT-3-like models has been accelerated by incremental progress in, and open publication of, artifacts that are relevant to a given model. Relevant artifacts include datasets, smaller models, specialized software tools, and the accumulation of published method details (e.g., parallelism strategies). I call this process a diffusion cascade—diffusion of model-relevant artifacts begets diffusion of the model itself. Diffusion cascades can be limited by minimizing the spread of model-relevant artifacts (rather than only avoiding publishing model weights or algorithmic insights). (more)
In addition to never publishing, delaying publication can be, and has been, successfully used to limit diffusion. (more)
I estimate that if, after GPT-3 was trained, the GPT-3 project team had done the necessary work to publish the GPT-3 paper (Brown et al., 2020) as soon as possible, it could have been ready to publish four months sooner than it was.
Similarly, I estimate that the Gopher paper (Rae et al., 2021) could have been ready to publish nine months sooner than it was (holding constant the time at which it was actually trained).
Both of these publication delays seemed to be (partly) motivated by a desire to delay wide access to a powerful model, until the potential harms of that model were better understood or more easily mitigated. I also think it’s likely that both of those delays significantly slowed diffusion of GPT-3-like models given (a) how much of an impact GPT-3 itself had, and (b) the additional insights about training large language models that the Gopher paper presented.[5]
The prevailing diffusion mechanism for GPT-3-like model weights was initially incremental research, then open publication
Up until the release of OPT-175B in May 2022, incremental research had been the prevailing diffusion mechanism for gaining direct access to the weights of a GPT-3-like model. After the release of OPT-175B, the prevailing mechanism has been the combination of replication and open publication. What follows is my reasoning and further thoughts on the mechanisms of diffusion of GPT-3-like models:
Incremental research can be seen as a variant of replication, where actors are probably capable of replication, but are more incentivised to surpass or otherwise differentiate themselves from the existing result.
While OPT-175B is an explicit replication attempt, all nine other GPT-3-like models I identified that came before it are cases of incremental research: for example, Gopher and PaLM are much larger models at 280 and 540 billion parameters respectively; Jurassic-1-Jumbo uses a different tokenizer to enhance the vocabulary of the model; Chinchilla uses a more compute-optimal training method.[6]
My guess is that the main reasons to do incremental research rather than replication are:
Improving performance on some of the same tasks that the original model performed, to make the new result more useful or impressive
Improving performance for different tasks or mediums, e.g., Chinese language rather than English language, to make the new result more useful or impressive
Making research results (seem) as novel as possible, so the work gets more attention and is more likely to be accepted to publication venues
Open publication of a GPT-3-like model required replication first, because the original developer (OpenAI) did not openly publish GPT-3’s model weights. This state of affairs persisted until the first (somewhat) open publication of weights, with OPT-175B in May 2022. My understanding from the request form is that OPT-175B is available to AI researchers with at least one publication that is at least broadly relevant to OPT-175B. So I expect that the number of people with access to OPT-175B is now greater than the number of people who have worked on producing a GPT-3-like model from scratch.[7] Open publication is therefore now the prevailing mechanism of diffusion for GPT-3-like models.
I am not aware of any cases of a leak, e.g., someone being granted access to a closed-source model and then publishing that model themselves without permission. This is based on not having come across such a case in the course of my research.
I am not aware of any cases of theft or model stealing attacks on a GPT-3-like model. This is based on:
Not having come across such a case in the course of my research
Nova DasSarma (who works on security at Anthropic) not recalling any cases of ML model theft offhand during an interview (with the caveat that a fully successful case of theft would go undetected, so we can’t know for sure)[8]
Jeffrey Ladish—who works on security for Anthropic—also not thinking of any real-world cases of ML model theft in conversation[9]
The extent that multiple discovery was involved in the diffusion of GPT-3-like models is more uncertain than the mechanisms above. However, after accounting for the evidence detailed below, I believe multiple discovery was not significantly involved in the diffusion of GPT-3-like models. In particular, I think if the relevant papers weren’t published, it would have taken six months (90% CI: 1 to 18 months) before any other actor would have discovered a model with the approximate capabilities of GPT-3.[10]
Below are reasons to think multiple discovery was involved in at least one case of GPT-3-like model diffusion:
Arguably the key insight behind GPT-3 was Scaling Laws for Neural Language Models (Kaplan et al., 2020)—the scaling laws implied that more-or-less direct scaling of compute, data, and parameter count from GPT-2 to GPT-3 would predictably achieve a lower loss, which is correlated with better performance on downstream tasks. The scaling laws paper was published to arxiv.org on January 23, 2020, four months before GPT-3 was publicized in May 2020. This plausibly allows just enough time for an actor that has already developed GPT-2-scale models to notice this insight and scale up to GPT-3 (e.g., two months to prepare, one month to train, and one month to evaluate and publish).[11]
It seems that predecessors to Gopher (a GPT-3-like model from DeepMind) were already being developed before GPT-3 was publicized.[12]
There is some evidence that people at OpenAI were worried about other actors developing a GPT-3-like model first, though it’s unclear to me how justified the concern was.[13]
Reasons to think multiple discovery was not involved in any cases of GPT-3-like model diffusion:
Geoffrey Irving (last author of Rae et al., 2021) told me that GPT-3 “did add an organizational push” for DeepMind to scale up language models to Gopher’s scale (which was 280 billion parameters and 5.8E+23 FLOPs of compute). This suggests that DeepMind would have produced a GPT-3-like model later, and certainly not earlier, if GPT-3 had not been published.
To my knowledge, nobody publicized a GPT-3-like model (according to my definition) until HyperClova, one year after GPT-3 was publicized (Naver, 2021). After that, many more GPT-3-like models were publicized. Based on my estimates of “Time from project start to final trained model”, one year is more than enough time to develop a GPT-3-like model. This suggests that these projects probably started after GPT-3 was publicized.
I estimate that GPT-3 arrived 11 months (90% CI: 5 to 17 months) earlier than expected, mostly based on trends in the amount of training compute used for ML systems at the time immediately before GPT-3 was publicized (see this appendix).[14]
Iulia Turc, a former software engineer at Google who worked on research using large language models, told me: “Somebody else would have inevitably reached the same scale [as GPT-3], but I really can’t make an educated guess about when. Research labs like Google clearly had the resources to do it even before OpenAI, but it’s unclear to me whether it would have been a priority.”[15] To me this suggests that Google had not already produced a model of the same or larger scale when GPT-3 was published, but on the other hand, there were actors (including Google) with the ability to do so. So the evidence from this quote seems roughly neutral on the question of multiple discovery overall, but informative nonetheless.
It’s possible that the publication of GPT-3 caused another developer to withhold publication of their own similar result, even though they were planning to publish at almost the same time—say, within two months. The reason the other developer might do this is to avoid losing recognition for their result, because it’s too similar to GPT-3. Instead, it might be better for them to do some further research and then publish a more novel result that gains more recognition. Doing this further research seems more likely than giving up on the project entirely, due to the sunk cost of training such a large model. However, I think it would most likely take less than six additional months for this further research to happen, and in fact no relevant publications came out within six months. So it seems less than 50% likely that this scenario happened.
My specific quantitative claim—that it would have taken 6 months (90% CI: 1 to 18 months) before any other actor would have discovered a model with the approximate capabilities of GPT-3—is based on (and the same as) an estimate I make in a later section about the impact of GPT-3’s publication.
It’s plausible that the publication of the GPT-3 and scaling laws papers was unnecessarily early in terms of beating other actors to the punch, but I don’t have enough evidence to be confident in that claim. Regardless, I think that developers should have more scrutiny about whether they are really in a race to publish, because the harm of accelerating AI capabilities could outweigh the benefit of publishing first with a more responsible strategy (in order to establish better publication norms).
Caveat: I expect the above conclusions to change when it comes to the diffusion of future state-of-the-art language models, due to:
More closed publication practices: in the next post of this sequence I’ll argue that publication decisions by top language model developers will become more closed on average than they are now. I think this will make incremental research relatively more prevalent compared to replication and open-sourcing.
Greater incentive for theft: while I’m not aware of any cases of model theft so far, I expect that the incentive for theft will increase as the capabilities of models improve and state-of-the-art models continue to be closed-source. Improved capabilities will increase the payoff of theft. And I expect that, although leading AI labs will take measures to improve security around their models, there will be a point (if we are not there already) where the cost of attempting to steal the model may be lower than attempting to replicate the model—at least for the most capable hackers, among which are state actors.[16] These claims are uncertain, and I think at least one month of further research on risks from theft would be worthwhile for someone in the AGI governance community to do.
In contrast to the above changes, I expect the diffusion of models with similar performance to GPT-3 (rather than greater performance) will accelerate in the future.
Hardware costs will fall and algorithmic efficiency will continue to improve, enabling more and more actors to develop these models. I also expect there will be diffusion of better open-source tools that make it easier to train and run these models (similar to Megatron-LM). Presumably, many actors will then openly publish their models for the usual reasons, enabling even more actors to acquire and use the models.
I also expect that the incentive for replication will decrease in the future as different GPT-3-like models are trained to address various use cases and languages, and those models will also get open-sourced for the usual reasons.
Most important factors for GPT-3-like model diffusion
Below I discuss the most important factors for diffusion that I identified in the course of my research and that fell within my scope. Note that these are the factors that made developing GPT-3-like models easier or more likely by the largest margin in various cases.[17] I don’t consider the core resources for developing GPT-3-like models as “factors” themselves—those resources (mainly compute and talent) are discussed in the previous post. Overall, I’m 80% confident that each of these factors is important enough to justify a longtermist researcher spending at least one month full-time thinking about how to beneficially affect it.[18]
Difficulty accessing enough compute (hindering factor)
I think that the difficulty of accessing enough compute has been the largest hindering factor to the diffusion of GPT-3-like models. This was the case up until the release of OPT-175B in May 2022, after which GPT-3-like models became much more accessible.[19] My claim is based on the following evidence:
The actors that have succeeded in producing GPT-3-like models have all needed on the order of $1–10 million available to spend on compute.[20] This cost is much larger than the cost of labor or acquiring training data, according to my estimates (shown below). Furthermore, no major algorithmic insights needed to be figured out once the GPT-3 paper was published.
If we compare compute with the cost for talent—measured just in terms of the total salary of the project team for the duration of the project—compute seems to be a much larger hindering factor in this domain.[21] This Guesstimate model compares labor cost to compute cost for the average project to develop a GPT-3-like model; it suggests that the total compute cost is 16x (90% CI: 3x to 81x) higher than the labor cost (see the rough cost sketch at the end of this list). However, this model does not account for the difficulty of acquiring specific talent in large language model training. I explore the barriers to acquiring talent (with no clear conclusion) in this appendix. Talent and compute cost are also partially exchangeable, as I discuss in this section.
In this appendix, I estimate that the cost of producing the unprocessed GPT-3 training dataset (including human labor) is one to two orders of magnitude lower than the compute cost for the final training run of GPT-3. Based on this, I am 90% confident that, for all other GPT-3-like models I investigated, producing or acquiring the dataset cost at least one order-of-magnitude less than the compute cost for training that model, given that all of these models seemed to use similar raw data or similar data-collection processes to GPT-3.
EleutherAI has so far failed to replicate GPT-3 because of limited access to GPUs (both due to CoreWeave’s budget, and chip supply shortages).
The PanGu-alpha model apparently failed to reach its full potential (given the parameter count of 200 billion and the dataset size of 1.1TB[22]) due to being undertrained—i.e., not enough compute was spent to train the model on an adequate number of tokens. I think this is most likely due to one or more of the following possibilities: (a) the authors ran out of time to complete the project, (b) the authors did not have the financial budget to train further, and/or (c) there was a technical problem during training that the authors did not know how to fix (before a deadline). I don’t have further evidence to distinguish these possibilities, but I put roughly equal weight on them, which means that (b) is significantly likely.
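To illustrate the kind of comparison behind the compute-versus-salary numbers above, here is a rough back-of-the-envelope sketch in Python. All of the inputs are my own illustrative assumptions (an effective compute price of about $1.5e-17 per FLOP, a 15-person core team, a $200,000 fully loaded annual salary, and a half-year project), not the inputs of the Guesstimate model; the point is just that plausible inputs land within the wide confidence intervals above:

```python
# Back-of-the-envelope comparison of compute cost vs. salary cost for a
# GPT-3-scale training project. All inputs are illustrative assumptions.
compute_flop = 3.1e23             # roughly GPT-3 175B's final training run
dollars_per_flop = 1.5e-17        # assumed effective price of cloud compute
compute_cost = compute_flop * dollars_per_flop            # ~$4.7M

team_size = 15                    # assumed core team size
salary_per_year = 200_000         # assumed fully loaded salary (USD/year)
project_years = 0.5               # assumed project duration
labor_cost = team_size * salary_per_year * project_years  # ~$1.5M

print(f"compute / labor ratio: {compute_cost / labor_cost:.1f}x")
print(f"compute share of total: {compute_cost / (compute_cost + labor_cost):.0%}")
```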
Difficulty acquiring the necessary machine learning and engineering expertise to execute the project (hindering factor)
I think that the difficulty of acquiring the necessary machine learning and engineering expertise was the second largest hindering factor to the diffusion of GPT-3-like models. To clarify, this claim is specifically about having the expertise to overcome the challenges of training large language models. This claim is not about the expertise to independently discover algorithmic insights, though I believe that is a lesser hindering factor. The claim is based on the following evidence:
Several experts I consulted with emphasized the importance of machine learning expertise, and in particular, engineering expertise, in developing large language models.
Iulia Turc, former Software Engineer at Google Research who worked with large language models such as BERT: “[I]ndustry labs have the upper hand [compared to universities] of good engineering talent. Since designing a model is very different from scaling it (i.e., training it in a distributed manner over a fleet of machines), it’s very important that scientists and engineers come together…”
A researcher training large language models at an AI safety lab:
“In academia, I think the main bottleneck [for replicating a large language model like GPT-3] is the ability to hire engineers to build out a codebase. The distributed training complexity is a rather different area of expertise than a Ph.D. student and doesn’t fit into the incentive structure very cleanly.”
“[To scale up machine learning at today’s cutting edge] I’d say you need all of: (1) Enough compute to train the model a number of times since the first try will probably not work, (2) Experts on distributed training of LMs [...] (3) Experts on ML. This doesn’t require as much creativity as most people might believe, but you do need to be able to understand what’s going on and debug your training process.”
Only AI labs with teams of 10 or more people have succeeded at producing GPT-3-like models.
BLOOM is a possible exception, depending on how one counts “core contributors”. The majority of the development contribution may not have been from HuggingFace employees, but rather from various academic and independent collaborators in the BigScience collective. I think the BLOOM project succeeded, talent-wise, through the crowd-sourcing of talent. The project was very open to contributors, and the best contributors naturally came to the fore. Also, by the time people worked on BLOOM, there was more accumulated knowledge from research e.g., BigScience’s previous work on the T0-XXL model (Sanh et al., 2021), and open-source tools like Megatron-LM.
Another potential counterpoint: although the GPT-NeoX team didn’t succeed at creating a GPT-3-like model, according to the lead contributor Sid Black (in personal correspondence[23]), there was around a 40–50% chance that the team had the requisite talent and was held back only by a lack of access to enough compute.[24] I would guess Black’s claim is overconfident, given that the team never got to actually attempt GPT-3 replication with enough compute—if they had, I expect they would have encountered unforeseen challenges that stretched the duration of the project significantly. But the claim that “the GPT-NeoX team would have succeeded at creating a GPT-3-like model by February 2022 if they had access to enough compute from the beginning of the project” seems more than 20% likely to me.
No government labs were directly involved in the cases I studied. I would guess this is partly because they don’t have the relevant talent and would have a hard time acquiring it (e.g., because AI industry labs offer higher prestige, salaries, research freedom, and less bureaucracy). Governments were only involved via funding, in the BLOOM and PanGu-alpha cases.[25]
As an aside, I take the lack of direct government involvement as evidence that governments are generally more willing to fund teams that already have the requisite talent than to acquire the requisite talent directly.
Sponsorship of compute resources by separate parties (accelerating factor)
So far, I think the most important factor for lower-resourced actors to approach GPT-3-like capabilities has been the sponsorship of compute by separate parties. This accelerating factor is the flip side of challenges of acquiring compute as a hindering factor—sponsorship allows these actors to leap over the obstacle of acquiring compute.
The first key example is that CoreWeave provided compute to EleutherAI for free to develop and train GPT-NeoX-20B. According to Sid Black, one of the main contributors to developing GPT-NeoX-20B, EleutherAI spent nothing out of pocket on compute for the GPT-NeoX project. Prior to this, EleutherAI was using a TensorFlow Research Cloud (TFRC) scheme that provided free access to TPUs, but this was not sufficient to train GPT-3.[26] The incentive for CoreWeave was to have their hardware tested as they were starting up their cloud computing operation, and to gain insight on what is required to use their hardware for training large language models.[27] The incentive for TFRC prior to this seemed to be testing their TPU hardware and advertising the advantages of that hardware.[28]
The second key example of compute sponsorship from my case studies is that BigScience was provided €3M from French research agencies CNRS and GENCI to train the BLOOM model on the Jean Zay supercomputer (BigScience, 2022).
Sponsorship can enable actors to use models closer to the cutting edge than they’d otherwise have, to do research on such models, and to increase the number of people with access to these models (e.g., as happened with BLOOM open-sourcing its weights). But does the sponsorship of resources like compute ultimately matter for who develops transformative AI (TAI)? I think the sponsorship of resources is less likely to matter than diffusion among AI developers who can already afford to pay for the resources themselves, because the actors receiving sponsorship will tend to be lower-resourced to begin with, and therefore less likely to keep up with or surpass the state of the art. However, I think sponsorship is a factor worth bearing in mind when thinking about which actors could plausibly become contenders to develop TAI in the future, and when thinking about how to beneficially shape diffusion.[29]
To see this, consider that the sponsorship of compute could give smaller actors the necessary momentum to become more significant actors. As with the BigScience case, there could also be a big role for governments and associated funding agencies to play in sponsoring massive amounts of resources for AI developers. This is already the case in China. The Beijing Academy of Artificial Intelligence, Zhejiang Lab, and Peng Cheng Lab are Chinese government-sponsored entities that have provided support for funding and compute to recent AI research projects in China (Ding & Xiao, forthcoming). For instance, Peng Cheng Lab was involved in PanGu-alpha.
Open-source tooling for large-scale model training (accelerating factor)
Open-source tools that are specifically designed for large-scale model training were a notable accelerating factor in the cases I studied. There are two things to clarify about this:
If these tools were proprietary (but available to the public as commercial software), I don’t think the cost of the tools would be prohibitive. But the open-source nature of the tools is still important, because open-source tools are easier to use in the ML domain. Based on my own experience with ML code development, it’s important to be able to integrate open-source code with other code, and often to customize the code extensively, in order to suit a given machine learning project.
I am not referring to tools that are as essential to ML as PyTorch. Tools like PyTorch provide a foundation for any modern ML project, having become ubiquitous in ML research and development. Rather, I am referring to newer, more specific tools such as Megatron-LM. Megatron-LM makes it easier to train large-scale models that use the Transformer architecture (which all the GPT-3-like models in the diffusion database do).
The Megatron-LM codebase was first published in September 2019. It started as the code implementing NVIDIA’s 8-billion-parameter language model, Megatron, which was introduced in Shoeybi et al. (2019).[30] Megatron was heavily based on the 1.5-billion-parameter GPT-2, the predecessor of GPT-3.[31] The Megatron-LM codebase was later used in Narayanan et al. (2021),[32] which, as the title suggests, offers useful insights on efficient large-scale language model training.
Shevlane (2022) claims that the Megatron code release “made it very easy for anyone to train GPT-2-like models if they had access to enough GPUs; Aaron [a Brown University graduate student who replicated GPT-2] told [the author] that with the Megatron code and enough money, a high school student could do it.”[33] By the same logic, I make a similar claim for the current Megatron-LM codebase (after the “efficient large-scale training” paper was published) with respect to GPT-3. The Megatron-LM codebase has formed a significant part of the overall code base for OPT-175B, Jurassic-1-Jumbo, GPT-NeoX-20B, BLOOM, and Megatron-Turing NLG—though the latter is not really relevant to diffusion, since NVIDIA was directly involved.[34] The fact that Meta AI and AI21 Labs both used Megatron-LM code suggests that they benefit from open-source tools released by other actors. So the benefit is not limited just to small actors that tend to have less engineering talent, such as academic labs or independent collectives.
It’s difficult to quantify how much the Megatron-LM code helps, and it certainly does not remove most of the compute cost. The code merely helps with implementation. But given the prevalence of the Megatron-LM code in my case studies, I expect that it significantly reduces the talent barrier to start a GPT-3-like model development project. It probably also saves time and money by improving efficiency. Sid Black of EleutherAI told me that Megatron-LM and another tool called DeepSpeed were frustrating and time-consuming to use and extend. Despite that, he said that Megatron-LM is “really fast” and he was glad to have these tools available when developing GPT-NeoX-20B.
A similar tool which is often used alongside Megatron-LM is Microsoft’s DeepSpeed. According to the GitHub repo, “DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.” DeepSpeed, or a “forked” version of it on GitHub, was used in all the case studies where Megatron-LM was used except OPT-175B (as far as I could tell).
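For a sense of what “making distributed training easy” looks like in practice, here is a minimal sketch of how a library like DeepSpeed wraps an ordinary PyTorch model. The toy model and configuration values are my own assumptions for illustration; in real use the script would be launched with DeepSpeed’s multi-GPU launcher, and the heavy lifting (data parallelism, optimizer sharding, mixed precision) happens inside the returned engine:

```python
import torch.nn as nn
import deepspeed

# A toy stand-in for a large Transformer; the point is how little changes.
model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},                       # mixed-precision training
    "zero_optimization": {"stage": 1},               # shard optimizer states
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# Returns an engine that handles the distributed training details; the usual
# loop then calls engine(batch), engine.backward(loss), engine.step().
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```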
Similar specialized open-source software is used by other AI developers. In the Chinese sphere, there is MindSpore, which was used to train PanGu-alpha. Google’s PaLM used T5X and JAX, while DeepMind’s Gopher and Chinchilla used JAX and Haiku—though these are less specialized for language model training than Megatron-LM is.
Publicity that draws attention to an existing model’s capabilities (accelerating factor)
Although it is difficult to measure and track the effects of the hype surrounding an AI research result, I believe that hype is an important accelerating factor in the diffusion of GPT-3-like models, and will probably play a key role in the diffusion of future state-of-the-art machine learning models. What I mean by hype is a combination of (a) the amount of attention that something gets, and (b) the belief that the thing is promising in some way, e.g., that it’s worth replicating, or that it reveals a research direction worth pursuing. My point about the importance of hype here is related to my previous takeaway about the importance of attention to information.
First of all, GPT-3 was surprising in some sense. I estimate that GPT-3 was published 11 months earlier than expected based on training compute trends at the time (90% CI: 5 to 17 months).[35] Second, the insight which GPT-3 demonstrated was significant. Shevlane (2022, pp. 15–16) explains this point: “The idea [of the release strategy of GPT-2 and GPT-3] was that the models themselves were the hardest thing for bad actors to recreate, given the high compute costs required to produce the models. This was assuming that the papers, in contrast, did not contain truly novel insights. However, this focus on models has been questioned, with some risk-conscious AI researchers arguing that the GPT-3 paper was actually the risky thing. The paper, alongside other papers that OpenAI published in 2020, demonstrated to many onlookers the benefits of scale: if you throw a large amount of compute and data at a model with a very high number of parameters, you can get very impressive capabilities. Some people viewed this as dangerous in that it accelerates the field’s progress towards advanced AI, thus giving the world less time to prepare” (my emphasis).
A massive increase in hype around GPT-3 occurred not when the GPT-3 paper (Brown et al., 2020) was first published, but after people started demonstrating capabilities with the OpenAI API on Twitter.
The paper was deliberately written to be boring, and was published without the blog post that normally accompanies milestone results from OpenAI.[36]
As pointed out in Shevlane (2022), the Google search trend for GPT-3 in 2020 indicates that interest in GPT-3 only rose to a significant level about seven to eight weeks after the paper was published on May 28, 2020. Relative search interest sat around 1–2% between May 28 and July 11, then jumped to 36% after July 11, and peaked at 100% between July 19 and 25.[37] This trend correlated with Twitter activity involving GPT-3. Shevlane (2022) writes: “I downloaded around 63,000 tweets mentioning ‘GPT-3’ from Twitter’s API, from the period 12th-22nd July 2020. The number of tweets mentioning GPT-3 climbed from close to zero at the start of this period to a spike of about 900 (per 3 hour interval) around July 20th. [...] The tweets I found with the most engagement (in terms of retweets and likes) were early users of GPT-3 who were demonstrating GPT-3’s ability to write functioning software code. This was a much more accessible demonstration of GPT-3’s capabilities than the paper had given.”[38]
I’m very uncertain whether this hype strongly influenced the subsequent R&D decisions of specific leading AI developers. My best guess is that the knowledge of GPT-3’s existence sped up both DeepMind and Google’s work scaling up language models by six months (90% CI: 1–18 months). But I have not been able to distinguish whether this acceleration was driven by insider knowledge, or the publication of GPT-3, or the hype generated after publication, or some combination of those factors. In addition to the surprisingness and hype of GPT-3 argued above, I have the following evidence for this claim:
A researcher who has trained large language models at an AI safety lab told me: “I think GPT-3 probably pushed other labs in this direction about a year earlier than they otherwise would have. It’s a bit hard to know for sure. There were certainly other groups training larger and larger LMs each few months and they were doing better and better, but it wasn’t obviously clear to everyone that scale was the main ingredient there.” (Note that this claim of “a year earlier” had a small weighting in my estimate of when the equivalent of GPT-3 was expected to be published, stated in a point below.)
Geoffrey Irving (last author of Rae et al., 2021) told me that “GPT-3 did add an organizational push” for DeepMind to scale up their language models.[39]
I also have one piece of countering evidence, but I don’t think this outweighs the favoring evidence. I asked Iulia Turc—a former Software Engineer at Google Research who worked with language models such as BERT: “Do you think that GPT-3’s increased model size, and the resulting improvements in task performance, generality, and reduced need for fine-tuning, was surprising to researchers at the cutting edge of natural language processing?” Turc responded: “I don’t think it was surprising, I think it was impressive from an engineering point of view.”
I estimate that GPT-3 arrived 11 months (90% CI: 5 to 17 months) earlier than expected, mostly based on trends in the amount of training compute used for ML systems at the time immediately before GPT-3 was publicized (see this appendix).
I used the estimate of “when a GPT-3 equivalent was expected” above as a strong prior for “how much GPT-3 sped up DeepMind and Google’s work scaling up language models”. But after intuitively accounting for the evidence in the above quotes from experts, I made the following updates to reach my final estimate of six months (90% CI: 1 to 18 months):
The median estimate of the speed-up should be earlier, because (a) Iulia Turc didn’t think GPT-3 was very surprising in terms of scale or performance, (b) the estimate of “when a GPT-3 equivalent was expected” doesn’t fully account for the growing interest in pretrained large language models among top AI developers since around 2018 (when OpenAI’s original GPT (Radford and Narasimhan, 2018) and Google’s BERT (Devlin et al., 2018) were published).
The confidence interval should be wider, given that I have almost no knowledge of what DeepMind and Google’s plans around language model scaling actually were around the time that GPT-3 was published.
Diffusion cascades: the publication of progress accelerates the diffusion of the final product
Here I introduce the concept of a diffusion cascade: the acceleration of diffusion that results from diffusion of artifacts that are relevant to producing a given closed-source model. The concept of a diffusion cascade applies when initially there is a given closed-source model that is only accessible to one actor, and no other actor fully understands how to produce that model and/or has all the resources needed to produce that model.[40] The incremental progress and open sourcing made by other actors in the meantime fills in the gaps in knowledge and resources, and thereby accelerates diffusion. Even if the latest capability advance is only reachable by leading AI developers initially, those leading developers can make diffusion to other actors happen more easily and sooner than otherwise.
Tools, datasets, smaller models, and the accumulation of published details speed up the cascade
Below I list some specific drivers of diffusion cascades, and empirical examples of those drivers being involved in diffusion cascades. I also indicate the current relative importance of each driver on a subjective 1–5 scale (5 is most important), where my judgment is based on a combination of independent reasoning and the empirical examples. Importance means how much this driver has accelerated diffusion empirically.[41]
Open-source software tools. (Importance: 5) While there are long-standing open-source tools for machine learning such as PyTorch, more specific open-source tools specialized for large language model training can emerge, which embed a lot of knowledge of how to train large language models. Megatron-LM and DeepSpeed are open-source tools for training large language models, and were used extensively to train GPT-NeoX-20B, OPT 175B, Jurassic-1-Jumbo, and BLOOM. Sid Black told me that while he had qualms with Megatron-LM and DeepSpeed (namely, they were frustrating and time-consuming to use and extend), Megatron-LM is “really fast” and he was glad to have these tools available when developing GPT-NeoX-20B.
Accumulation of insights and implementation details from different research articles. (Importance: 4) Even if there is a long series of closed-source language models developed by different actors, the current tendency is for many of those actors to publish research articles with information about their methods (more on this in the post on publication norms and release strategies). Due to the various independent decisions about what information is included in these research articles, more and more information on how to reproduce a given model can gradually be accumulated.
Example: Narayanan et al. (2021). The paper accompanying the release of the Megatron-LM tool includes information on different types of parallelism methods and how they can be composed to scale to “thousands of GPUs and models with trillions of parameters,” and “intuition as to how to configure distributed training of a large model.” This paper does not itself present new models, it just provides insight on how to scale and train them efficiently.
Open-source smaller models. (Importance: 3) Many pretrained language models that are smaller but similar in design to GPT-3 are open-source—for example, GPT-2 and the OPT family (except OPT-175B, which isn’t smaller than GPT-3). Having these models (and the code to instantiate them) available makes their precise implementation clearly and completely known, beyond just specifying the model architecture and its hyperparameters in a research paper (see the short sketch at the end of this list). However, if the smaller model falls significantly short of the full model in performance, the full model normally needs to be trained from scratch,[42] so my impression is that having smaller models available does not necessarily reduce the challenge of scaling up. Empirically, the publication of smaller models is only of moderate importance, because the current norm is to publish model architecture details in research papers (including for the larger models, even when the model weights aren’t published), and that saves most of the work of figuring out how to implement a model.[43]
Open-source datasets. (Importance: 3) For example, The Pile was used to train GPT-NeoX-20B and (partially) OPT 175B (Gao et al., 2020). Although such datasets for language models usually just consist of text data scraped from public internet sources, scraping the data and storing it in an appropriate format is a significant effort.
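As a concrete illustration of the “open-source smaller models” point above: with the Hugging Face transformers library (an assumption on my part; none of the labs discussed necessarily worked this way), the published GPT-2 architecture and weights can be instantiated in a few lines, and the same architecture definition can be reused at a larger scale—though, as noted above, the larger model would still need to be trained from scratch:

```python
from transformers import GPT2Config, GPT2LMHeadModel

# The released 124M-parameter GPT-2, with its trained weights, in one call.
trained_small = GPT2LMHeadModel.from_pretrained("gpt2")

# The same architecture definition reused at a larger (untrained) scale;
# the hyperparameters below roughly match GPT-2 Large and are illustrative.
larger_config = GPT2Config(n_layer=36, n_embd=1280, n_head=20)
untrained_larger = GPT2LMHeadModel(larger_config)
```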
Coordinating on greater secrecy, even just delayed publication, can slow down diffusion
The obvious way to slow down a diffusion cascade, and diffusion in general, is to have greater secrecy. In the absence of coordination, the best that one actor can do on this front is to try to keep knowledge of a project or model completely secret, not even revealing the model’s existence.
My impression is that it is not uncommon to keep models secret temporarily (i.e., delaying publication past the minimum time needed to produce a publication).
For example, the GPT-3 175B model was not announced for “months” after it was trained, and this seemed partly motivated by a desire to delay progress toward artificial general intelligence.[44] My low-confidence best guess is that the paper was published seven months after training finished and could have been ready to publish four months sooner than it was if the work towards publishing the paper was done as soon as possible.[45]
The publication of Gopher was delayed even longer than my estimate for GPT-3. Based on the Gopher model card, the paper was published 12 months after the model finished training.[46] So by similar logic, I think the Gopher paper could have been published nine months sooner than it was. I speculate that the delay in publication about Gopher was for the same reason as not releasing the training code, dataset, and model weights for Gopher. Geoffrey Irving told me that the reason for the latter was to “[reduce] diffusion of objects that can cause harm if not aligned further.”
A staff member at an industry AI lab, who has worked with large language models, told me off-hand that publication of Google’s PaLM model was probably delayed by a couple of months, but this is weaker evidence and I did not find out the rationale for the delay.
One thing to note here is that while a model may remain secret to the general public until it is published, I suspect that information does sometimes leak, especially among peers in AI development at different labs.[47] Rumors can also circulate, even to the public, though it’s unclear when this is intentional and when it is unintentional. For example, Hao (2020) seems to refer to the text-to-image model DALL-E (or similar preliminary work) 11 months before DALL-E was announced (Ramesh et al., 2021).[48]
Besides just delaying publication, actors could limit diffusion cascades (if that is their goal) through more comprehensive secrecy around information and resources—even if the existence of the model and research results about the model are publicized. Given the various information sources and artifacts that can drive a diffusion cascade, it would be more effective to keep secure not just the model, but also, e.g., the specialized software tools that were used to train the model, the datasets, and the details of training infrastructure and parallelism strategies. For example, the developers of GPT-3 did not explain or open-source the software tooling that was used to train the GPT-3 model. This seems to have left a gap that Narayanan et al. (2021) had to spend time filling (i.e., with the Megatron-LM codebase).
Appendix: GPT-3 came 5–17 months earlier than expected, due to OpenAI’s willingness to spend on the compute and to solve the engineering challenges
I used 3 methods to estimate when experts would have expected GPT-3 (or the rough equivalent) to be released, immediately before GPT-3 was actually publicized. Estimating this provides evidence about the extent that multiple discovery was involved in the diffusion of GPT-3-like models, and about the counterfactual impact of publicizing GPT-3. The estimates are detailed in the following subsections.
Expected timing based on the average training compute trend
First I analyze how unexpected GPT-3 was in terms of the average trend in training compute for models over time. My analysis is based on this interactive plot of compute trends by Epoch. Below are the initial steps I took and the results I obtained from different plots:
Initialization
Start with the default settings, as in this link
Click the three-bar menu in the top right of the plot to open the settings
Check “Separate by category” so that the Language domain data has its own trends
Uncheck “Split trendlines in Large Scale Era”
Set “Large scale” to “ignore” so the red “Large Scale” trend disappears
Set the x-axis maximum to just before April 2020 using the slider at the bottom, such that all language models up until GPT-3 175B are included, but GPT-3 175B itself is excluded.
At the time of writing, there is a typo in the database used for this data which sets the publication date of GPT-3 to April 28, 2020 rather than May 28, 2020. I don’t think this affects my conclusions significantly.
You may have to zoom into the plot with the scroll wheel to verify this.
Alternatively, set the “endDate=” part of the page URL to an exact value, e.g. “endDate=2020-3-31”
The link to plot with the above changes is here
The resulting Language domain trend in the Deep Learning era is 0.8 OOMs/year
Using a straight edge to visually extrapolate the Language trend, I find that the trend predicts 3E+23 FLOPs of compute would be reached by about October 2021—17 months after the actual publication date of GPT-3 in May 2020.
Weight on this estimate: 0.4. Higher than average because I think the domain-specific trend is more reliable. The greater number of samples from the full Deep Learning Era also makes it more reliable.
Now check “Split trendlines in Large Scale Era”. The “Large Scale Era” Language trend should now be 1.1 OOM/year. Link to these settings is here.
Prediction (using the same extrapolation method as above): about February 2021, nine months after actual
Weight: 0.2. This is a more “inside view” trend which I think is plausible. It takes better account of the large scale models that were released more recently. But the sample is slightly smaller so the prediction is not as reliable.
Now uncheck “Separate by category” (link)
Use the one trend in the “Large Scale Era”—0.4 OOMs/year
Prediction: October 2026, roughly 75 months after actual
Weight: 0.1. The data looks very noisy, spans a short time period, and doesn’t account for domain-specific trends. But it is still an informative “outside view” estimate.
Now uncheck “Split trendlines in Large Scale Era” (link)
Use the one “Deep Learning Era” trend
Prediction: February 2023, which is 33 months after actual
Weight: 0.2. To me this is a stronger “outside view” prediction than the previous, because there are more samples.
Now set the “Large scale” dropdown setting to “label” and use the “Large Scale” trend of 0.3 OOMs/year (link)
Prediction: March 2022, which is 22 months after actual
Weight: 0.1. Small sample size, but still an informative estimate based on the belief that the “Large Scale” trend is more relevant.
Most outside view estimate: 75 months
Most inside view estimate: nine months
Unweighted average: (17 + 9 + 75 + 33 + 22) / 5 ~= 31 months
Weighted average: 0.4*17 + 0.2*9 + 0.1*75 + 0.2*33 + 0.1*22 ~= 25 months
Sample standard deviation of estimates: 26 months
Filtered standard deviation of estimates (i.e. excluding the 75 month estimate): 10 months
I used the weighted average as the central estimate, and the filtered standard deviation to get 90% confidence bounds. Thus my first estimate for the expected arrival time of GPT-3 is June 2022 (90% CI: August 2021 to April 2023). A major limitation of this estimate is that I am using a prediction of the average milestone system rather than a prediction of the most expensive system. Including the “Large Scale” trends in my aggregate prediction compensates for this somewhat (because the “Large Scale” data has the most expensive systems), but the above average predictions are probably still later than experts actually expected. Due to this limitation, I only put 30% weight on this estimate.
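For transparency, here is a short Python sketch that reproduces the arithmetic above: the weighted average of the five per-trend predictions, and rough 90% bounds taken as one filtered standard deviation either side of it (my reading of the procedure described above):

```python
import statistics

# Months after GPT-3's actual publication (May 2020) at which each trend
# predicts its training compute would be reached, with my weights.
predictions = [17, 9, 75, 33, 22]
weights = [0.4, 0.2, 0.1, 0.2, 0.1]

central = sum(w * p for w, p in zip(weights, predictions))      # ~25 months -> ~June 2022
spread = statistics.stdev([p for p in predictions if p != 75])  # ~10 months (75-month outlier excluded)

print(f"central estimate: {central:.0f} months after May 2020")
print(f"rough 90% bounds: {central - spread:.0f} to {central + spread:.0f} months")
```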
Expected timing based on the upper range of the compute trend
One way to improve on the first estimate is to look at when the trend predicts GPT-3’s training compute minus some amount of deviation, based on the variance in the data. Due to time constraints I have not computed a confidence interval for the trendline. However, visually inspecting the Language category data over the whole “Deep Learning Era” in this plot, we can see that data points roughly one order of magnitude above the trend line are common. For example, Meena (Jan 28, 2020) has 1.1E+23 FLOP while the trend is at about 1E+22 FLOP, and Seq2Seq LSTM (Sep 10, 2014) has 7.3E+18 FLOP while the trend is at about 4E+17 FLOP. The biggest outlier is GNMT (Sep 26, 2016) at 6.9E+21 FLOP when the trend is only at about 2E+19 FLOP; however, I think this is too large an outlier to have significantly shifted people’s best-guess expectations about when GPT-3’s amount of training compute would be used.
Based on this rough inspection, I will just look at when the trendline predicts one order of magnitude lower than the true value, i.e., when it predicts 3E+22 FLOP rather than 3E+23 FLOP. This appears to occur in late July 2020, only 2 months after GPT-3 was actually published.
Based on this, I chose 2 months as my central estimate for the time that GPT-3 was expected (in terms of training compute), relative to when it was actually published. Like the first estimate, I used the filtered standard deviation of 10 months to get confidence bounds. Thus my second estimate for the expected arrival time of GPT-3 is July 2020 (90% CI: December 2019 to May 2021). Although this estimate is less rigorous than the first estimate, I think it is closer to the quantity I’m actually trying to estimate, so I put 50% weight on it.
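To make the adjustment explicit: assuming the 0.8 OOMs/year Language trend used in the first estimate, moving one order of magnitude down the trendline corresponds to moving about 15 months earlier in time. A rough sketch of this arithmetic (my reconstruction of the step above, not necessarily how the plot's trendline tool computes it):

```python
# If the Language trend grows at ~0.8 OOMs/year and reaches 3E+23 FLOP around
# October 2021 (17 months after GPT-3's actual publication), then it reaches one
# order of magnitude less (3E+22 FLOP) about 1/0.8 years earlier.
oom_per_year = 0.8
months_earlier = 12 / oom_per_year          # 15 months before October 2021, i.e. ~July 2020
months_after_actual = 17 - months_earlier   # ~2 months after the actual May 2020 publication
print(months_earlier, months_after_actual)  # 15.0 2.0
```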
One expert opinion
Finally, I have some evidence about the expected timing of GPT-3 from one researcher who has trained large language models at an AI safety lab. They told me: “I think GPT-3 probably pushed other labs in this direction about a year earlier than they otherwise would have. It’s a bit hard to know for sure. There were certainly other groups training larger and larger LMs each few months and they were doing better and better, but it wasn’t obviously clear to everyone that scale was the main ingredient there.” This isn’t a direct claim about when GPT-3 was expected to arrive, but it suggests that if GPT-3 had been published one year later, that would have been more in line with the field’s expectations. As with the other estimates, I put a confidence interval of +/- 10 months around this 12-month estimate. So my third estimate is May 2021 (90% CI: July 2020–March 2022). Since this is based on an off-hand comment from one expert, I only put 20% weight on it.
Overall estimate: 11 months (90% CI: 5 to 17 months) sooner than expected
I put my three estimates together in a weighted average using this Guesstimate model and obtained an overall estimated delay of 11 months (90% CI: 5 to 17 months), or an estimated date of April 2021 (90% CI: October 2020 to October 2021). Note that the confidence interval does not account for the correlation between the confidence intervals of the individual estimates, nor for the correlation between the first and second estimates (which use the same data and trend), so it should probably be wider to reflect my true confidence.
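I have not reproduced the Guesstimate model here, but a minimal Monte Carlo sketch of one plausible structure (treating each estimate as a normal distribution whose 90% CI is the central value +/- 10 months, and taking a weighted sum) recovers roughly the same interval; the exact Guesstimate model may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
sd = 10 / 1.645   # a +/-10-month 90% CI corresponds to an SD of ~6.1 months under a normal assumption

centrals = [25.0, 2.0, 12.0]   # months after May 2020: trend average, trend upper range, expert opinion
weights  = [0.3, 0.5, 0.2]

# Weighted sum of the three estimate distributions
delay = sum(w * rng.normal(c, sd, n) for w, c in zip(weights, centrals))
print(np.percentile(delay, [5, 50, 95]))   # roughly [5, 11, 17] months
```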
What this overall estimate implies is that GPT-3 arrived significantly earlier than expected. I think the most likely reason for this is that OpenAI was simply willing and able to invest in a larger amount of compute. The “willing” part is probably the key factor in how OpenAI reached this amount of compute before the other labs that were leading language model development just prior to GPT-3’s release, especially Google.
Acknowledgements
This research is a project of Rethink Priorities. It was written by Ben Cottier. Thanks to Alexis Carlier, Amanda El-Dakhakhni, Ashwin Acharya, Ben Snodin, Bill Anderson-Samways, Erich Grunewald, Jack Clark, Jaime Sevilla, Jenny Xiao, Lennart Heim, Lewis Ho, Lucy Lim, Luke Muehlhauser, Markus Anderljung, Max Räuker, Micah Musser, Michael Aird, Miles Brundage, Oliver Guest, Onni Arne, Patrick Levermore, Peter Wildeford, Remco Zwetsloot, Renan Araújo, Shaun Ee, Tamay Besiroglu, and Toby Shevlane for helpful feedback. If you like our work, please consider subscribing to our newsletter. You can explore our completed public work here.
To be clear, only 7 of these 9 GPT-3-like models are in my 9 full case studies; 2 models in my case studies do not meet my definition of GPT-3-like.
Note that this is not a fair comparison with talent holistically. Talent can be the key bottleneck even when salaries are only a small fraction of project costs, due to the time and financial cost of producing enough people with the requisite skills. Further analysis of the holistic talent cost seems worthwhile in future work.
Sponsorship of compute resources could involve an actor doing any of the following things: (a) giving another actor ownership of compute hardware, (b) giving another actor access to compute hardware, (c) giving another actor money that can only be used on compute, or (d) giving another actor money with the intention that it is used for compute. Only cases (b) and (c) occurred in my case studies.
E.g., Beijing Academy of Artificial Intelligence (BAAI) and Peng Cheng Laboratory (PCL) were involved in the GLM-130B and ERNIE 3.0 Titan models respectively. See my survey of models covered previously for details.
I won’t make the effort to detail all these insights, but note that the Gopher paper (Rae et al., 2021) is titled “Scaling Language Models: Methods, Analysis & Insights from Training Gopher”.
I assessed which models are GPT-3-like in a previous post. The nine GPT-3-like models are Gopher, Hyperclova, Jurassic-1-Jumbo, Megatron-Turing NLG, LaMDA-PT, Yuan 1.0, ERNIE 3.0 Titan, Chinchilla, and PaLM.
In a previous post, I estimated that 1000 (90% CI: 200–3000) people could be eligible to access the model weights of OPT-175B, and all of these people could be granted access in the first year following release of OPT-175B. I don’t know what number of people are actually permitted to access OPT-175B so far (i.e., who’ve requested and been granted permission) and it’s very likely lower than the number of people that could be eligible, but as of November 2022 I think that number is more than 80% likely to be higher than 73, which is the total of “core team size” for the models that I estimated “core team size” for (see this cell of the diffusion database).
See Wiblin and Harris (2022): Rob Wiblin: “Are there any historical case studies of information leaks in ML? Are there any cases where an ML model has been stolen in the past?”. Nova DasSarma: “That’s a great question. I don’t think I can think of one offhand actually. If they have been stolen, then it’s one of those things where they’ve kept hush-hush about it.”
Paraphrasing from personal correspondence: Ben Cottier: “Do you know any examples of hackers accessing ML-related artifacts like datasets, trained models, etc.?” Jeffrey Ladish: “Ram Shankar Siva Kumar from AI Red Team at Microsoft—they used phishing to steal a model etc. That’s the only example I know of.” I found Field (2022) related to what Jeffrey Ladish was referring to. This isn’t a “real world case of ML model theft” in that it was a red-teaming exercise and didn’t actually result in diffusion to unauthorized parties.
This estimated delay is explained in the section on publicity.
I think doing this in four months would probably be feasible, based on my estimates of training wall-clock time and total project duration (i.e., time until having the trained model; this excludes time for writing and publishing a paper) in the diffusion database. The case with the most confident estimates is OPT-175B, with a total project duration of 78 days, including 33 days of training time. However, there were four months from OPT-175B completing training to the paper being published in May 2022. So my estimate of one month to evaluate the model and publish is probably too short.
Geoffrey Irving (Safety Researcher at DeepMind) told me that “[People who worked on Gopher] had already started LLM scaleup for the purpose of using them for communication and recursion-based alignment schemes soon after I joined [DeepMind, from OpenAI, in October 2019], but GPT-3 did add an organizational push.”
See Shevlane (2022). A senior member of OpenAI (who is specified on p.27 of the PDF) told the author: “GPT-3 existed for a long time before the paper came out. We delayed the paper. [...] But it’s months, it doesn’t really count. And you’re sitting there, fucking white-knuckling it, because it’s really costly if someone releases their paper, and you have fucked this up somehow. So you’re under pressure” (p.66 of the PDF).
This is just a rough estimate, and expecting a result to be published by a certain date does not guarantee that no other equivalent model would have been published otherwise. Nonetheless, it is evidence in the direction of “multiple discovery was not involved in any cases of GPT-3-like model diffusion”.
Full correspondence is available here upon request.
My thinking on this is generally informed by Ladish and Heim (2022).
I focus on development rather than access to GPT-3-like models here because I think development is more important. See a previous post for my reasoning on this.
In my case studies there is a close relationship between the factors for diffusion and the resources that drive capabilities (i.e., money, compute, data, and talent). I think this is due to replication and incremental research being the main mechanisms of diffusion for 2 years. The actors involved had to actually develop models independently in order for the models to diffuse, because there weren’t any open-source models for a while. But if the main diffusion mechanism happened to be espionage, then an accelerating factor might be the poor information security at an organization. So the factors for diffusion and the resources that drive capabilities can be quite separate.
This is because OPT-175B allows more people to get direct access to its model weights, and training a model to obtain its weights seems to be the most compute-intensive aspect of AI development and deployment.
See the “Training cost (2022 USD)” column of the diffusion database, noting which models are classified as GPT-3-like in the “GPT-3-like model?” column. Some GPT-3-like models in the database do not have cost estimates, but seem very likely to fall within the $1–10M cost range given their training compute (see the “Training compute (FLOPs)” column).
Note that this is not a fair comparison with talent holistically. Talent can be the key bottleneck even when salaries are only a small fraction of project costs, due to the time and financial cost of producing enough people with the requisite skills. Further analysis of the holistic talent cost seems worthwhile in future work.
See Abstract of Zeng et al. (2021)
My conversation notes with Sid Black are available upon request.
Black indicated this rough 40–50% confidence after seeing a draft of this text (which included my skepticism about Black’s claim). Black originally told me (paraphrasing from conversation) that “We did kinda become bottlenecked by compute—if CoreWeave had offered more GPUs, we probably could have [replicated GPT-3].” I interpreted the word “probably” to be more than 50% confidence.
See this section for PanGu-alpha and this section for BLOOM in an appendix.
See Shevlane (2022, p. 73): “The greatest bottleneck has been getting access to enough compute. Initially Eleuther was still using Google’s TFRC scheme. This was not sufficient…”
Shevlane (2022, p. 73): “[CoreWeave] planned to buy more NVIDIA GPUs and rent them out to people training large models. Connor told me: ‘So, the deal was: we test the hardware, we figure out what do you need to train these kinds of models . . . because they don’t have in-house capacity ML engineering talent. And then they buy [the hardware]. We get to train our model on it and release it for free. And everyone’s happy.’”
Shevlane (2022, p. 40): “I asked Aaron [one of the Brown University graduate students that did a project replicating GPT-2] what value the Google’s TFRC team would have seen in the project: ‘To test the systems, and just like...They just want to get more papers out there on it that can only be done on TPUs, because if you’re a company and you want to iterate on that for your own personal thing then you have to pay them to use TPUs. That’s basically it—that’s basically the value in general.’”
Sponsorship may also be important in the sense that it increases the number of people working on larger-scale AI projects, which may increase the number and expertise of AI engineers and researchers, who may then be hired by the leading AI labs.
On p.2 of the paper it says “We open source our code along with the training and evaluation pipelines at https://github.com/megatron-lm”. That link is broken, but version 4 of the paper (Shoeybi, 2020) changes the link to https://github.com/nvidia/megatron-lm, so I assume that both links refer to the same codebase, which has been updated over time.
On p.3 of the Megatron paper it says “Our work focuses on architectures similar to GPT-2.”
The paper’s Abstract page on arXiv says “Our code is open sourced at this https URL,” which links to the Megatron-LM GitHub repository.
See p.41 of the PDF.
See the “Specialised software tools used for development” column in the diffusion database.
See this appendix for my reasoning.
See Shevlane (2022, Ch 2 p. 3 or p. 66): “In addition to delaying the paper, another strategy was to write the paper in a way that avoids attention-grabbing. The paper was written so as to avoid ‘hype’ and include discussion of the model’s weaknesses.”
Another interesting aspect of the search trend is the regional breakdown. China was the region with the highest fraction of total searches; South Korea ranked 2nd at 34% of China’s level, and the US ranked 17th at 11% of China’s level. However, note that many small countries rank highly because the metric used is the fraction of total searches within the given region.
Shevlane (2022, p. 67).
Full correspondence is available upon request. It was not clear from Irving exactly what “GPT-3” refers to in that claim: insider knowledge of GPT-3 before the paper was published, the publication of the paper itself, the huge publicity after publication, or some combination of those events.
Or to produce a close enough replica of that model—the exact weight values of a trained model will always differ between independent training runs.
Note that I haven’t tried to predict how important each type of artifact will be in future diffusion cascades; I leave that to potential future research.
From my limited understanding of the Transformer architecture and how the architecture tends to be scaled up, it is conceivable that learned weights from a smaller model could be copied into a larger model, with the extra weights starting from freshly initialized values. But even if this is possible, I don’t think it would be as effective as training the full-size model from scratch, since I have not heard of this method being used effectively in practice.
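For concreteness, here is a toy numpy sketch of what such copying could look like. It is purely illustrative; I am not claiming this was done for any model discussed here, and the shapes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

small = rng.normal(size=(768, 768))                  # stand-in for a learned weight matrix
large = rng.normal(scale=0.02, size=(1024, 1024))    # freshly initialized larger layer

large[:768, :768] = small   # copy the learned block; the remaining weights keep their initial values
```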
This claim is based on all nine of the large language models that I studied in-depth detailing their model architecture and associated hyperparameters—see this column in the diffusion database.
Shevlane (2022, Ch. 2 p.3, or p.66): “Proponents of AGI risk will sometimes criticise OpenAI for contributing too much to advances in AI capabilities [...] It appears that these kinds of considerations did inform the way that GPT-3 was shared. [an OpenAI staff member] told me: ‘GPT-3 existed for a long time before the paper came out. We delayed the paper. That was one of the things we could do for AGI stuff. But it’s months, it doesn’t really count.’”
My best guess is that the GPT-3 175B model finished training in October 2019, seven months before publication in May 2020—my reasoning is in the note of this cell of the diffusion database. I guess that the evaluation and paper-writing process took about three months in total, based on my intuition of how long different steps take. I think this is longer than most AI research papers, but the paper is long and seems to have required unusually high effort. That implies a four-month delay in publication.
The Model Card in Appendix B of the paper (p.49) states the “Model Date” is December 2020, and according to the paper that introduces Model Cards this means “When was the model developed?” I interpret “developed” as the date that the model finished training—this interpretation is partly based on another detail from the Gopher paper (Rae et al., 2021): “We trained Gopher for 920 hours in November and December 2020 in Google’s Georgia datacentre.” (Appendix F, p.103)
This is based on at least two AI developers at leading AI labs agreeing with me in informal conversation that this does sometimes occur, but I do not have any record of those conversations.
The article states “One of the biggest secrets is the project OpenAI is working on next. Sources described it to me as the culmination of its previous four years of research: an AI system trained on images, text, and other data using massive computational resources.”