At Epoch, helping to clarify when and how transformative AI capabilities will be developed.
Previously a Research Fellow on the AI Governance & Strategy team at Rethink Priorities.
Hi Charlotte—as you can imagine, estimating the latter is much more difficult because it requires reasoning about counterfactuals. But I do have some thoughts on it in this section of a post in the sequence.
I think the key claim you’d be looking for there is:
My best guess is that the knowledge of GPT-3’s existence sped up both DeepMind and Google’s work scaling up language models by six months (90% CI: 1–18 months). But I have not been able to distinguish whether this acceleration was driven by insider knowledge, or the publication of GPT-3, or the hype generated after publication, or some combination of those factors.
As my 90% confidence interval shows, I’m very uncertain, but I hope this helps.
I have made a big update regarding this claim:
What about for a very large-scale application of a GPT-3-like model—for example, generating text equivalent to 1% of global Twitter activity for one year, or assisting one million software developers with coding for one year? I estimate that deploying a model like BLOOM in these ways would be 20% of the cost of developing the model (90% CI: 10 to 68%), in terms of the dollar cost of compute alone. This means that deployment is most likely much less prohibitive than development. But it means I give a 5% chance that for the largest-scale applications, the cost of deploying the model is at least 68% of the cost of developing the model, which would make deployment similarly prohibitive.
The claims about the cost of the specific deployment scenarios (which were oversimplified to begin with) may still be fairly accurate. But in terms of the intent behind the estimates I made, I think I greatly underestimated the largest scale of deployment for LLMs, a scale which is becoming more common and which I understand a little better. I now think that for the largest, most commercially successful LLMs, the total compute spent on deployment is much larger than in development.
My update was mostly influenced by several more sources (more credible than the ones I reviewed in the post) suggesting that the total compute that major AI companies spend on inference is significantly larger than the total compute spent on training and experimentation:
https://arxiv.org/pdf/2111.00364.pdf, p.3, Fig. 3 caption: “At Facebook, we observe a rough power capacity breakdown of 10:20:70 for AI infrastructures devoted to the three key phases — Experimentation, Training, and Inference”. Also, “Considering the primary stages of the ML pipeline end-to-end, the energy footprint of RM1 is roughly 31:29:40 over Data, Experimentation/Training, and Inference”.[1][2]
https://arxiv.org/abs/2204.05149, p.7: “Across all three years, about ⅗ of ML energy use is for inference and ⅖ for training. These measurements include all ML energy usage: research, development, testing, and production.”
https://www.semianalysis.com/p/the-inference-cost-of-search-disruption: “inference costs far exceed training costs when deploying a model at any reasonable scale. In fact, the costs to inference ChatGPT exceed the training costs on a weekly basis.”
However, this doesn’t significantly update my conclusion about the importance of focusing on development rather than deployment as a target of intervention (point 2c in the Key Takeaways). This is because of the other strong reasons to focus on development that I mention. I would revise point 2c to say that, even if the total compute spent on training is smaller, training compute tends to be more up-front and all-or-nothing than deployment compute, which can be scaled up quite smoothly. This creates a greater barrier.
I have edited the post to point out this comment, but for the sake of posterity and prioritizing other projects, I won’t be updating the rest of the post.
Power and energy usage are not 1-1 with compute usage, especially over time as new hardware improves energy efficiency. But there is a clear relationship: computation requires running GPUs for some time, which consumes a fairly consistent amount of average power. I don’t expect that improvements in energy efficiency have a big impact on the ratio of development and deployment compute.
RM1 denotes one of Facebook’s six models that “account for a vast majority of compute resources for the overall inference predictions at Facebook, serving billions of users world wide” (see footnote 4 on p.4). RM1 is the single most carbon-intensive model out of these six models (see Fig 4 on p.4).
Hi! I’m in Lisbon until Sunday. Excited to meet you!
Also, I’d like to be clear about what it means to “keep up”. I expect those lower-resourced types of actors won’t keep up in the sense that they won’t be the first to advance state-of-the-art on the most important AI capabilities. But the cost of a given ML system falls over time and that is a big driver of how AI capabilities diffuse.
Thanks Haydn!
I just want to add caution on taking the extrapolations too seriously. The linear extrapolation is not my all-things-considered view of what is going to happen, and the shaded region is just the uncertainty in the linear regression trendline rather than my subjective uncertainty in the estimates.
I agree with you inasmuch as I expect the initial costs of state-of-the-art models to get well out of reach for actors other than big tech (if we include labs with massive investment like OpenAI), and states, by 2030. I still have significant uncertainty about this though. Plausibly, the biggest players in AI won’t be willing to spend $100M just on the computation for a final training run as soon as 2030. We still don’t have a great understanding of what hardware and software progress will be like in future (though Epoch has worked on this). Maybe efficiency improves faster than expected and/or there just won’t be worthwhile gains from spending so much in order to compete.
how sensitive do you think your conclusions are to the choice of using GPT-3 as your point of reference?
I tried to qualify claims to account for using a single point of reference, e.g. just talking about pre-trained language models rather than all ML models. However, as I note in the final section of this post, my claims about the broader implications of this research have the lowest confidence and resilience. It feels really hard to quantify the sensitivity overall (I’m not sure if you have a way to measure this in mind). But my off-the-cuff intuition is that if my language model case studies turn out not to generalise at all in the way that I assumed, my % likelihoods for the generalised claims throughout the sequence would change by 20 percentage points on average.
I’m curious also if you think diffusion has differed between GPT-2 and GPT-3 and what factors you think are relevant for explaining that difference, if any? I kinda forget my history but I have a rough recollection that GPT-2 was successfully replicated faster.
I think Shevlane (2022) is currently the best source on this topic. Unfortunately it is not very accessible due to the style of an academic thesis. But the Abstract of Chapter 2 (p.63 of the PDF) gives an idea.
I didn’t explicitly compare to GPT-2 but I’d say that this section (“Diffusion can be significantly limited if (a) training compute cost is high and (b) developers don’t release their model weights; otherwise, developers need to rely more on keeping methods secret”) is implicitly explaining why GPT-3’s release strategy succeeded more than GPT-2’s release strategy: (a) there was the opportunistic fact that GPT-3 required 2 orders of magnitude more compute to train, and (b) no (smaller) versions of the GPT-3 model were open-sourced; only an API to GPT-3 was provided.
For example, with DALLE-2 my understanding is that similar capabilities were obtained by much lower resource actors (Midjourney, Stable Diffusion) and I’m curious what the relevant differences are to explain the much more rapid diffusion there. (The irony in the name “Stable Diffusion” being a model resulting from diffusion is funny.)
I think the training compute requirement and hardware improvements are two key differences here. Epoch’s database currently estimates the training compute of Stable Diffusion as 5E+22 FLOP (link to the spreadsheet cell). That is about 6 times smaller than the estimated FLOP for GPT-3, at 3.14E+23 FLOP.
As I said in another comment, the leap from NVIDIA V100 (used to train GPT-3) to NVIDIA A100 (used to train Stable Diffusion) seems to enable a ~6x improvement in efficiency (in turn a 6x reduction in $ cost). So as a back-of-the-envelope calculation that would put Stable Diffusion at ~36x cheaper to train than the original GPT-3 training run.
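To make the back-of-the-envelope arithmetic explicit, here is a minimal sketch using only the figures quoted in this thread (the exact product comes out closer to ~38x; rounding both factors to ~6 gives the ~36x figure):

```python
# Figures quoted above; the ~6x hardware factor is the approximate V100 -> A100
# cost-efficiency improvement discussed in this thread.
gpt3_train_flop = 3.14e23
stable_diffusion_train_flop = 5e22
hardware_efficiency_gain = 6

compute_ratio = gpt3_train_flop / stable_diffusion_train_flop  # ~6.3x less training compute
overall_cost_ratio = compute_ratio * hardware_efficiency_gain  # ~38x, i.e. roughly "~36x" after rounding
print(f"~{compute_ratio:.1f}x less compute, ~{overall_cost_ratio:.0f}x cheaper to train")
```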
There could also be algorithmic/engineering reasons why a model like Stable Diffusion is easier to produce, but I haven’t looked into that.
You can find my take on that in this section, but I’ll put an excerpt of that here:
The main driver of this is improved GPU price performance. The actual GPT-3 training run used NVIDIA V100 GPUs, but OPT-175B and other more recent GPT-3-like models were trained on A100 GPUs. A100 and V100 GPUs currently have a similar price on Google Cloud. However, the A100 can be up to six times more efficient than the V100, since:
The V100 has about 2.5 times lower peak throughput (125 teraFLOP/s vs. 312 teraFLOP/s).
The V100 has less than half the memory capacity of the 80GB A100 chip (32GB vs. 80GB), therefore requiring more than twice as many chips to fit a model in memory.
OPT also seems to have been trained with a higher hardware utilization rate than GPT-3 (the actual FLOP/s achieved divided by the theoretical peak FLOP/s for the hardware), if reported numbers are to be believed (only 21% for GPT-3 compared to 47% for OPT-175B). This is a smaller factor of difference compared to the hardware specs, but I think I ought to have mentioned it in the report.
As an aside, it’s still pretty unclear to me how different practitioners are measuring their reported utilization rates. For example is it a single measurement at a random time during training, or an average of multiple measurements, or the maximum of multiple measurements?
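To illustrate what a utilization rate means in practice, here is a minimal sketch with mostly hypothetical numbers: the cluster size and wall-clock time are made up for illustration, and only the training FLOP estimate and V100 peak throughput come from figures quoted in this thread.

```python
# Hypothetical worked example of a hardware utilization rate:
# achieved FLOP/s divided by theoretical peak FLOP/s.
total_training_flop = 3.14e23      # GPT-3 training compute estimate quoted above
n_gpus = 5_000                     # hypothetical cluster size
peak_flops_per_gpu = 125e12        # V100 peak throughput (125 teraFLOP/s), as in the excerpt above
training_seconds = 30 * 24 * 3600  # hypothetical wall-clock time of 30 days

achieved_flops_per_gpu = total_training_flop / (n_gpus * training_seconds)
utilization = achieved_flops_per_gpu / peak_flops_per_gpu
print(f"Implied utilization: {utilization:.0%}")  # ~19% with these made-up inputs
```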
this might be goalpost shifting, since GPT3!2022 is a very different thing from GPT3!2020
That’s a good point, but I think goalpost shifting is likely not significant in this case, which supports your original point. The OPT paper compares to “GPT-3” (or “GPT” in the plots, as shorthand I guess) for the prompting and few-shot evaluations (section 3). It says on p.3:
We follow GPT-3 (Brown et al., 2020) by using their prompts and overall experimental setup. We compare primarily to GPT-3, having aimed to re-implement their evaluation settings, but include reported performance of other LLMs on a per-task basis when available (Lieber et al., 2021; Rae et al., 2021; Hoffmann et al., 2022; Black et al., 2022)
Also on p.3 they refer to “numbers reported by Brown et al. (2020)”
In WIC, we see that the OPT models always outperform the GPT-3 models, though the numbers reported by Brown et al. (2020) also seem questionable, given WIC being a binary classification task.
But p.3 also mentions
For MultiRC, we are unable to replicate the GPT-3 results using the Davinci API within our evaluation setup [...]
It sounds to me like they used the original results from Brown et al. (2020) where available, but evaluated using the Davinci API as a cross-check or fallback.
In contrast, the paper talks about “Davinci” for the evaluations in subsequent sections, so this is presumably the API version of GPT-3 that was available at the time. It says on p.5 that “We compare primarily against GPT-3 Davinci, as these benchmarks were not yet available to be included in Brown et al. (2020).” I didn’t include these other evaluations (e.g. Bias and Toxicity) in my analysis; I’m just pointing this out to support my guess that the evaluations in section 3 are comparing to the original GPT-3.
Thanks for raising this. On reflection, I think if I had started this project now (including re-considering my definition of “successful replication”) I probably would not have classed OPT-175B as a successful replication. I probably should flag this clearly in the post.
As noted in point 2(d) of the final section of the post, I was more-or-less sitting on this report for a few months. I made significant revisions during that period, but I was paying less attention to new evidence than before, so I missed some evidence that was important to update on.
Pretty important detail! Thanks, I’ve changed it.
This is some advice I wrote about doing back-of-the-envelope calculations (BOTECs) and uncertainty estimation, which are often useful as part of forecasting. This advice isn’t supposed to be a comprehensive guide by any means. The advice originated from specific questions that someone I was mentoring asked me. Note that I’m still fairly inexperienced with forecasting. If you’re someone with experience in forecasting, uncertainty estimation, or BOTECs, I’d love to hear how you would expand or deviate from this advice.
How to do uncertainty estimation?
A BOTEC estimates one number from a series of calculations. So I think a good way to estimate uncertainty is to assign a credible interval to each input of the calculation, then propagate the uncertainty in the inputs through to the output of the calculation.
I recommend Squiggle for this (the Python version is https://github.com/rethinkpriorities/squigglepy/).
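As a minimal sketch of the idea, using plain numpy rather than Squiggle and with entirely hypothetical inputs, assigning a 90% credible interval to each input and propagating by Monte Carlo might look like this (squigglepy wraps the same pattern much more conveniently):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000  # Monte Carlo samples

def lognorm_from_90ci(lo, hi, n=N):
    """Sample a lognormal whose 5th/95th percentiles are (lo, hi)."""
    mu = (np.log(lo) + np.log(hi)) / 2
    sigma = (np.log(hi) - np.log(lo)) / (2 * 1.645)
    return rng.lognormal(mean=mu, sigma=sigma, size=n)

# Hypothetical inputs, purely for illustration: tokens served per year and FLOP per token.
tokens_per_year = lognorm_from_90ci(1e11, 1e13)
flop_per_token = lognorm_from_90ci(1e11, 1e12)

# Propagate: the output distribution is just the element-wise product of the samples.
deployment_flop = tokens_per_year * flop_per_token

p5, p50, p95 = np.percentile(deployment_flop, [5, 50, 95])
print(f"Deployment compute ~ {p50:.2e} FLOP (90% CI: {p5:.2e} to {p95:.2e})")
```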
How to assign a credible interval:
Normally I choose a 90% interval. This is the default in Squiggle.
If you have a lot of data about the thing (say, >10 values), and the sample of data doesn’t seem particularly biased, then it might be reasonable to use the standard deviation of the data. (Measure this in log-space if you have reason to think it’s distributed log-normally—see next point about choosing the distribution.) Then compute the 90% credible interval as +/- 1.645*std, assuming a (log-)normal distribution.
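For example, a minimal numpy sketch of that last point, with hypothetical data assumed to be roughly log-normally distributed:

```python
import numpy as np

# Hypothetical measurements spanning about an order of magnitude.
data = np.array([3e22, 8e22, 2e23, 5e23])

logs = np.log(data)
center = logs.mean()
std = logs.std(ddof=1)  # sample standard deviation, measured in log-space

# 90% credible interval as +/- 1.645 * std around the center, then back to linear scale.
lo, hi = np.exp(center - 1.645 * std), np.exp(center + 1.645 * std)
print(f"Best guess {np.exp(center):.2e}, 90% CI: {lo:.2e} to {hi:.2e}")
```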
How to choose the distribution:
It’s usually a choice between log-normal and normal.
If the variable seems like the sort of thing that could vary by orders of magnitude, then log-normal is best. Otherwise, normal.
You can use the data points you have, or the credible interval you chose, to inform this.
When in doubt, I’d say that most of the time (for AI-related BOTECs), log-normal distribution is a good choice. Log-normal is the default distribution in Squiggle when you specify a credible interval.
A uniform distribution might occasionally be useful if there are strict lower and upper bounds to the value and the value varies roughly uniformly. But you can clip other distributions by strict bounds in Squiggle using lclip and rclip.
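Continuing the numpy sketches above (hypothetical values throughout), the normal vs. log-normal choice and hard bounds might look like this; in Squiggle/squigglepy you would pass lclip/rclip instead of clipping samples yourself:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000

# A quantity that could plausibly vary by orders of magnitude -> log-normal.
cost_multiplier = rng.lognormal(mean=np.log(3), sigma=0.8, size=N)

# A quantity with modest, roughly symmetric variation -> normal.
utilization = rng.normal(loc=0.35, scale=0.08, size=N)

# Hard bounds: utilization must lie in (0, 1]. Clipping samples is the crude
# numpy analogue of using lclip/rclip in Squiggle.
utilization = np.clip(utilization, 0.01, 1.0)

print(np.percentile(cost_multiplier, [5, 50, 95]))
print(np.percentile(utilization, [5, 50, 95]))
```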
If you do sanity checks and they conflict, how do you update?
This goes without saying, but double-check the calculations. Don’t go on a deep-dive before being confident that the calculations are implemented correctly.
Account for uncertainty. Are the estimates really in conflict, or could the confidence intervals in the estimates overlap?
Consult other people about why this conflict occurred and how it could be resolved.
If you have an explicit model that produced your original estimate, then I think it’s best to first try to find the flaw in your model. If it’s a flaw that could be patched somehow, then patch it and see if there is still conflict.
If there’s no clear way to patch your model, then try a different model entirely. See if that alternate model’s estimate is in conflict with the sanity-check value. If there’s no conflict (or less conflict) then the new model is most likely a better model.
If there’s no alternate model that you can feasibly use, then you might resort to adjusting your estimate directly by some fudge factor, or averaging your estimate with the sanity-check value. But be sure to communicate in your write-up about the original estimates that conflicted, and explain how you resolved the conflict.
How should you make a central estimate or best guess from a small number of data points (e.g. three)?
If the data points vary by large factors or orders of magnitude, then the geometric mean is probably best, since it corresponds to taking the arithmetic mean of the logarithms and exponentiating back (see the sketch after this list).
Otherwise, the arithmetic mean is fine.
If you think that the credibility of each data point varies significantly, you should assign different weights to each data point.
I don’t know of a general, principled way to set weights; it seems pretty intuition-based. But if one data point seems twice as credible or incorporates information that is twice as reliable, for instance, then it makes sense to assign it twice as much weight.
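A minimal numpy sketch of both points, with hypothetical data points and weights:

```python
import numpy as np

# Hypothetical estimates spanning orders of magnitude, plus intuition-based weights
# (e.g. the first source seems twice as credible as the others).
estimates = np.array([2e23, 8e23, 5e24])
weights = np.array([2.0, 1.0, 1.0])

# Unweighted geometric mean: arithmetic mean in log-space, then exponentiate.
geo_mean = np.exp(np.mean(np.log(estimates)))

# Weighted geometric mean: weighted average of the logs.
weighted_geo_mean = np.exp(np.average(np.log(estimates), weights=weights))

print(f"Geometric mean: {geo_mean:.2e}")
print(f"Weighted geometric mean: {weighted_geo_mean:.2e}")
```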