At Epoch, helping to clarify when and how transformative AI capabilities will be developed.
Previously a Research Fellow on the AI Governance & Strategy team at Rethink Priorities.
Pretty important detail! Thanks, I’ve changed it.
Thanks for raising this. On reflection, I think if I had started this project now (including re-considering my definition of “successful replication”) I probably would not have classed OPT-175B as a successful replication. I probably should flag this clearly in the post.
As noted in point 2(d) of the final section of the post, I was more-or-less sitting on this report for a few months. I made significant revisions during that period, but I was paying less attention to new evidence than before, so I missed some evidence that was important to update on.
this might be goalpost shifting, since GPT3!2022 is a very different thing from GPT3!2020
That’s a good point, but I think goalpost shifting is likely not significant in this case, which supports your original point. The OPT paper compares to “GPT-3” (or “GPT” in the plots, as shorthand I guess) for the prompting and few-shot evaluations (section 3). It says on p.3:
We follow GPT-3 (Brown et al., 2020) by using their prompts and overall experimental setup. We compare primarily to GPT-3, having aimed to re-implement their evaluation settings, but include reported performance of other LLMs on a per-task basis when available (Lieber et al., 2021; Rae et al., 2021; Hoffmann et al., 2022; Black et al., 2022)
Also on p.3 they refer to “numbers reported by Brown et al. (2020)”:
In WIC, we see that the OPT models always outperform the GPT-3 models, though the numbers reported by Brown et al. (2020) also seem questionable, given WIC being a binary classification task.
But p.3 also mentions:
For MultiRC, we are unable to replicate the GPT-3 results using the Davinci API within our evaluation setup [...]
It sounds to me like they used the original results from Brown et al. (2020) where available, but evaluated using the Davinci API as a cross-check or fallback.
In contrast, the paper talks about “Davinci” for the evaluations in subsequent sections, so this is presumably the API version of GPT-3 that was available at the time. It says on p.5 that “We compare primarily against GPT-3 Davinci, as these benchmarks were not yet available to be included in Brown et al. (2020).” I didn’t include these other evaluations (e.g. Bias and Toxicity) in my analysis; I’m just pointing this out to support my guess that the evaluations in section 3 are comparing to the original GPT-3.
You can find my take on that in this section, but I’ll put an excerpt of that here:
The main driver of this is improved GPU price performance. The actual GPT-3 training run used NVIDIA V100 GPUs, but OPT-175B and other more recent GPT-3-like models were trained on A100 GPUs. A100 and V100 GPUs currently have a similar price on Google Cloud. However, A100 can be up to six times more efficient than V100, since
V100 has about three times slower peak throughput (125 teraflop/s vs. 312 teraflop/s)
V100 has less than half the memory capacity of the 80GB A100 chip, at 32 GB, therefore requiring over two times the number of chips to fit a model in memory.
OPT also seems to have been trained with a higher hardware utilization rate than GPT-3 (the actual FLOP/s achieved divided by the theoretical peak FLOP/s for the hardware), if reported numbers are to be believed (only 21% for GPT-3 compared to 47% for OPT-175B). This is a smaller factor of difference compared to the hardware specs, but I think I ought to have mentioned it in the report.
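As a rough sketch, the factors quoted above can be multiplied out in Python. The input figures are the ones stated in this comment. Combining the throughput and memory factors by simple multiplication is an assumption about how the "up to six times" upper bound arises, and treating utilization as a further independent factor is likewise a simplification, not something the report claims.

```python
# Figures quoted in the comment above.
V100_PEAK_TFLOPS = 125.0   # peak throughput, teraflop/s
A100_PEAK_TFLOPS = 312.0   # peak throughput, teraflop/s
V100_MEMORY_GB = 32.0
A100_MEMORY_GB = 80.0
GPT3_UTILIZATION = 0.21    # reported hardware utilization for GPT-3
OPT_UTILIZATION = 0.47     # reported hardware utilization for OPT-175B

# Ratio of A100 to V100 on each axis.
throughput_factor = A100_PEAK_TFLOPS / V100_PEAK_TFLOPS
memory_factor = A100_MEMORY_GB / V100_MEMORY_GB

# Assumption: the two hardware factors compound multiplicatively,
# giving an upper bound on the efficiency gain.
hardware_factor = throughput_factor * memory_factor

# Separate, smaller factor from the reported utilization rates.
utilization_factor = OPT_UTILIZATION / GPT3_UTILIZATION

print(f"throughput factor:  {throughput_factor:.2f}x")
print(f"memory factor:      {memory_factor:.2f}x")
print(f"hardware factor:    {hardware_factor:.2f}x (upper bound)")
print(f"utilization factor: {utilization_factor:.2f}x")
```

On these numbers the throughput and memory ratios are each about 2.5x, so their product is a little over 6x, while the utilization ratio is a bit over 2x.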
As an aside, it’s still pretty unclear to me how different practitioners are measuring their reported utilization rates. For example is it a single measurement at a random time during training, or an average of multiple measurements, or the maximum of multiple measurements?
For example, with DALLE-2 my understanding is that similar capabilities were obtained by much lower resource actors (Midjourney, Stable Diffusion) and I’m curious what the relevant differences are to explain the much more rapid diffusion there. (The irony in the name “Stable Diffusion” being a model resulting from diffusion is funny.)
I think the training compute requirement and hardware improvements are two key differences here. Epoch’s database currently estimates the training compute of Stable Diffusion as 5E+22 FLOP (link to the spreadsheet cell). That is about 6 times smaller than the estimated FLOP for GPT-3, at 3.14E+23 FLOP.
As I said in another comment, the leap from NVIDIA V100 (used to train GPT-3) to NVIDIA A100 (used to train Stable Diffusion) seems to enable a ~6x improvement in efficiency (in turn a 6x reduction in $ cost). So as a back-of-the-envelope calculation that would put Stable Diffusion at ~36x cheaper to train than the original GPT-3 training run.
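The back-of-the-envelope calculation can be written out as a short Python sketch. The two FLOP estimates and the ~6x hardware gain are the figures given above. The variable names are illustrative, and the calculation assumes training cost scales linearly with FLOP and inversely with hardware efficiency, which is a simplification.

```python
# Training compute estimates cited above (FLOP).
GPT3_TRAINING_FLOP = 3.14e23
STABLE_DIFFUSION_TRAINING_FLOP = 5e22

# Approximate V100 -> A100 efficiency gain cited above.
HARDWARE_EFFICIENCY_GAIN = 6.0

# How many times less compute Stable Diffusion needed than GPT-3.
compute_ratio = GPT3_TRAINING_FLOP / STABLE_DIFFUSION_TRAINING_FLOP

# Assumption: cost is proportional to FLOP and inversely proportional
# to hardware efficiency, so the two factors multiply.
cost_ratio = compute_ratio * HARDWARE_EFFICIENCY_GAIN

print(f"compute ratio: {compute_ratio:.2f}x")
print(f"cost ratio:    {cost_ratio:.1f}x")
```

Without rounding the compute ratio down to 6, the result comes out slightly above 36x, which is well within the precision of an estimate like this.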
There could also be algorithmic/engineering reasons why a model like Stable Diffusion is easier to produce, but I haven’t looked into that.
I’m curious also if you think diffusion has differed between GPT-2 and GPT-3 and what factors you think are relevant for explaining that difference, if any? I kinda forget my history but I have a rough recollection that GPT-2 was successfully replicated faster.
I think Shevlane (2022) is currently the best source on this topic. Unfortunately, it is not very accessible because it is written in the style of an academic thesis. But the Abstract of Chapter 2 (p.63 of the PDF) gives an idea.
I didn’t explicitly compare to GPT-2 but I’d say that this section (“Diffusion can be significantly limited if (a) training compute cost is high and (b) developers don’t release their model weights; otherwise, developers need to rely more on keeping methods secret”) is implicitly explaining why GPT-3’s release strategy succeeded more than GPT-2’s release strategy: (a) there was the opportunistic fact that GPT-3 required 2 orders of magnitude more compute to train, and (b) no (smaller) versions of the GPT-3 model were open-sourced; only an API to GPT-3 was provided.
how sensitive do you think your conclusions are to the choice of using GPT-3 as your point of reference?
I tried to qualify claims to account for using a single point of reference, e.g. just talk about pre-trained language models rather than all ML models. However, as I note in the final section of this post, my claims about the broader implications of this research have the lowest confidence and resilience. It feels really hard to quantify the sensitivity overall (I’m not sure if you have a way to measure this in mind). But my off-the-cuff intuition is that if my language model case studies turn out to not at all generalise in the way that I assumed, my % likelihoods for the generalised claims throughout the sequence would change by 20 percentage points on average.
Thanks Haydn!
I just want to add caution on taking the extrapolations too seriously. The linear extrapolation is not my all-things-considered view of what is going to happen, and the shaded region is just the uncertainty in the linear regression trendline rather than my subjective uncertainty in the estimates.
I agree with you inasmuch as I expect the initial costs of state-of-the-art models to get well out of reach for actors other than big tech (if we include labs with massive investment like OpenAI), and states, by 2030. I still have significant uncertainty about this though. Plausibly, the biggest players in AI won’t be willing to spend $100M just on the computation for a final training run as soon as 2030. We still don’t have a great understanding of what hardware and software progress will be like in future (though Epoch has worked on this). Maybe efficiency improves faster than expected and/or there just won’t be worthwhile gains from spending so much in order to compete.
Also, I’d like to be clear about what it means to “keep up”. I expect those lower-resourced types of actors won’t keep up in the sense that they won’t be the first to advance state-of-the-art on the most important AI capabilities. But the cost of a given ML system falls over time and that is a big driver of how AI capabilities diffuse.
TL;DR are there any forum posts or similarly accessible writing that clarify different notions of x-risk? If not, does it seem worth writing?
My impression is that prevailing notions of x-risk (i.e. what it means, not specific cause areas) have broadened or shifted over time, but there’s a lack of clarity about what notion/definition people are basing arguments on in discourse.
At the same time, discussion of x-risk sometimes seems too narrow. For example, in the most recent 80K podcast with Will MacAskill, they at one point talk about x-risk in terms of literal 100% human annihilation. IMO this is one of the least relevant notions of x-risk, for cause prioritisation purposes. Perhaps there’s a bias because literal human extinction is the most concrete/easy to explain/easy to reason about? Nowadays I frame longtermist cause prioritisation more like “what could cause the largest losses to the expected value of the future” than “what could plausibly annihilate humanity”.
Bostrom (2002) defined x-risk as “one where an adverse outcome would either annihilate Earth-originating intelligent life or permanently and drastically curtail its potential”. There is also a taxonomy in section 3 of the paper. Torres (2019) explains and analyses five different definitions of x-risk, which I think all have some merit.
To be clear I think many people have internalised broader notions of x-risk in their thoughts and arguments, both generally and for specific cause areas. I just think it could use some clarification and a call for people to clarify themselves, e.g. in a forum post.