Ben Cottier

Karma: 374

At Epoch, helping to clarify when and how transformative AI capabilities will be developed.

Previously a Research Fellow on the AI Governance & Strategy team at Rethink Priorities.

Ben Cottier 19 Feb 2020 22:16 UTC
5 points
0 ∶ 0
on: Ben Cottier’s Shortform
TL;DR: Are there any forum posts or similarly accessible writing that clarify different notions of x-risk? If not, does such a post seem worth writing?
My impression is that prevailing notions of x-risk (i.e. what it means, not specific cause areas) have broadened or shifted over time, but there’s a lack of clarity about what notion/definition people are basing arguments on in discourse.
At the same time, discussion of x-risk sometimes seems too narrow. For example, in the most recent 80K podcast with Will MacAskill, they at one point talk about x-risk in terms of literal 100% human annihilation. IMO this is one of the least relevant notions of x-risk, for cause prioritisation purposes. Perhaps there’s a bias because literal human extinction is the most concrete/easy to explain/easy to reason about? Nowadays I frame longtermist cause prioritisation more like “what could cause the largest losses to the expected value of the future” than “what could plausibly annihilate humanity”.
Bostrom (2002) defined x-risk as “one where an adverse outcome would either annihilate Earth-originating intelligent life or permanently and drastically curtail its potential”. There is also a taxonomy in section 3 of the paper. Torres (2019) explains and analyses five different definitions of x-risk, which I think all have some merit.
To be clear I think many people have internalised broader notions of x-risk in their thoughts and arguments, both generally and for specific cause areas. I just think it could use some clarification and a call for people to clarify themselves, e.g. in a forum post.

Understanding the diffusion of large language models: summary

Ben Cottier 21 Dec 2022 13:49 UTC

127 points

18 comments · 22 min read · EA link

Background for “Understanding the diffusion of large language models”

Ben Cottier 21 Dec 2022 13:49 UTC

12 points

0 comments · 23 min read · EA link

GPT-3-like models are now much easier to access and deploy than to develop

Ben Cottier 21 Dec 2022 13:49 UTC

22 points

3 comments · 19 min read · EA link

The replication and emulation of GPT-3

Ben Cottier 21 Dec 2022 13:49 UTC

14 points

0 comments · 33 min read · EA link

Drivers of large language model diffusion: incremental research, publicity, and cascades

Ben Cottier 21 Dec 2022 13:50 UTC

21 points

0 comments · 29 min read · EA link

Publication decisions for large language models, and their impacts

Ben Cottier 21 Dec 2022 13:50 UTC

14 points

0 comments · 16 min read · EA link

Implications of large language model diffusion for AI governance

Ben Cottier 21 Dec 2022 13:50 UTC

14 points

0 comments · 38 min read · EA link

Questions for further investigation of AI diffusion

Ben Cottier 21 Dec 2022 13:50 UTC

28 points

0 comments · 11 min read · EA link

Conclusion and Bibliography for “Understanding the diffusion of large language models”

Ben Cottier 21 Dec 2022 13:50 UTC

12 points

0 comments · 11 min read · EA link

Ben Cottier 3 Jan 2023 10:53 UTC
3 points
0 ∶ 0
in reply to: Peter Wildeford’s comment on: Understanding the diffusion of large language models: summary
Pretty important detail! Thanks, I’ve changed it.

Ben Cottier 3 Jan 2023 11:18 UTC
3 points
0 ∶ 0
in reply to: Gavin’s comment on: Understanding the diffusion of large language models: summary
Thanks for raising this. On reflection, I think if I had started this project now (including re-considering my definition of “successful replication”) I probably would not have classed OPT-175B as a successful replication. I probably should flag this clearly in the post.
As noted in point 2(d) of the final section of the post, I was more-or-less sitting on this report for a few months. I made significant revisions during that period, but I was paying less attention to new evidence than before, so I missed some evidence that was important to update on.

Ben Cottier 3 Jan 2023 11:34 UTC
3 points
1 ∶ 0
in reply to: Gavin’s comment on: Understanding the diffusion of large language models: summary
“this might be goalpost shifting, since GPT3!2022 is a very different thing from GPT3!2020”
That’s a good point, but I think goalpost shifting is likely not significant in this case, which supports your original point. The OPT paper compares to “GPT-3” (or “GPT” in the plots, as shorthand I guess) for the prompting and few-shot evaluations (section 3). It says on p.3:
“We follow GPT-3 (Brown et al., 2020) by using their prompts and overall experimental setup. We compare primarily to GPT-3, having aimed to re-implement their evaluation settings, but include reported performance of other LLMs on a per-task basis when available (Lieber et al., 2021; Rae et al., 2021; Hoffmann et al., 2022; Black et al., 2022)”
Also on p.3 they refer to “numbers reported by Brown et al. (2020)”:
“In WIC, we see that the OPT models always outperform the GPT-3 models, though the numbers reported by Brown et al. (2020) also seem questionable, given WIC being a binary classification task.”
But p.3 also mentions:
“For MultiRC, we are unable to replicate the GPT-3 results using the Davinci API within our evaluation setup [...]”
It sounds to me like they used the original results from Brown et al. (2020) where available, but evaluated using the Davinci API as a cross-check or fallback.
In contrast, the paper talks about “Davinci” for the evaluations in subsequent sections, so this is presumably the API version of GPT-3 that was available at the time. It says on p.5 that “We compare primarily against GPT-3 Davinci, as these benchmarks were not yet available to be included in Brown et al. (2020).” I didn’t include these other evaluations (e.g. Bias and Toxicity) in my analysis; I’m just pointing this out to support my guess that the evaluations in section 3 are comparing to the original GPT-3.

Ben Cottier 3 Jan 2023 11:58 UTC
3 points
0 ∶ 0
in reply to: Peter Wildeford’s comment on: Understanding the diffusion of large language models: summary
You can find my take on that in this section, but I’ll put an excerpt of that here:
The main driver of this is improved GPU price performance. The actual GPT-3 training run used NVIDIA V100 GPUs, but OPT-175B and other more recent GPT-3-like models were trained on A100 GPUs. A100 and V100 GPUs currently have a similar price on Google Cloud. However, A100 can be up to six times more efficient than V100, since
1. V100 has about three times slower peak throughput (125 teraflop/s vs. 312 teraflop/s)
2. V100 has less than half the memory capacity of the 80GB A100 chip, at 32 GB, therefore requiring over two times the number of chips to fit a model in memory.
OPT also seems to have been trained with a higher hardware utilization rate than GPT-3 (the actual FLOP/s achieved divided by the theoretical peak FLOP/s for the hardware), if reported numbers are to be believed (only 21% for GPT-3 compared to 47% for OPT-175B). This is a smaller factor of difference compared to the hardware specs, but I think I ought to have mentioned it in the report.
As an aside, it’s still pretty unclear to me how different practitioners are measuring their reported utilization rates. For example is it a single measurement at a random time during training, or an average of multiple measurements, or the maximum of multiple measurements?
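The “up to six times” figure in the excerpt can be roughly reproduced from the quoted specs. This is a minimal sketch, assuming the throughput and memory factors simply multiply (an assumption, since the excerpt does not spell out how they combine); the variable names are illustrative. Note that 312 / 125 is closer to 2.5 than to 3.

```python
# Hardware figures as quoted in the excerpt above.
V100_PEAK_TFLOPS = 125   # teraflop/s
A100_PEAK_TFLOPS = 312   # teraflop/s
V100_MEMORY_GB = 32
A100_MEMORY_GB = 80

# Reported utilization rates: achieved FLOP/s divided by peak FLOP/s.
GPT3_UTILIZATION = 0.21
OPT_UTILIZATION = 0.47

throughput_ratio = A100_PEAK_TFLOPS / V100_PEAK_TFLOPS
memory_ratio = A100_MEMORY_GB / V100_MEMORY_GB

# Assumption: the two hardware factors combine multiplicatively.
hardware_ratio = throughput_ratio * memory_ratio

# Separate, smaller factor: how much more of the peak OPT-175B reportedly used.
utilization_ratio = OPT_UTILIZATION / GPT3_UTILIZATION

print(f"Peak throughput ratio (A100 / V100): {throughput_ratio:.2f}")
print(f"Memory ratio (A100 / V100):          {memory_ratio:.2f}")
print(f"Combined hardware ratio:             {hardware_ratio:.2f}")
print(f"Utilization ratio (OPT / GPT-3):     {utilization_ratio:.2f}")
```

The combined hardware ratio is an upper bound under these assumptions; real training runs need not be limited by both throughput and memory at once.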

Ben Cottier 3 Jan 2023 12:17 UTC
3 points
0 ∶ 0
in reply to: Peter Wildeford’s comment on: Understanding the diffusion of large language models: summary
“For example, with DALLE-2 my understanding is that similar capabilities were obtained by much lower resource actors (Midjourney, Stable Diffusion) and I’m curious what the relevant differences are to explain the much more rapid diffusion there. (The irony in the name ‘Stable Diffusion’ being a model resulting from diffusion is funny.)”
I think the training compute requirement and hardware improvements are two key differences here. Epoch’s database currently estimates the training compute of Stable Diffusion as 5E+22 FLOP (link to the spreadsheet cell). That is about 6 times smaller than the estimated FLOP for GPT-3, at 3.14E+23 FLOP.
As I said in another comment, the leap from NVIDIA V100 (used to train GPT-3) to NVIDIA A100 (used to train Stable Diffusion) seems to enable a ~6x improvement in efficiency (in turn a 6x reduction in $ cost). So as a back-of-the-envelope calculation that would put Stable Diffusion at ~36x cheaper to train than the original GPT-3 training run.
There could also be algorithmic/engineering reasons why a model like Stable Diffusion is easier to produce, but I haven’t looked into that.
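The back-of-the-envelope calculation above can be written out explicitly. This is a rough sketch that uses only the figures quoted in this comment and assumes training cost scales linearly with FLOP and inversely with hardware efficiency; the variable names are illustrative.

```python
# Training compute estimates quoted above (FLOP).
GPT3_TRAINING_FLOP = 3.14e23
STABLE_DIFFUSION_TRAINING_FLOP = 5e22

# Approximate V100 -> A100 efficiency gain quoted above.
HARDWARE_EFFICIENCY_GAIN = 6

# How many times less compute Stable Diffusion needed than GPT-3.
compute_ratio = GPT3_TRAINING_FLOP / STABLE_DIFFUSION_TRAINING_FLOP

# Assumption: cost is proportional to FLOP / hardware efficiency,
# so the two factors multiply.
cost_ratio = compute_ratio * HARDWARE_EFFICIENCY_GAIN

print(f"Compute ratio (GPT-3 / Stable Diffusion): {compute_ratio:.2f}")
print(f"Estimated training cost ratio:            {cost_ratio:.1f}")
```

With the compute ratio rounded to 6, this gives the ~36x figure; with the unrounded ratio it is slightly higher.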

Ben Cottier 3 Jan 2023 12:25 UTC
3 points
0 ∶ 0
in reply to: Peter Wildeford’s comment on: Understanding the diffusion of large language models: summary
“I’m curious also if you think diffusion has differed between GPT-2 and GPT-3 and what factors you think are relevant for explaining that difference, if any? I kinda forget my history but I have a rough recollection that GPT-2 was successfully replicated faster.”
I think Shevlane (2022) is currently the best source on this topic. Unfortunately it is not very accessible due to the style of an academic thesis. But the Abstract of Chapter 2 (p.63 of the PDF) gives an idea.
I didn’t explicitly compare to GPT-2 but I’d say that this section (“Diffusion can be significantly limited if (a) training compute cost is high and (b) developers don’t release their model weights; otherwise, developers need to rely more on keeping methods secret”) is implicitly explaining why GPT-3’s release strategy succeeded more than GPT-2’s release strategy: (a) there was the opportunistic fact that GPT-3 required 2 orders of magnitude more compute to train, and (b) no (smaller) versions of the GPT-3 model were open-sourced; only an API to GPT-3 was provided.

Ben Cottier 3 Jan 2023 12:37 UTC
3 points
0 ∶ 0
in reply to: Peter Wildeford’s comment on: Understanding the diffusion of large language models: summary
“how sensitive do you think your conclusions are to the choice of using GPT-3 as your point of reference?”
I tried to qualify claims to account for using a single point of reference, e.g. just talk about pre-trained language models rather than all ML models. However, as I note in the final section of this post, my claims about the broader implications of this research have the lowest confidence and resilience. It feels really hard to quantify the sensitivity overall (I’m not sure if you have a way to measure this in mind). But my off-the-cuff intuition is that if my language model case studies turn out to not at all generalise in the way that I assumed, my % likelihoods for the generalised claims throughout the sequence would change by 20 percentage points on average.

Trends in the dollar training cost of machine learning systems

Ben Cottier 1 Feb 2023 14:48 UTC

63 points

3 comments · 1 min read · EA link

Ben Cottier 2 Feb 2023 21:40 UTC
10 points
2 ∶ 0
in reply to: HaydnBelfield’s comment on: Trends in the dollar training cost of machine learning systems
Thanks Haydn!
I just want to add caution on taking the extrapolations too seriously. The linear extrapolation is not my all-things-considered view of what is going to happen, and the shaded region is just the uncertainty in the linear regression trendline rather than my subjective uncertainty in the estimates.
I agree with you inasmuch as I expect the initial costs of state-of-the-art models to get well out of reach for actors other than big tech (if we include labs with massive investment like OpenAI), and states, by 2030. I still have significant uncertainty about this though. Plausibly, the biggest players in AI won’t be willing to spend $100M just on the computation for a final training run as soon as 2030. We still don’t have a great understanding of what hardware and software progress will be like in future (though Epoch has worked on this). Maybe efficiency improves faster than expected and/or there just won’t be worthwhile gains from spending so much in order to compete.
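To illustrate the caution above about the shaded region: the uncertainty in a fitted trendline is narrower than the uncertainty about where an individual new system would land, and both widen when extrapolating. The sketch below is not the analysis from the post. It fits an ordinary least-squares line to invented placeholder values (the `YEARS` and `LOG10_COST` lists are hypothetical, not Epoch data) and compares the two standard errors; the function name is also illustrative.

```python
import math

# Hypothetical placeholder data, invented for illustration only.
YEARS = [2012, 2014, 2016, 2018, 2020, 2022]
LOG10_COST = [2.1, 3.4, 3.9, 5.2, 6.1, 6.4]  # log10 of training cost in dollars


def trend_uncertainty(xs, ys, x_new):
    """Fit y = a + b*x by least squares and return, at x_new:
    (fitted value, standard error of the trendline,
     standard error of a new individual observation)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual variance with n - 2 degrees of freedom.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    s2 = sum(r * r for r in residuals) / (n - 2)
    # Leverage grows with distance from the mean of the observed years.
    leverage = 1 / n + (x_new - x_mean) ** 2 / sxx
    se_trend = math.sqrt(s2 * leverage)
    se_new_point = math.sqrt(s2 * (1 + leverage))
    return intercept + slope * x_new, se_trend, se_new_point


fitted, se_trend, se_new_point = trend_uncertainty(YEARS, LOG10_COST, 2030)
print(f"Extrapolated log10(cost) in 2030: {fitted:.2f}")
print(f"Standard error of the trendline:  {se_trend:.2f}")
print(f"Standard error of a new system:   {se_new_point:.2f}")
```

Neither standard error captures the possibility that the trend itself changes, which is part of why a regression band is not the same thing as subjective uncertainty.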

Ben Cottier 2 Feb 2023 21:40 UTC
8 points
1 ∶ 0
in reply to: HaydnBelfield’s comment on: Trends in the dollar training cost of machine learning systems
Also, I’d like to be clear about what it means to “keep up”. I expect those lower-resourced types of actors won’t keep up in the sense that they won’t be the first to advance state-of-the-art on the most important AI capabilities. But the cost of a given ML system falls over time and that is a big driver of how AI capabilities diffuse.