I know you mention this as an area for further investigation, but I wonder if it is more central to your main point—how sensitive do you think your conclusions are to the choice of using GPT-3 as your point of reference?
For example, with DALLE-2 my understanding is that similar capabilities were obtained by much lower resource actors (Midjourney, Stable Diffusion) and I’m curious what the relevant differences are to explain the much more rapid diffusion there. (The irony in the name “Stable Diffusion” being a model resulting from diffusion is funny.)
I’m curious also if you think diffusion has differed between GPT-2 and GPT-3 and what factors you think are relevant for explaining that difference, if any? I kinda forget my history but I have a rough recollection that GPT-2 was successfully replicated faster.
In fact, I think that GPT-3 may have been surprisingly hard to replicate relative to other models, especially in light of other people’s comments (and also my understanding) that OPT-175B is notably worse than GPT-3 at useful tasks. I’m curious if that’s true (is GPT-3 indeed hard to replicate and is it harder than other models) and why that is (why GPT-3 appears so hard to replicate well).
Similarly, I'd also be curious whether you've looked much into the extent to which Anthropic has been able to replicate GPT-3? It seems like they also have (closed) access to similar models. I'd also be curious what Redwood Research has been able to create, given that I think they have substantially fewer resources than Anthropic, OpenAI, or Meta. I imagine some labs being in possession of non-published models might complicate your analysis somewhat?
Anthropic has the same people who originally made GPT-3, so them replicating GPT-3 is a different sort of diffusion than other actors doing so. I'd guess they internally matched GPT-3 when they'd existed for ~6 months, so mid-2021. Claude, their public beta model, has similar performance to ChatGPT in conversation but has not been benchmarked.
Redwood Research has not attempted to train any large language models from scratch, and hasn't even reproduced GPT-2. Redwood does have employees who've worked at OpenAI, and could likely reproduce GPT-3 if needed.
how sensitive do you think your conclusions are to the choice of using GPT-3 as your point of reference?
I tried to qualify claims to account for using a single point of reference, e.g. by talking about pre-trained language models rather than all ML models. However, as I note in the final section of this post, my claims about the broader implications of this research have the lowest confidence and resilience. It feels really hard to quantify the sensitivity overall (I'm not sure if you have a way to measure this in mind). But my off-the-cuff intuition is that if my language model case studies turn out not to generalise at all in the way that I assumed, my % likelihoods for the generalised claims throughout the sequence would change by 20 percentage points on average.
I’m curious also if you think diffusion has differed between GPT-2 and GPT-3 and what factors you think are relevant for explaining that difference, if any? I kinda forget my history but I have a rough recollection that GPT-2 was successfully replicated faster.
I think Shevlane (2022) is currently the best source on this topic. Unfortunately it is not very accessible due to the style of an academic thesis. But the Abstract of Chapter 2 (p.63 of the PDF) gives an idea.
I didn't explicitly compare to GPT-2, but I'd say that this section ("Diffusion can be significantly limited if (a) training compute cost is high and (b) developers don't release their model weights; otherwise, developers need to rely more on keeping methods secret") implicitly explains why GPT-3's release strategy succeeded more than GPT-2's release strategy: (a) there was the fortuitous fact that GPT-3 required two orders of magnitude more compute to train than GPT-2, and (b) no (smaller) versions of the GPT-3 model were open-sourced; only an API to GPT-3 was provided.
For example, with DALLE-2 my understanding is that similar capabilities were obtained by much lower resource actors (Midjourney, Stable Diffusion) and I’m curious what the relevant differences are to explain the much more rapid diffusion there. (The irony in the name “Stable Diffusion” being a model resulting from diffusion is funny.)
I think the training compute requirement and hardware improvements are two key differences here. Epoch’s database currently estimates the training compute of Stable Diffusion as 5E+22 FLOP (link to the spreadsheet cell). That is about 6 times smaller than the estimated FLOP for GPT-3, at 3.14E+23 FLOP.
As I said in another comment, the leap from NVIDIA V100 (used to train GPT-3) to NVIDIA A100 (used to train Stable Diffusion) seems to enable a ~6x improvement in efficiency (and in turn a ~6x reduction in dollar cost). So as a back-of-the-envelope calculation, that would put Stable Diffusion at ~36x cheaper to train than the original GPT-3 training run.
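To make that back-of-the-envelope reasoning explicit, here is a minimal sketch of the arithmetic (the FLOP figures are the Epoch estimates quoted above, and the ~6x hardware factor is the assumption from my other comment, not a measured value):

```python
# Minimal sketch of the back-of-the-envelope estimate above (all inputs are rough estimates).
gpt3_flop = 3.14e23              # estimated training compute for GPT-3 (Epoch database)
stable_diffusion_flop = 5e22     # estimated training compute for Stable Diffusion (Epoch database)

compute_ratio = gpt3_flop / stable_diffusion_flop   # ~6.3x less training compute needed
hardware_factor = 6.0                                # assumed ~6x V100 -> A100 cost-efficiency gain

print(f"{compute_ratio * hardware_factor:.0f}x")     # ~38x, i.e. roughly the "~36x cheaper" figure
```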
There could also be algorithmic/engineering reasons why a model like Stable Diffusion is easier to produce, but I haven’t looked into that.
GPT-3 is algorithmically easier to reproduce, in the sense that it's a simpler architecture with fewer and more robust hyperparameters. Its engineering is harder to reproduce, because it needs more model parallelism. People have estimated GPT-3 to cost ~$5M to train, and Stable Diffusion ~$300k, which is similar to the ~36x figure you quoted.
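For what it's worth, here is a quick check of how those dollar estimates relate to the compute-based figure (both cost numbers are the rough estimates quoted just above, not measurements):

```python
# Compare the quoted dollar estimates to the ~36x compute-based figure.
gpt3_cost_usd = 5_000_000            # rough estimate for the original GPT-3 training run
stable_diffusion_cost_usd = 300_000  # rough estimate for Stable Diffusion

ratio = gpt3_cost_usd / stable_diffusion_cost_usd
print(f"{ratio:.1f}x")               # ~16.7x, the same order of magnitude as the ~36x estimate
```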