Your reasoning in footnote 4 is sound, but note that practitioners often complain that OPT is much worse than GPT-3 (or even GPT-NeoX) in qualitative / practical terms. Benchmark goodharting is real.
(Even so, this might be goalpost shifting, since GPT3!2022 is a very different thing from GPT3!2020.)
I also wanted to say that, from talking to a bunch of people and reading ML blogs/Reddit/Twitter, my impression was that OPT is much worse than GPT-3, despite similar performance on some of the benchmarks, so I think this comparison is pretty off. For example: https://twitter.com/sir_deenicus/status/1606360611524206592
this might be goalpost shifting, since GPT3!2022 is a very different thing from GPT3!2020
That’s a good point, but I think goalpost shifting is likely not significant in this case, which supports your original point. The OPT paper compares to “GPT-3” (or “GPT” in the plots, as shorthand I guess) for the prompting and few-shot evaluations (section 3). It says on p.3:
We follow GPT-3 (Brown et al., 2020) by using their prompts and overall experimental setup. We compare primarily to GPT-3, having aimed to re-implement their evaluation settings, but include reported performance of other LLMs on a per-task basis when available (Lieber et al., 2021; Rae et al., 2021; Hoffmann et al., 2022; Black et al., 2022)
Also on p.3 they refer to “numbers reported by Brown et al. (2020)”:
In WIC, we see that the OPT models always outperform the GPT-3 models, though the numbers reported by Brown et al. (2020) also seem questionable, given WIC being a binary classification task.
But p.3 also mentions:
For MultiRC, we are unable to replicate the GPT-3 results using the Davinci API within our evaluation setup [...]
It sounds to me like they used the original results from Brown et al. (2020) where available, but evaluated using the Davinci API as a cross-check or fallback.
In contrast, the paper talks about “Davinci” for the evaluations in subsequent sections, so this is presumably the API version of GPT-3 that was available at the time. It says on p.5 that “We compare primarily against GPT-3 Davinci, as these benchmarks were not yet available to be included in Brown et al. (2020).” I didn’t include these other evaluations (e.g. Bias and Toxicity) in my analysis; I’m just pointing this out to support my guess that the evaluations in section 3 are comparing to the original GPT-3.
Thanks for raising this. On reflection, I think if I had started this project now (including re-considering my definition of “successful replication”) I probably would not have classed OPT-175B as a successful replication. I probably should flag this clearly in the post.
As noted in point 2(d) of the final section of the post, I was more-or-less sitting on this report for a few months. I made significant revisions during that period, but I was paying less attention to new evidence than before, so I missed some evidence that was important to update on.