Your reasoning in footnote 4 is sound, but note that practitioners often complain that OPT is much worse than GPT-3 (or even GPT-NeoX) in qualitative / practical terms. Benchmark goodharting is real.
(Even so, this might be goalpost shifting, since GPT3!2022 is a very different thing from GPT3!2020.)
I also wanted to say that, from talking to a bunch of people and reading ML blogs/Reddit/Twitter, my impression was that OPT is much worse than GPT-3, despite similar performance on some of the benchmarks, so I think this comparison is pretty off. For example: https://twitter.com/sir_deenicus/status/1606360611524206592
this might be goalpost shifting, since GPT3!2022 is a very different thing from GPT3!2020
That’s a good point, but I think goalpost shifting is likely not significant in this case, which supports your original point. The OPT paper compares to “GPT-3” (or “GPT” in the plots, as shorthand I guess) for the prompting and few-shot evaluations (section 3). It says on p.3:
We follow GPT-3 (Brown et al., 2020) by using their prompts and overall experimental setup. We compare primarily to GPT-3, having aimed to re-implement their evaluation settings, but include reported performance of other LLMs on a per-task basis when available (Lieber et al., 2021; Rae et al., 2021; Hoffmann et al., 2022; Black et al., 2022)
Also on p.3 they refer to “numbers reported by Brown et al. (2020)”:
In WIC, we see that the OPT models always outperform the GPT-3 models, though the numbers reported by Brown et al. (2020) also seem questionable, given WIC being a binary classification task.
But p.3 also mentions:
For MultiRC, we are unable to replicate the GPT-3 results using the Davinci API within our evaluation setup [...]
It sounds to me like they used the original results from Brown et al. (2020) where available, but evaluated using the Davinci API as a cross-check or fallback.
In contrast, the paper talks about “Davinci” for the evaluations in subsequent sections, so this is presumably the API version of GPT-3 that was available at the time. It says on p.5 that “We compare primarily against GPT-3 Davinci, as these benchmarks were not yet available to be included in Brown et al. (2020).” I didn’t include these other evaluations (e.g. Bias and Toxicity) in my analysis; I’m just pointing this out to support my guess that the evaluations in section 3 are comparing to the original GPT-3.
Thanks for raising this. On reflection, I think if I had started this project now (including re-considering my definition of “successful replication”) I probably would not have classed OPT-175B as a successful replication. I probably should flag this clearly in the post.
As noted in point 2(d) of the final section of the post, I was more-or-less sitting on this report for a few months. I made significant revisions during that period, but I was paying less attention to new evidence than before, so I missed some evidence that was important to update on.