this might be goalpost shifting, since GPT3!2022 is a very different thing from GPT3!2020
That’s a good point, but I think goalpost shifting is likely not significant in this case, which supports your original point. The OPT paper compares to “GPT-3” (or “GPT” in the plots, as shorthand I guess) for the prompting and few-shot evaluations (section 3). It says on p.3:
We follow GPT-3 (Brown et al., 2020) by using their prompts and overall experimental setup. We compare primarily to GPT-3, having aimed to re-implement their evaluation settings, but include reported performance of other LLMs on a per-task basis when available (Lieber et al., 2021; Rae et al., 2021; Hoffmann et al., 2022; Black et al., 2022).
Also on p.3 they refer to the "numbers reported by Brown et al. (2020)":
In WIC, we see that the OPT models always outperform the GPT-3 models, though the numbers reported by Brown et al. (2020) also seem questionable, given WIC being a binary classification task.
But p.3 also mentions:
For MultiRC, we are unable to replicate the GPT-3 results using the Davinci API within our evaluation setup [...]
It sounds to me like they used the original results from Brown et al. (2020) where available, but evaluated using the Davinci API as a cross-check or fallback.
In contrast, the paper talks about “Davinci” for the evaluations in subsequent sections, so this is presumably the API version of GPT-3 that was available at the time. It says on p.5 that “We compare primarily against GPT-3 Davinci, as these benchmarks were not yet available to be included in Brown et al. (2020).” I didn’t include these other evaluations (e.g. Bias and Toxicity) in my analysis; I’m just pointing this out to support my guess that the evaluations in section 3 are comparing to the original GPT-3.