I’m also curious whether you think diffusion differed between GPT-2 and GPT-3, and what factors you think are relevant for explaining that difference, if any? I kinda forget my history, but I have a rough recollection that GPT-2 was successfully replicated faster.
I think Shevlane (2022) is currently the best source on this topic. Unfortunately it is not very accessible, since it’s written in the style of an academic thesis, but the Abstract of Chapter 2 (p. 63 of the PDF) gives an idea.
I didn’t explicitly compare to GPT-2, but I’d say that this section (“Diffusion can be significantly limited if (a) training compute cost is high and (b) developers don’t release their model weights; otherwise, developers need to rely more on keeping methods secret”) implicitly explains why GPT-3’s release strategy succeeded more than GPT-2’s: (a) GPT-3 happened to require two orders of magnitude more compute to train, and (b) no (smaller) versions of the GPT-3 model were open-sourced; only an API to GPT-3 was provided.