For example, with DALL-E 2 my understanding is that similar capabilities were obtained by much lower-resource actors (Midjourney, Stable Diffusion), and I’m curious what the relevant differences are that explain the much more rapid diffusion there. (There is some irony in “Stable Diffusion”, a diffusion model, being the example of capability diffusion.)
I think the training compute requirement and hardware improvements are two key differences here. Epoch’s database currently estimates the training compute of Stable Diffusion at 5E+22 FLOP (link to the spreadsheet cell). That is about 6 times less than the estimated training compute for GPT-3, at 3.14E+23 FLOP.
As I said in another comment, the leap from the NVIDIA V100 (used to train GPT-3) to the NVIDIA A100 (used to train Stable Diffusion) seems to enable a ~6x improvement in cost-efficiency (and hence roughly a 6x reduction in dollar cost). So as a back-of-the-envelope calculation, that would put Stable Diffusion at ~36x cheaper to train than the original GPT-3 training run.
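The back-of-the-envelope calculation above can be written out explicitly. This is just a sketch combining the two factors from the comment (the compute gap and an assumed ~6x V100-to-A100 cost-efficiency gain); the exact FLOP estimates are from Epoch’s database as quoted above.

```python
# Back-of-the-envelope cost comparison, using the figures quoted above.
gpt3_flop = 3.14e23   # estimated GPT-3 training compute
sd_flop = 5e22        # Epoch's estimate for Stable Diffusion

compute_ratio = gpt3_flop / sd_flop   # ~6x less compute for Stable Diffusion

# Assumption: V100 -> A100 gives roughly a 6x improvement in $/FLOP.
hw_efficiency_gain = 6

# Multiplying the two factors gives the overall cost gap.
total_cost_ratio = compute_ratio * hw_efficiency_gain

print(f"compute ratio: ~{compute_ratio:.1f}x")
print(f"overall cost ratio: ~{total_cost_ratio:.0f}x cheaper")
```

Note that using the unrounded compute ratio (6.28x rather than 6x) gives ~38x rather than ~36x; the comment rounds both factors to 6.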
There could also be algorithmic/engineering reasons why a model like Stable Diffusion is easier to produce, but I haven’t looked into that.
GPT-3 is algorithmically easier to reproduce, in the sense that it’s a simpler architecture with fewer and more robust hyperparameters. Its engineering is harder to reproduce, because it needs more model parallelism. People have estimated GPT-3 to cost ~$5M to train and Stable Diffusion ~$300k, which is broadly consistent with the ~36x figure you quoted.
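As a quick sanity check on that comparison (using the ~$5M and ~$300k estimates quoted above, which are themselves rough):

```python
# Rough consistency check on the quoted cost estimates.
gpt3_cost = 5_000_000   # estimated GPT-3 training cost (rough)
sd_cost = 300_000       # estimated Stable Diffusion training cost (rough)

observed_ratio = gpt3_cost / sd_cost   # ~17x from the cost estimates
predicted_sd_cost = gpt3_cost / 36     # what the ~36x factor would predict

print(f"observed cost ratio: ~{observed_ratio:.0f}x")
print(f"36x factor would predict: ~${predicted_sd_cost:,.0f}")
```

So the two estimates agree to within a factor of ~2, i.e. the same order of magnitude, which is about as much as a back-of-the-envelope calculation can promise.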