Apologies if I could easily figure it out on my own by diving into your source material, but I’m curious what factors allowed OPT-175B to be trained about 10x as cheaply as GPT-3?
You can find my take on that in this section, but here's an excerpt:
The main driver of this is improved GPU price-performance. The actual GPT-3 training run used NVIDIA V100 GPUs, but OPT-175B and other more recent GPT-3-like models were trained on A100 GPUs. A100 and V100 GPUs currently have a similar price on Google Cloud. However, the A100 can be up to six times more efficient than the V100 (see the rough arithmetic after this list), since:
- The V100 has roughly 2.5 times lower peak throughput (125 teraflop/s vs. 312 teraflop/s).
- The V100 has less than half the memory capacity of the 80 GB A100 chip, at 32 GB, and therefore roughly 2.5 times as many chips are needed to fit a model of a given size in memory.
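To make the combined factor concrete, here is a back-of-the-envelope sketch that simply multiplies the two hardware ratios quoted above. The peak-throughput figures are the FP16 tensor-core numbers from the list, and the result is only illustrative; real efficiency also depends on interconnect, parallelism strategy, and so on.

```python
# Back-of-the-envelope only: multiplies the two ratios quoted above.
v100_peak_tflops = 125   # V100 peak throughput (FP16 tensor cores)
a100_peak_tflops = 312   # A100 peak throughput (FP16 tensor cores)
v100_mem_gb = 32         # V100 memory capacity
a100_mem_gb = 80         # 80 GB A100 variant

throughput_ratio = a100_peak_tflops / v100_peak_tflops  # ~2.5x
memory_ratio = a100_mem_gb / v100_mem_gb                # ~2.5x fewer chips needed

combined = throughput_ratio * memory_ratio              # ~6x
print(f"throughput: {throughput_ratio:.1f}x, memory: {memory_ratio:.1f}x, "
      f"combined: ~{combined:.0f}x")
```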
OPT also seems to have been trained with a higher hardware utilization rate than GPT-3 (the actual FLOP/s achieved divided by the theoretical peak FLOP/s of the hardware), if the reported numbers are to be believed: only 21% for GPT-3, compared to 47% for OPT-175B. This is a smaller factor than the difference in hardware specs, but I think I ought to have mentioned it in the report.
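As a purely illustrative sanity check, the achieved-FLOP/s numbers below are backed out of the quoted utilization percentages and the peak figures above; they are not values reported in the papers.

```python
# Illustrative only: achieved FLOP/s inferred from the quoted utilization
# percentages and peak figures, not from reported values.
gpt3_peak_tflops = 125   # V100 peak used for GPT-3
opt_peak_tflops = 312    # A100 peak used for OPT-175B

gpt3_util = 0.21         # quoted GPT-3 utilization
opt_util = 0.47          # quoted OPT-175B utilization

# utilization = achieved FLOP/s / theoretical peak FLOP/s, so:
gpt3_achieved = gpt3_util * gpt3_peak_tflops   # ~26 TFLOP/s per GPU
opt_achieved = opt_util * opt_peak_tflops      # ~147 TFLOP/s per GPU

utilization_ratio = opt_util / gpt3_util       # ~2.2x
print(f"GPT-3: ~{gpt3_achieved:.0f} TFLOP/s per GPU, "
      f"OPT-175B: ~{opt_achieved:.0f} TFLOP/s per GPU, "
      f"utilization ratio: ~{utilization_ratio:.1f}x")
```

Naively multiplying this ~2.2x by the hardware factor above gives something in the low teens, which overshoots ~10x a bit; that seems consistent with the hardware factor being an upper bound ("up to six times").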
As an aside, it’s still pretty unclear to me how different practitioners are measuring their reported utilization rates. For example, is it a single measurement at a random time during training, an average of multiple measurements, or the maximum of multiple measurements?
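Those choices can give noticeably different headline numbers. The sketch below uses made-up utilization samples purely to illustrate why the measurement methodology matters.

```python
# Hypothetical utilization samples over a training run (fractions of peak).
# The numbers are made up; the point is only that the aggregation choice
# changes the headline figure.
samples = [0.31, 0.45, 0.47, 0.44, 0.12, 0.46, 0.43]  # e.g. a dip during checkpointing

single_snapshot = samples[2]              # one reading at an arbitrary time
average = sum(samples) / len(samples)     # mean over the run
maximum = max(samples)                    # best observed value

print(f"snapshot: {single_snapshot:.0%}, average: {average:.0%}, max: {maximum:.0%}")
```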