You can find my take on that in this section, but I’ll put an excerpt of that here:
The main driver of this is improved GPU price performance. The actual GPT-3 training run used NVIDIA V100 GPUs, but OPT-175B and other more recent GPT-3-like models were trained on A100 GPUs. A100 and V100 GPUs currently have a similar price on Google Cloud. However, A100 can be up to six times more efficient than V100, since
V100 has about three times slower peak throughput (125 teraflop/s vs. 312 teraflop/s)
V100 has less than half the memory capacity of the 80GB A100 chip, at 32 GB, therefore requiring over two times the number of chips to fit a model in memory.
OPT also seems to have been trained with a higher hardware utilization rate than GPT-3 (the actual FLOP/s achieved divided by the theoretical peak FLOP/s for the hardware), if reported numbers are to be believed (only 21% for GPT-3 compared to 47% for OPT-175B). This is a smaller factor of difference compared to the hardware specs, but I think I ought to have mentioned it in the report.
As an aside, it’s still pretty unclear to me how different practitioners are measuring their reported utilization rates. For example is it a single measurement at a random time during training, or an average of multiple measurements, or the maximum of multiple measurements?
You can find my take on that in this section, but I’ll put an excerpt of that here:
OPT also seems to have been trained with a higher hardware utilization rate than GPT-3 (the actual FLOP/s achieved divided by the theoretical peak FLOP/s for the hardware), if reported numbers are to be believed (only 21% for GPT-3 compared to 47% for OPT-175B). This is a smaller factor of difference compared to the hardware specs, but I think I ought to have mentioned it in the report.
As an aside, it’s still pretty unclear to me how different practitioners are measuring their reported utilization rates. For example is it a single measurement at a random time during training, or an average of multiple measurements, or the maximum of multiple measurements?