I have made a big update regarding this claim:

What about for a very large-scale application of a GPT-3-like model—for example, generating text equivalent to 1% of global Twitter activity for one year, or assisting one million software developers with coding for one year? I estimate that deploying a model like BLOOM in these ways would be 20% of the cost of developing the model (90% CI: 10 to 68%), in terms of the dollar cost of compute alone. This means that deployment is most likely much less prohibitive than development. But it means I give a 5% chance that for the largest-scale applications, the cost of deploying the model is at least 68% of the cost of developing the model, which would make deployment similarly prohibitive.
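To unpack the numbers in that claim, here is a minimal sketch of the implied distribution, assuming the 90% CI is modeled as a lognormal fit to its endpoints. The distributional choice is my assumption, not necessarily how the original estimate was produced; note that the fitted median (~0.26) differs a little from the 20% point estimate, since the interval is asymmetric around it.

```python
import numpy as np

# Sketch (my assumption): fit a lognormal to the stated 90% CI of
# [0.10, 0.68] for the deployment/development compute cost ratio.
rng = np.random.default_rng(0)

lo, hi = 0.10, 0.68  # 5th and 95th percentiles of the ratio
z = 1.645            # standard-normal quantile for a central 90% interval
mu = (np.log(lo) + np.log(hi)) / 2
sigma = (np.log(hi) - np.log(lo)) / (2 * z)

samples = rng.lognormal(mean=mu, sigma=sigma, size=1_000_000)
print(f"median ratio:     {np.median(samples):.2f}")      # ~0.26
print(f"P(ratio >= 0.68): {(samples >= hi).mean():.3f}")  # ~0.05 by construction
```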
The claims about the cost of the specific deployment scenarios (which were oversimplified to begin with) may still be fairly accurate. But in terms of the intent behind those estimates, I think I greatly underestimated the largest scale of deployment for LLMs, a scale which is becoming more common and which I understand somewhat better now. I now think that for the largest, most commercially successful LLMs, the total compute spent on deployment is much larger than that spent on development.
My update was mostly influenced by several additional sources (more credible than the ones I reviewed in the post) suggesting that the total compute major AI companies spend on inference is significantly larger than the total compute they spend on training and experimentation (a rough comparison of the quoted figures follows the sources below):
https://arxiv.org/pdf/2111.00364.pdf, p.3, Fig. 3 caption: “At Facebook, we observe a rough power capacity breakdown of 10:20:70 for AI infrastructures devoted to the three key phases — Experimentation, Training, and Inference”. Also, “Considering the primary stages of the ML pipeline end-to-end, the energy footprint of RM1 is roughly 31:29:40 over Data, Experimentation/Training, and Inference”.[1][2]
https://arxiv.org/abs/2204.05149, p.7: “Across all three years, about ⅗ of ML energy use is for inference and ⅖ for training. These measurements include all ML energy usage: research, development, testing, and production.”
https://www.semianalysis.com/p/the-inference-cost-of-search-disruption: “inference costs far exceed training costs when deploying a model at any reasonable scale. In fact, the costs to inference ChatGPT exceed the training costs on a weekly basis.”
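To make those quoted splits comparable, here is my own back-of-the-envelope conversion of each into an inference-to-(experimentation + training) ratio; the simplifications (e.g. dropping the data stage from the RM1 breakdown) are mine:

```python
# Back-of-the-envelope arithmetic on the quoted figures (my own
# simplifications, e.g. the data stage is excluded from the RM1 ratio).

# Facebook power capacity, Experimentation : Training : Inference = 10:20:70
fb_ratio = 70 / (10 + 20)

# Facebook RM1 energy, Data : Experimentation/Training : Inference = 31:29:40
rm1_ratio = 40 / 29

# Google energy, ~3/5 for inference vs ~2/5 for training
google_ratio = (3 / 5) / (2 / 5)

print(f"Facebook (power capacity): {fb_ratio:.1f}x")   # ~2.3x
print(f"Facebook RM1 (energy):     {rm1_ratio:.1f}x")  # ~1.4x
print(f"Google (energy):           {google_ratio:.1f}x")  # 1.5x
```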
However, this doesn’t significantly update my conclusion about the importance of focusing on development rather than deployment as a target of intervention (point 2c in the Key Takeaways). This is because of the other strong reasons to focus on development that I mention. I would revise point 2c to say that, even if training compute is smaller in total than deployment compute, it has to be spent more up-front and all-or-nothing, whereas deployment can be scaled up quite smoothly. This makes development the greater barrier.
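To illustrate that asymmetry, here is a toy cost model with made-up numbers (illustrative only, not estimates from the post): training must be paid in full before the model is useful at all, while inference spending accrues with usage and can be throttled at any point.

```python
# Toy model with made-up numbers: training is an all-or-nothing up-front
# cost, while inference spending scales smoothly with usage.

TRAIN_COST = 1.0           # normalized up-front training cost
INFER_COST_PER_DAY = 0.01  # inference cost per day at full usage

def cumulative_cost(days: int, usage_fraction: float = 1.0) -> float:
    """Total compute spend after `days` of deployment at a given usage level."""
    return TRAIN_COST + days * usage_fraction * INFER_COST_PER_DAY

# After a year at full usage, inference spending exceeds the training cost...
print(cumulative_cost(365))                       # 4.65
# ...but a deployer can dial usage down smoothly, whereas a developer
# cannot pay for half a training run and get half a model.
print(cumulative_cost(365, usage_fraction=0.1))   # 1.365
```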
I have edited the post to point out this comment, but for the sake of posterity and prioritizing other projects, I won’t be updating the rest of the post.
Power and energy usage are not 1-1 with compute usage, especially over time as new hardware improves energy efficiency. But there is a clear relationship: computation requires running GPUs for some time, which consumes a fairly consistent amount of average power. I don’t expect that improvements in energy efficiency have a big impact on the ratio of development and deployment compute.
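As a rough illustration of that relationship (the efficiency figure below is an assumption of mine, not taken from the sources above): compute is approximately energy multiplied by achieved FLOP per joule, so if development and deployment run on similar hardware at similar utilization, the efficiency term cancels out of their compute ratio.

```python
# Rough energy-to-compute conversion (illustrative; the FLOP/J figure
# is an assumed value, not taken from the cited papers).

def compute_flop(avg_power_w: float, hours: float, flop_per_joule: float) -> float:
    """Compute delivered, given average power draw, runtime, and efficiency."""
    energy_joules = avg_power_w * hours * 3600
    return energy_joules * flop_per_joule

# Example: one GPU drawing 400 W for 24 h at an assumed 1e11 FLOP/J
print(f"{compute_flop(400, 24, 1e11):.2e} FLOP")  # ~3.5e18 FLOP
```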
RM1 denotes one of Facebook’s six models that “account for a vast majority of compute resources for the overall inference predictions at Facebook, serving billions of users world wide” (see footnote 4 on p.4). RM1 is the single most carbon-intensive model out of these six models (see Fig 4 on p.4).