Let $N$ be the number of parameters in the model, $D$ be the number of data tokens it is trained on, $Q$ be the number of times the model is deployed (e.g. the number of questions it is asked) and $T$ be the number of inference steps each time it is deployed (e.g. the number of tokens per answer). Then this approximately works out to:[9]

$$\text{training compute} \approx 6ND \qquad \text{inference compute} \approx 2NQT$$
Note that scaling up the number of parameters, $N$, increases both pre-training compute and inference compute, because you need to use those parameters each time you run a forward pass in your model.
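To make the arithmetic concrete, here is a minimal sketch of the comparison, assuming the standard approximations of roughly $6ND$ FLOPs for training and roughly $2N$ FLOPs per generated token at inference. The specific numbers plugged in below are hypothetical, chosen only for illustration:

```python
def training_flops(N, D):
    # Standard approximation: ~6 FLOPs per parameter per training token.
    return 6 * N * D

def inference_flops(N, Q, T):
    # Standard approximation: ~2 FLOPs per parameter per generated token,
    # across Q deployments of T tokens each.
    return 2 * N * Q * T

# Hypothetical example: a 70B-parameter model trained on 1.4T tokens,
# then deployed to answer 1 billion questions at 500 tokens per answer.
N, D, Q, T = 70e9, 1.4e12, 1e9, 500

ratio = inference_flops(N, Q, T) / training_flops(N, D)
# The N cancels, so the ratio simplifies to Q*T / (3*D):
# lifetime inference compute relative to training compute depends only on
# how many tokens are generated versus how many the model was trained on.
print(f"inference / training compute ratio: {ratio:.3f}")
```

Note that because $N$ appears in both terms, it cancels in the ratio: whether inference or training dominates lifetime compute depends only on total generated tokens versus training tokens.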
Several variables are not showing up in the text.
Thanks for catching that — a lot of symbols in the appendix were lost when converting the post for the forum, so I’ve edited it to add them back in.