Within Epoch there is a months-long debate about how we should report growth rates for certain key quantities such as the amount of compute used for training runs.
I have been an advocate of an unusual choice: orders-of-magnitude per year (abbreviated OOMs/year). Why is that? Let’s look at other popular choices.
Doubling times. This has become the standard in AI forecasting, and it's a terrible metric. On the plus side, it is intuitive and familiar to both policy makers and researchers. But it is absolutely horrid for calculations. For example, if I know that the cost of AI training runs is doubling every 0.6 years, and the FLOP/$ is doubling every 2.5 years, then the FLOP per training run is doubling every (1/0.6 + 1/2.5)^-1 ≈ 0.48 years, which is very difficult to solve in your head! [1] [2]
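To spell out that arithmetic, here is a minimal sketch in Python (the variable names are mine, purely for illustration): doubling times do not add directly; you have to convert each to a rate, add the rates, and invert.

```python
# Doubling times combine awkwardly: convert each to a rate
# (doublings per year), add the rates, then invert the sum.
cost_doubling_time = 0.6             # years per doubling of training-run cost
flop_per_dollar_doubling_time = 2.5  # years per doubling of FLOP/$

combined_rate = 1 / cost_doubling_time + 1 / flop_per_dollar_doubling_time
combined_doubling_time = 1 / combined_rate

print(f"FLOP per training run doubles every {combined_doubling_time:.2f} years")
# -> FLOP per training run doubles every 0.48 years
```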
Percent growth. This is the choice often favoured in economics, where eg the growth rate of GDP is reported as 3% per year. Unlike doubling times, percent growth composes nicely: you just have to add the rates up! [3] However, I also find percent changes somewhat prone to confusion. For instance, when I tell people that model size has increased 200% since X years ago, I have sometimes had people misunderstand this as saying that it has increased by a factor of 2.
Ultimately, a very common operation that I find myself doing in my head is “if the effective FLOP used in AI training runs grows at a certain rate, how quickly will we traverse from the scale of current training runs (1e25 FLOP) to a certain threshold (eg 1e30)?”. OOMs/year makes this computation easy, even if I need to account for multiple factors such as hardware improvements, investment and algorithmic improvements. Eg if these grow respectively at 0.1 OOM/year, 0.5 OOM/year and 0.4 OOM/year, then I know the total effective growth is 1.0 OOM/year, and it will take 5 years to cross that 5 OOM scale gap. And if investment suddenly stopped growing, I would immediately see that the pace would be halved, and the gap would then take 10 years to cross.
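As a sketch of that mental arithmetic in Python (the per-factor breakdown below is the illustrative one from this paragraph, not an Epoch estimate):

```python
import math

# Illustrative component growth rates in OOMs/year.
growth = {"hardware": 0.1, "investment": 0.5, "algorithms": 0.4}

gap_ooms = math.log10(1e30) - math.log10(1e25)  # 5 OOMs to cross
total_rate = sum(growth.values())               # 1.0 OOM/year
print(f"{gap_ooms / total_rate:.0f} years to cross the gap")            # 5

# If investment stops growing, drop its term: the pace halves.
slower_rate = total_rate - growth["investment"]                         # 0.5
print(f"{gap_ooms / slower_rate:.0f} years without investment growth")  # 10
```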
Sadly OOMs/year is uncommon, and both researchers and policy makers struggle to understand it. I think this is a missed opportunity, and that AI forecasting would be easier to reason about if we moved to it, or at the very least abandoned the badly behaved doubling-time framing.
What do you think? Do you agree we should move past doubling times to a better choice? Which choice would you favour?
I won’t enter into the technical details, but doubling times also behave very unintuitively once you combine them with uncertainty. We once had a discussion because some estimated doubling times looked like they had to be wrong: they spanned from days to years! It turns out that doubling times are very sensitive to noise, which is what misled our intuitions.
I’d also argue that Christiano’s operationalization of slow takeoff is a terrible definition, and that a big part of that terribleness stems from doubling times being very unintuitive.
This is because percent growth rates have the useful property that ln(1+g) ≈ g for percent changes g close to zero, which links percent growth to a straightforward model of exponential growth such as x_t = x_0 exp(gt). But this approximation breaks down for percent changes above 100% (g > 1), which are often seen in AI forecasting.
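A quick numerical check of where the approximation holds and where it breaks (the specific values of g below are just illustrative):

```python
import math

# ln(1 + g) is close to g only for small percent changes g.
for g in [0.03, 0.30, 1.00, 3.00]:
    print(f"g = {g:.2f}  ->  ln(1 + g) = {math.log(1 + g):.3f}")
# g = 0.03  ->  ln(1 + g) = 0.030   (3% growth: excellent approximation)
# g = 3.00  ->  ln(1 + g) = 1.386   (300% growth: nowhere near g)
```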
Agree that it’s easier to talk about (change)/(time) rather than (time)/(change). As you say, (change)/(time) adds better. And agree that % growth rates are terrible for a bunch of reasons once you are talking about rates >50%.
I’d weakly advocate for “doublings per year”: (i) 1 doubling/year is more like a natural unit; that’s already a pretty high rate of growth, and it’s easier to talk about multiple doublings per year than a fraction of an OOM per year; (ii) there is a word for “doubling” and no word for “increased by an OOM”; (iii) I think the arithmetic is easier.
But people might find factors of 10 so much more intuitive than factors of 2 that OOMs/year is better. I suspect this is increasingly true as you are talking more to policy makers and less to people in ML, but might even be true in ML since people are so used to quoting big numbers in scientific notation.
(I’d probably defend my definitional choice for slow takeoff, but that seems like a different topic.)
What about factor increase per year, reported alongside a second number to show how the increases compose (e.g. the factor increase per decade)? So “compute has been increasing by 1.4x per year, or 28x per decade” or sth.
The main problem with OOMs is fractional OOMs, like your recent headline of “0.1 OOMs”. Very few people are going to interpret this right, whereas they’d do much better with “2 OOMs”.
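Since all of these units are rescalings of the same log-scale slope, converting between them is mechanical. A minimal Python sketch, using the 1.4x/year and 0.1 OOM figures from the comments above:

```python
import math

factor_per_year = 1.4                            # "1.4x per year"

ooms_per_year = math.log10(factor_per_year)      # ~0.15 OOM/year
doublings_per_year = math.log2(factor_per_year)  # ~0.49 doublings/year
factor_per_decade = factor_per_year ** 10        # ~28.9x per decade

# Going the other way: a headline of "0.1 OOMs per year" is a factor of
# 10 ** 0.1 ~ 1.26x per year, i.e. about 26% yearly growth.
print(10 ** 0.1)  # -> 1.258925...
```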
Factor increase per year is how we now report growth rates by default in the dashboard.
And I agree it will be better interpreted by the public. On the other hand, multiplying numbers is hard, so it’s not as nice for mental arithmetic. And thinking logarithmically puts you in the right frame of mind.
Saying that GPT-4 was trained on 100x more compute than GPT-3 suggests that it is 100 times better, whereas I think saying it was trained on 2 OOMs more compute gives you a better picture of the expected improvement.
I might be wrong here.
In any case, it is still a better choice than doubling times.