I like this post a lot but I will disobey Rapoport’s rules and dive straight into criticism.
Historically, many AI researchers believed that creating general AI would be more about coming up with the right theories of intelligence, but over and over again, researchers eventually found that impressive results only came after the price of computing fell far enough that simple, “blind” techniques began working (Sutton 2019).
I think this is a poor way to describe a reasonable underlying point. Heavier-than-air flying machines were pursued for centuries, but airplanes appeared almost instantly (on a historic scale) after the development of engines with sufficient power density. Nonetheless, it would be confusing to say “flying is more about engine power than the right theories of flight”. Both are required. Indeed, although the Wright brothers were enabled by the arrival of powerful engines, they beat out other would-be inventors (Ader, Maxim, and Langley) who emphasized engine power over flight theory. So a better version of your claim has to be something like “compute quantity drives algorithmic ability; if we independently vary compute (e.g., imagine an exogenous shock) then algorithms follow along”, which (I think) is what you are arguing further in the post.
But this also doesn’t seem right. As you observe, algorithmic progress has been comparable to compute progress (both within and outside of AI). You list three “main explanations” for where algorithmic progress ultimately comes from and observe that only two of them explain the similar rates of progress in algorithms and compute. But both of these draw a causal path from compute to algorithms without considering the (to-me-very-natural) explanation that some third thing is driving them both at a similar rate. There are a lot of options for this third thing! Researcher-to-researcher communication timescales, the growth rate of the economy, the individual learning rate of humans, new tech adoption speed, etc. It’s plausible to me that compute and algorithms are currently improving more or less as fast as they can, given their human intermediaries through one or all of these mechanisms.
The causal structure is key here, because the whole idea is to try and figure out when economic growth rates change, and the distinction I’m trying to draw becomes important exactly around the time that you are interested in: when the AI itself is substantially contributing to its own improvement. Because then those contributions could be flowing through at least three broad intermediaries: algorithms (the AI is writing its own code better), compute (the AI improves silicon lithography), or the wider economy (the AI creates useful products that generate money which can be poured into more compute and human researchers).
Of course, even if AI performance is, in principle, predictable as a function of scale, we lack data on how AIs are currently improving on the vast majority of tasks in the economy, hindering our ability to predict when AI will be widely deployed. While we hope this data will eventually become available, for now, if we want to predict important AI capabilities, we are forced to think about this problem from a more theoretical point of view.
Humans have been automating mechanical tasks for many centuries, and information-processing tasks for many decades. Moore’s law, the growth rate of the thing (compute) that you argue drives everything else, has been stated explicitly for almost 58 years (and was presumably applicable for at least a few decades before that). Why are you drawing a distinction between all the information processing that happened in the past and “AI”, which you seem to be taking as a basket of things that have mostly not had a chance to be applied yet (so no data)?
If compute is the central driving force behind AI, and transformative AI (TAI) comes out of something looking like our current paradigm of deep learning, there appear to be a small set of natural parameters that can be used to estimate the arrival of TAI. These parameters are:
The total training compute required to train TAI
The average rate of growth in spending on the largest training runs, which plausibly hits a maximum value at some significant fraction of GWP
The average rate of increase in price-performance for computing hardware
The average rate of growth in algorithmic progress
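A minimal sketch (not from the post) of how these four parameters might combine into an arrival-date estimate, assuming spending, hardware price-performance, and algorithmic efficiency all compound exponentially and spending is capped at a fraction of GWP; every numerical value below is a placeholder chosen purely for illustration:

```python
# Toy illustration of combining the four parameters listed above.
# All numbers are placeholders, not estimates from the post.

required_flop_2023 = 1e32      # training compute needed for TAI, in 2023-algorithm terms
spend_2023 = 1e8               # dollars spent on the largest training run today (placeholder)
spend_cap = 1e12               # maximum spending, some fraction of GWP (placeholder)
spend_growth = 2.5             # yearly multiplier on largest-run spending
price_perf_2023 = 1e10         # FLOP per dollar (placeholder)
price_perf_growth = 1.35       # yearly improvement in hardware price-performance
algo_growth = 1.7              # yearly multiplier in effective compute from algorithmic progress

def effective_training_flop(years_from_2023: int) -> float:
    """Largest feasible training run, in 2023-algorithm-equivalent FLOP, after t years."""
    spend = min(spend_2023 * spend_growth ** years_from_2023, spend_cap)
    physical_flop = spend * price_perf_2023 * price_perf_growth ** years_from_2023
    # Algorithmic progress treated as a multiplier on effective compute.
    return physical_flop * algo_growth ** years_from_2023

# Find the first year in which the largest run crosses the assumed TAI threshold.
for t in range(80):
    if effective_training_flop(t) >= required_flop_2023:
        print(f"Threshold crossed around {2023 + t}")
        break
```

In this toy model the growth rates simply multiply, so the four parameters suffice for a timing estimate while saying nothing yet about economic growth, which is the gap the next comment points out.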
This list is missing the crucial parameters that would translate the others into what we agree is most notable: economic growth. I think this needs to be discussed much more in section 4 for it to be a useful summary/invitation to the models you mention.
I think this is a poor way to describe a reasonable underlying point. Heavier-than-air flying machines were pursued for centuries, but airplanes appeared almost instantly (on a historic scale) after the development of engines with sufficient power density. Nonetheless, it would be confusing to say “flying is more about engine power than the right theories of flight”. Both are required.
I agree. A better phrasing could have emphasized that, although both theory and compute are required, in practice the compute part seems to be the crucial bottleneck. The ‘theories’ that drive deep learning are famously pretty shallow, and most progress seems to come from tinkering, scaling, and writing more efficient code. I’m not aware of any major algorithmic contribution that depended on fundamental analysis from deep learning theory (though perhaps these happen all the time and I’m just not sufficiently familiar to know).
As you observe, algorithmic progress has been comparable to compute progress (both within and outside of AI). You list three “main explanations” for where algorithmic progress ultimately comes from and observe that only two of them explain the similar rates of progress in algorithms and compute. But both of these draw a causal path from compute to algorithms without considering the (to-me-very-natural) explanation that some third thing is driving them both at a similar rate. There are a lot of options for this third thing! Researcher-to-researcher communication timescales, the growth rate of the economy, the individual learning rate of humans, new tech adoption speed, etc.
I think the alternative theory of a common cause is somewhat plausible, but I don’t see any particular reason to believe in it. If there were a common factor that caused progress in computer hardware and algorithms to proceed at a similar rate, why wouldn’t other technologies that shared that cause grow at similar rates?
Hardware progress has been incredibly fast over the last 70 years—indeed, many people say that the speed of computers is by far the most salient difference between the world in 1973 and 2023. And yet algorithmic progress has apparently been similarly rapid, which seems hard to square with a theory of a general factor that causes innovation to proceed at similar rates. Surely there are such bottlenecks that slow down progress in both places, but the question is what explains the coincidence in rates.
Humans have been automating mechanical tasks for many centuries, and information-processing tasks for many decades. Moore’s law, the growth rate of the thing (compute) that you argue drives everything else, has been stated explicitly for almost 58 years (and was presumably applicable for at least a few decades before that). Why are you drawing a distinction between all the information processing that happened in the past and “AI”, which you seem to be taking as a basket of things that have mostly not had a chance to be applied yet (so no data)?
I expect innovation in AI in the future will take a different form than innovation in the past.
When innovating in the past, people generally found a narrow tool or method that improved efficiency in one narrow domain, without being able to substitute for human labor across a wide variety of domains. Occasionally, people stumbled upon general purpose technologies that were unusually useful across a variety of situations, although by and large these technologies are quite narrow compared to human resourcefulness. By contrast, I think it’s far more plausible that ML foundation models will allow us to create systems that can substitute for human labor across nearly all domains at once, after a sufficient scale is reached. This would happen because sufficiently capable foundation models can be cheaply fine-tuned to provide a competent worker for nearly any task.
This is essentially the theory that there’s something like “general intelligence” which causes human labor to be so useful across a very large variety of situations compared to physical capital. This is also related to transfer learning and task-specific bottlenecks that I talked about in Part 2. Given that human population growth seems to have caused a “productivity explosion” in the past (relative to pre-agricultural rates of growth), it seems highly probable that AI could do something similar in the future if it can substitute for human labor.
That said, I’m sympathetic to the model that says future AI innovation will happen without much transfer to other domains, more similar to innovation in the past. This is one reason (among many) that my final probability distribution in the post has a long tail extending many decades into the future.
I’m also a little surprised you think that modeling when we will have systems using similar compute to the human brain is very helpful for modeling when economic growth rates will change. (Like, for sure someone should be doing it, but I’m surprised you’re concentrating on it much.) As you note, the history of automation is one of smooth adoption. And, as I think Eliezer said (roughly), there don’t seem to be many cases where new tech was predicted based on when some low-level metric would exceed the analogous metric in a biological system. The key threshold for recursive feedback loops (*especially* compute-driven ones) is how well they perform on the relevant tasks, not all tasks. And the way in which machines perform tasks usually looks very different from how biological systems do it (birds vs. airplanes, etc.).
If you think that compute is the key bottleneck/driver, then I would expect you to be strongly interested in what the automation of the semiconductor industry would look like.
I’m also a little surprised you think that modeling when we will have systems using similar compute to the human brain is very helpful for modeling when economic growth rates will change.
In this post, when I mentioned human brain FLOP, it was mainly used as a quick estimate of AGI inference costs. However, different methodologies produce similar results (generally within 2 OOMs). A standard formula to estimate compute costs is 6*N per forward pass, where N is the number of parameters. Currently, the largest language models are estimated to have between 100 billion and 1 trillion parameters, which works out to 6e11 to 6e12 FLOP per forward pass.
The Chinchilla scaling law suggests that inference costs will grow at about half the rate of training compute costs. If we take the estimate of 10^32 training FLOP for TAI (in 2023 algorithms) that I gave in the post, which was itself partly based on the Direct Approach, then we’d expect inference costs to grow to something like 1e15-1e16 FLOP per forward pass, although I expect subsequent algorithmic progress will bring this figure down, depending on how much algorithmic progress translates into data efficiency vs. parameter efficiency. A remaining uncertainty here is how a single forward pass for a TAI model will compare to one second of inference for humans, although I’m inclined to think that they’ll be fairly similar.
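A rough sketch of the arithmetic above: the 6*N FLOP-per-forward-pass rule and the 1e32 training-FLOP figure are taken from the comment, while the Chinchilla-optimal assumption of roughly 20 training tokens per parameter is my added assumption for illustration.

```python
import math

def forward_pass_flop(n_params: float) -> float:
    """FLOP per forward pass, using the 6*N rule of thumb from the comment above."""
    return 6 * n_params

# Current largest models: roughly 1e11 to 1e12 parameters.
for n in (1e11, 1e12):
    print(f"N = {n:.0e}: ~{forward_pass_flop(n):.1e} FLOP per forward pass")

# Chinchilla-optimal training: C ~ 6*N*D with D ~ 20*N, so C ~ 120*N^2 and
# N ~ sqrt(C / 120). Parameters (and hence inference cost) grow roughly as the
# square root of training compute, i.e. at about half its logarithmic rate.
tai_training_flop = 1e32  # TAI training compute estimate (2023 algorithms), from the post
n_tai = math.sqrt(tai_training_flop / 120)
print(f"Chinchilla-optimal N at 1e32 FLOP: ~{n_tai:.1e} parameters")
print(f"Implied inference cost: ~{forward_pass_flop(n_tai):.1e} FLOP per forward pass")
```

On these assumptions the implied inference cost comes out around 5e15 FLOP per forward pass, consistent with the 1e15-1e16 range above.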
And, as I think Eliezer said (roughly), there don’t seem to be many cases where new tech was predicted based on when some low-level metric would exceed the analogous metric in a biological system. [...] And the way in which machines perform tasks usually looks very different from how biological systems do it (birds vs. airplanes, etc.).
From Birds, Brains, Planes, and AI:

This data shows that Shorty [hypothetical character introduced earlier in the post] was entirely correct about forecasting heavier-than-air flight. (For details about the data, see appendix.) Whether Shorty will also be correct about forecasting TAI remains to be seen.
In some sense, Shorty has already made two successful predictions: I started writing this argument before having any of this data; I just had an intuition that power-to-weight is the key variable for flight and that therefore we probably got flying machines shortly after having power-to-weight comparable to bird muscle. Halfway through the first draft, I googled and confirmed that yes, the Wright Flyer’s motor was close to bird muscle in power-to-weight. Then, while writing the second draft, I hired an RA, Amogh Nanjajjar, to collect more data and build this graph. As expected, there was a trend of power-to-weight improving over time, with flight happening right around the time bird-muscle parity was reached.
I listed this example in my comment, it was incorrect by an order of magnitude, and it was a retrodiction. “I didn’t look up the data on Google beforehand” does not make it a prediction.
Yeah sorry, I didn’t mean to say this directly contradicted anything you said. It just felt like a good reference that might be helpful to you or other people reading the thread. (In retrospect, I should have said that and/or linked it in response to the mention in your top-level comment instead.)
(Also, personally, I do care about how much effort and selection is required to find good retrodictions like this, so in my book “I didn’t look up the data on Google beforehand” is relevant info. But it would have been way more impressive if someone had been able to pull that off in 1890, and I agree this shouldn’t be confused for that.)
Re “it was incorrect by an order of magnitude”: that seems fine to me. If we could get that sort of precision for predicting TAI, that would be awesome and outperform any other prediction method I know about.