Hey Aaron, thanks for your thorough comment. While we still disagree (explained a bit below), I’m also quite glad to read your comment :)
Re scaling current methods: The hundreds of billions figure we quoted does require more context not in our piece; SemiAnalysis explains in a bit more detail how they get to that number (eg assuming training in 3mo instead of 2 years). We don’t want to haggle over the exact scale before it becomes infeasible, though—even if we get another 2 OOM in, we wanted to emphasize with our argument that ‘the current method route’ 1) requires regular scientific breakthroughs of the pre-TAI sort, and 2) even if we get there doesn’t guarantee capabilities that look like magic compared to what we have now, depending on how much you believe in emergence. Both would be bottlenecks. We’re pretty sure that current capabilities can be economically useful with more people, more fine-tuning. Just skeptical of the sudden emergence of the exact capabilities we need for transformative growth.
On Epoch’s work on algorithmic progress specifically, we think it’s important to note that:
1) They do this by measuring progress on computer vision benchmarks, which isn’t a good indicator of progress in algorithms for control (the physical world matters for TAI) or even language—it might be cheeky to say there’s been little algorithmic progress there, just scale ;) Computer vision is also the exact example Schaeffer et al. give of a subfield where emergent abilities do not arise—until you induce them by intentionally crafting the evaluations.
2) That there even is a well-defined benchmark is a good sign for beating that benchmark. AI benefits from quantifiable evaluation (beating a world champion, CASP scores) when it measures what we want. But we’d say for really powerful AI we don’t know what we want (see our wrong direction / philosophy hurdle), plus at some point the quantifiable metrics we do have stop measuring what we really want. (Is there really a difference between models that get 91.0 and 91.1 top-1 accuracy on ImageNet? Do people really look at MMLU over qualitative experience when they choose which language model to play with?)
3) We don’t discount algorithmic progress at all! In fact we cite SemiAnalysis and the Epoch team’s suggestions on where to research next. But again, these require human breakthroughs, bottlenecked on human research timescales—we don’t have a step-by-step process we can just follow to improve a metric all the way to TAI, so hard-won past breakthroughs don’t guarantee future ones happen at the same clip.
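On point 2 above, the 91.0-vs-91.1 question can be given numbers: with a finite test set, accuracy estimates carry sampling noise, and a 0.1-point gap on ImageNet’s 50,000-image validation set is well within it. A quick back-of-the-envelope check (standard binomial error; treating the two models’ errors as independent, which is a rough simplification):

```python
import math

def accuracy_std_error(acc: float, n: int) -> float:
    """Standard error of an accuracy estimate from n i.i.d. test examples."""
    return math.sqrt(acc * (1 - acc) / n)

n = 50_000                       # ImageNet-1k validation set size
se = accuracy_std_error(0.91, n)
gap = 0.911 - 0.910

# Std error of the difference between two independent accuracy estimates.
se_diff = math.sqrt(2) * se
print(f"std error per model: {se:.4f}")
print(f"gap in units of std error: {gap / se_diff:.2f}")
```

The gap comes out well under one standard error, i.e. the two models are statistically indistinguishable on that metric alone.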
Re Constitutional AI: We agree that researchers will continue searching for ways to use human feedback more efficiently. But under our Baumol framework, the important step is going from one to zero, not n to one. And there we find it hard to believe that in high-stakes situations (say, judging AI debates), safety researchers will be willing to hand over the reins. We’d also really contest that ‘performs very similarly to human raters’ is enough—it’d be surprising if we already have a free-lunch, no-information-lost way to simulate humans well enough to make better AI.
Re 2025 language models equipped with search: For this to be as useful as a panel of experts, the models need to be searching an index where what the experts know is recorded, in some sense, which 1) doesn’t happen (experts are busy being experts), 2) is sometimes impossible (chef, LeBron), and 3) may be even less likely in the future, when an LLM is going to just hoover up your hard-won expertise. I know you mentioned you don’t disagree with our point here, though.
Re motte and bailey: We agree that our hurdles may have overlap. But the point of our Baumol framework is that each valid hurdle (so long as we don’t know whether it’s fundamentally the same problem behind the others) has the potential to bottleneck transformative growth on its own. And we allude to several cases where, for one reason or another, a promising invention did not meet expectations precisely because it could not clear every hurdle.
Hope this clarifies our view. It’s not conclusive, of course; like your piece, we’re happy to be going for intuition pumps that temper expectations.
Thanks for your response. I’ll just respond to a couple things.
Re Constitutional AI: I agree normatively that it seems bad to hand over judging AI debates to AIs[1]. I also think this will happen. To quote from the original AI Safety via Debate paper,
Human time is expensive: We may lack enough human time to judge every debate, which we can address by training ML models to predict human reward as in Christiano et al. [2017]. Most debates can be judged by the reward predictor rather than by the humans themselves. Critically, the reward predictors do not need to be as smart as the agents by our assumption that judging debates is easier than debating, so they can be trained with less data. We can measure how closely a reward predictor matches a human by showing the same debate to both.
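The last line of that quote, comparing the reward predictor and a human on the same debates, is easy to operationalize as an agreement rate (plus Cohen’s kappa to correct for chance agreement). A toy sketch; the verdicts below are invented for illustration, and in practice you’d need real human judgments:

```python
from collections import Counter

def agreement_and_kappa(human, predictor):
    """Raw agreement and Cohen's kappa between two judges' verdicts."""
    n = len(human)
    observed = sum(h == p for h, p in zip(human, predictor)) / n
    # Chance agreement from each judge's marginal verdict frequencies.
    ph, pp = Counter(human), Counter(predictor)
    labels = set(human) | set(predictor)
    expected = sum((ph[k] / n) * (pp[k] / n) for k in labels)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical verdicts on ten debates (1 = first debater won).
human     = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
predictor = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
obs, kappa = agreement_and_kappa(human, predictor)
print(f"agreement: {obs:.0%}, kappa: {kappa:.2f}")  # agreement: 80%, kappa: 0.58
```

The paper’s proposal amounts to running exactly this kind of check at scale and only trusting the predictor where agreement is high.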
Re
We’d also really contest that ‘performs very similarly to human raters’ is enough—it’d be surprising if we already have a free-lunch, no-information-lost way to simulate humans well enough to make better AI.
I also find this surprising, or at least I did the first 3 times I came across medium-quality evidence pointing in this direction. I don’t find it as surprising anymore, because I’ve updated my understanding of the world to “welp, I guess 2023 AIs actually are that good on some tasks.” Rather than making arguments to try to convince you, I’ll just link some of the evidence that I have found compelling; maybe you will too, maybe not: Model Written Evals, MACHIAVELLI benchmark, Alpaca (maybe the most significant for my thinking), this database, Constitutional AI.
I’m far from certain that this trend, of LLMs being useful for making better LLMs and for replacing human feedback, continues rather than hitting a wall in the next 2 years, but it does seem more likely than not to me, based on my read of the evidence. Some important decisions in my life rely on how soon this AI stuff is happening (for instance, if we have 20+ years I should probably aim to do policy work), so I’m pretty interested in having correct views. Currently, LLMs improving the next generation of AIs via more and better training data is one of the key factors in how I’m thinking about this. If you don’t find these particular pieces of evidence compelling and are able to explain why, that would be useful to me!
I’m actually unsure here. I expect there are some times where it’s fine to have no humans in the loop and other times where it’s critical. It generally gives me the ick to take humans out of the loop, but I expect there are some times where I would think it’s correct.
Makes sense that this would be a big factor in deciding what to do with our time, and in AI timelines. And we’re surprised too by how AI can outperform expectations, as in the sources you cited.
We’d still say the best way of characterizing the problem of creating synthetic data is as a wide-open problem, rather than something we have high confidence that naive approaches using current LMs will just solve. How about a general intuition instead of parsing individual sources? We wouldn’t expect making the dataset bigger by just repeating the same example over and over to work. We generate data by having ‘models’ of the original data generators, humans. If we knew exactly what made human data ‘good,’ we could optimize directly for it and simplify massively (this runs into the well-defined eval problem again—we can craft datasets to beat benchmarks, of course).
An analogy (a disputed one, to be fair) is Ted Chiang’s lossy compression. So for every case of synthetic data working, there are also cases where it fails, like the Shumailov et al. result we cited. And if labs knew exactly what made human data ‘good,’ we’d argue they wouldn’t continue to ramp up hiring contractors specifically to generate high-quality data in expert domains, like programming.
A fun exercise—take a very small open-source dataset, train your own very small LM, and have it augment (double!) its own dataset. Try different prompts, plot n-gram distributions vs. the original data. Can you get one behavior out of the next generation that looks like magic compared to the previous, or does improvement plateau? You may have nitpicks with this experiment, but I don’t think it’s that different from what’s happening at large scale.
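For anyone who wants to run a miniature version of this, here’s a sketch using a character-level bigram ‘LM’ (far weaker than any neural model, so treat it as an intuition pump rather than the real experiment): each generation trains only on the previous generation’s output, and we count how many bigrams are novel relative to the original data.

```python
import random
from collections import Counter, defaultdict

def bigrams(text):
    """All consecutive character pairs in text."""
    return set(zip(text, text[1:]))

def train(text):
    """Character-level bigram model: counts of next char given current char."""
    model = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        model[a][b] += 1
    return model

def sample(model, length, rng):
    """Generate text by walking the bigram transition counts."""
    out = [rng.choice(sorted(model))]
    for _ in range(length - 1):
        nxt = model.get(out[-1])
        if not nxt:
            break  # dead end: this char never had a successor in training
        chars, weights = zip(*sorted(nxt.items()))
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

rng = random.Random(0)
orig = "the quick brown fox jumps over the lazy dog " * 20
orig_bigrams = bigrams(orig)

# Each 'generation' trains only on the previous generation's output.
data, novel_per_gen, unique_per_gen = orig, [], []
for gen in range(4):
    data = sample(train(data), len(orig), rng)
    novel_per_gen.append(len(bigrams(data) - orig_bigrams))
    unique_per_gen.append(len(bigrams(data)))
print("novel bigrams per generation:", novel_per_gen)    # all zeros
print("unique bigrams per generation:", unique_per_gen)  # non-increasing
```

By construction the model can only emit transitions it saw in training, so no novel bigrams ever appear and diversity can only shrink across generations—the plateau (or collapse, per Shumailov et al.) rather than the magic.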
Re scaling current methods: The hundreds of billions figure we quoted does require more context not in our piece; SemiAnalysis explains in a bit more detail how they get to that number (eg assuming training in 3mo instead of 2 years).
That’s hundreds of billions with current hardware. (Actually, not even current hardware, but the A100 which is last-gen; the H100 should already do substantially better.) But HW price-performance currently doubles every ~2 years. Yes, Moore’s Law may be slowing, but I’d be surprised if we don’t get another OOM improvement in price-performance during the next decade, especially given the insatiable demand for effective compute these days.
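For a sense of the arithmetic behind that claim (the ~2-year doubling time is the assumption stated above, not a measured figure):

```python
import math

doubling_time_years = 2.0   # assumed price-performance doubling time
years = 10

doublings = years / doubling_time_years
improvement = 2 ** doublings
print(f"{doublings:.0f} doublings in {years} years -> {improvement:.0f}x")

# How long does one order of magnitude (10x) take at that rate?
years_per_oom = doubling_time_years * math.log2(10)
print(f"one OOM takes about {years_per_oom:.1f} years")
```

So at that pace a decade gives ~32x, comfortably more than the one OOM claimed.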
We don’t want to haggle over the exact scale before it becomes infeasible, though—even if we get another 2 OOM in, we wanted to emphasize with our argument that ‘the current method route’ 1) requires regular scientific breakthroughs of the pre-TAI sort, and 2) even if we get there doesn’t guarantee capabilities that look like magic compared to what we have now, depending on how much you believe in emergence. Both would be bottlenecks.
Yeah, I agree things would be a lot slower without algorithmic breakthroughs. Those do seem to be happening at a pretty good pace though (not just looking at ImageNet, but also looking at ML research subjectively). I’d assume they’ll keep happening at the same rate so long as the number of people (and later, possibly AIs) focused on finding them keeps growing at the same rate.