Maybe or maybe not—people also thought we would run out of training data years ago. But that has been pushed back and maybe won’t really matter given improvements in synthetic data, multimodal learning, and algorithmic efficiency.
What part do you think is uncertain? Do you think RL training could become orders of magnitude more compute efficient?
Thank you, Toby et al., for this characteristically clear and compelling analysis and discussion. The argument that RL scaling is breathtakingly inefficient and may be hitting a hard limit is a crucial consideration for timelines.
This post made me think about the nature of this bottleneck, and I’m curious to get the forum’s thoughts on a high-level analogy. I’m not an ML researcher, so I’m offering this with low confidence, but it seems to me there are at least two different “types” of hard problems.
1. A Science Bottleneck (Fusion Power): Here, the barrier appears to be fundamental physics. We need to contain a plasma that is inherently unstable at temperatures hotter than the sun. Despite decades of massive investment and brilliant minds, we can’t easily change the underlying laws of physics that make this so difficult. Progress is slow, and incentives alone can’t force a breakthrough.
2. An Engineering Bottleneck (Manhattan Project): Here, the core scientific principle was known (nuclear fission). The barrier was a set of unprecedented engineering challenges: how to enrich enough uranium, how to build a stable reactor, etc. The solution, driven by immense incentives, was a brute-force, parallel search for any viable engineering path (e.g., pursuing gaseous diffusion, electromagnetic separation, and plutonium production all at once).
This brings me back to the RL scaling issue. I’m wondering which category this bottleneck falls into.
From the outside, it feels more like an engineering or “Manhattan Project” problem. The core scientific discovery (the Transformer architecture, the general scaling paradigm) seems to be in place. The bottleneck Ord identifies is that one specific method (RL, likely PPO-based) is significantly compute-inefficient and hard to continue scaling.
But the massive commercial incentives at frontier labs aren’t just to make this one inefficient method 1,000 or 1,000,000x bigger. The incentive is to invent new, more efficient methods to achieve the same goal or similar.
We’ve already seen a small-scale example of this with the rapid shift from complex RLHF to the more efficient Direct Preference Optimization (DPO). This suggests the problem may not be a fundamental “we can’t continue to improve models” barrier, but an engineering one: “this way of improving models is too expensive and unstable.”
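To make concrete why DPO is seen as so much simpler, here is a minimal sketch of the DPO objective. This is my own toy illustration, not code from the post or from any lab; the point is just that the loss needs only log-probabilities of preferred and rejected responses under the policy and a frozen reference model, with no separately trained reward model and no on-policy rollouts of the kind PPO-style RLHF requires.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Toy Direct Preference Optimization loss.

    Each argument is a tensor of summed log-probabilities of the chosen /
    rejected response under the policy or the frozen reference model.
    """
    # How much more the policy prefers each response than the reference does.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps

    # DPO objective: a logistic loss on the implied reward margin.
    margin = beta * (chosen_rewards - rejected_rewards)
    return -F.logsigmoid(margin).mean()

# Example with made-up log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -20.0]), torch.tensor([-15.0, -19.0]),
                torch.tensor([-13.0, -21.0]), torch.tensor([-14.0, -20.0]))
print(loss)
```

The contrast with PPO-based RLHF, which needs a learned reward model, a value model, and fresh samples from the policy at every step, is what people mean when they call DPO more efficient; how far that kind of simplification generalizes is exactly what is in question.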
If this analogy holds, it seems plausible that the proprietary work at the frontier isn’t just grinding on the inefficient RL problem, but is in a “Manhattan”-style race to find a new algorithm or architecture that bypasses this specific bottleneck.
This perspective makes me less confident that this particular bottleneck will be the one that indefinitely pushes out timelines, as it seems like exactly the kind of challenge that massive, concentrated incentives are historically good at solving.
I could be completely mischaracterizing the nature of the challenge, though, and still feel quite uncertain. I’d be very interested to hear from those with more technical expertise if this framing seems at all relevant or if the RL bottleneck is, in fact, closer to a fundamental science or “Fusion” problem.
I understand where you are coming from, and you are certainly not alone in trying to think about AI progress in terms of analogies like this. But I want to explain why I don’t think such discussions — which are common — will take us down a productive route.
I don’t think anyone can predict the future based on this kind of reasoning. We can guess and speculate, but that’s it. It’s always possible, in principle, that, at any time, new knowledge could be discovered that would have a profound impact on technology. How likely is that in any particular case? Very hard to say. Nobody really knows.
There are many sorts of technologies that at least some experts think should, in principle, be possible and for which, at least in theory, there is a very large incentive to create, yet still haven’t been created. Fusion is a great example, but I don’t think we can draw a clean distinction between “science” on the one hand and “engineering” on the other and say science is much harder and engineering is much easier, such that if we want to find out how hard it will be to make fundamental improvements in reinforcement learning, we just have to figure out whether it’s a “science” problem or an “engineering” problem. There’s a vast variety of science problems and engineering problems. Some science problems are much easier than some engineering problems. For example, discovering a new exoplanet and figuring out its properties, a science problem, is much easier than building a human settlement on Mars, an engineering problem.
What I want to push back against in EA discourse and AGI discourse more generally is unrigorous thinking. In the harsh but funny words of the physicist David Deutsch (emphasis added):

The search for hard-to-vary explanations is the origin of all progress. It’s the basic regulating principle of the Enlightenment. So, in science, two false approaches blight progress. One’s well-known: untestable theories. But the more important one is explanationless theories. Whenever you’re told that some existing statistical trend will continue but you aren’t given a hard-to-vary account of what causes that trend, you’re being told a wizard did it.
If LLM performance on certain tasks has improved over the last 7 years for certain specific reasons, and now we have good reasons to think those reasons for improvement won’t be there much longer going forward, then of course you can say, “Well, maybe someone will come up with some other way to keep progress going!” Maybe they will, maybe they won’t, who knows? We’ve transitioned from a rigorous argument based on evidence about LLM performance (mainly performance on benchmarks) and a causal theory of what accounts for that progress (mainly scaling of data and compute) to a hand-wavey idea about how maybe somebody will figure it out. Scaling you can track on a chart, but somebody figuring out a new idea like that is not something you can rigorously say will take 2 years or 20 years or 200 years.
The unrigorous move is:
There is a statistical trend that is caused by certain factors. At some point fairly soon, those factors will not be able to continue causing the statistical trend. But I want to keep extrapolating the statistical trend, so I’m going to speculate that new causal factors will appear that will keep the trend going.
That doesn’t make any sense!
To be fair, people do try to reach for explanations of why new causal factors will appear, usually appealing to increasing inputs to AI innovation, such as the number of AI researchers, the number of papers published, improvements in computer hardware, and the amount of financial investment. But we currently have no way of predicting how exactly the inputs to science, technology, or engineering will translate into the generation of new ideas that keep progress going. We can say more is better, but we can’t say X amount of dollars, Y amount of researchers, and Z amount of papers is enough to continue recent LLM progress. So, this is not a rigorous theory or model or explanation, but just a hand-wavey guess about what might happen. And to be clear, it might happen, but it might not, and we simply don’t know! (And I don’t think there’s any rigour or much value in the technique of trying to squeeze Bayesian blood from a stone of uncertainty by asking people to guess numbers of how probable they think something is.)
So, it could take 2 years and it could take 20 years (or 200 years), and we don’t know, and we probably can’t find out any other way than just waiting and seeing. But how should we act, given that uncertainty?
Well, how would we have acted if LLMs had made no progress over the last 7 years? The same argument would have applied: anyone at any time could come up with the right ideas to make AI progress go forward. Making reinforcement learning orders of magnitude more efficient is something someone could have done 7 years ago. It more or less has nothing to do with LLMs. Absent the progress in LLMs, would we have thought: “oh, surely, somebody’s going to come up with a way to make RL vastly more efficient sometime soon”? Probably not, so we probably shouldn’t think that now. If the reason for thinking that is just wanting to keep extrapolating the statistical trend, that’s not a good reason.
There is more investment in AI now in the capital markets, but, as I said above, that doesn’t allow us to predict anything specific. Moreover, it seems like very little of the investment is going to fundamental AI research. It seems like almost all the money is going toward expenses much more directly relating to productizing LLMs, such as incremental R&D, the compute cost of training runs, and building datacentres (plus all the other expenses related to running a large tech company).
YB, thank you for the pushback. You’ve absolutely convinced me that my “science vs. engineering” analogy was unrigorous, and your core point about extrapolating a trend by assuming a new causal factor will appear is the correct null hypothesis to hold.
What I’m still trying to reconcile, specifically regarding RL efficiency improvements, is a tension between what we can observe and what may be hidden from view.
I expect Toby’s calculations are 100% correct. Your case is also rigorous and evidence-based: RL has been studied for decades, PPO (2017) was incremental, and we shouldn’t assume 10x-100x efficiency gains without evidence. The burden of proof is on those claiming breakthroughs are coming.
But RL research seems particularly subject to information asymmetry:
• Labs have strong incentives to keep RL improvements proprietary (competitive advantage in RLHF, o1-style reasoning, agent training)
• Negative results rarely get published (we don’t know what hasn’t worked)
• The gap between “internal experiments” and “public disclosure” may be especially long for RL
We’ve seen this pattern before—AlphaGo’s multi-year information lag, GPT-4’s ~7-month gap. But for RL specifically, the opacity seems greater. OpenAI uses RL for o1, but we don’t know their techniques, efficiency gains, or scaling properties. DeepMind’s work on RL is similarly opaque.
This leaves me uncertain about future RL scaling specifically. On one hand, you’re right that decades of research suggest efficiency improvements are hard. On the other hand, recent factors (LLMs as reward models, verifiable domains for self-play, unprecedented compute for experiments) combined with information asymmetry make me wonder if we’re reasoning from incomplete data.
The specific question: Does the combination of (a) new factors like LLMs/verifiable domains, plus (b) the opacity and volume of RL research at frontier labs, warrant updating our priors on RL efficiency? Or is this still the same “hand-waving” trap—just assuming hidden progress exists because we expect the trend to continue?
On the action-relevant side: if RL efficiency improvements would enable significantly more capable agents or self-improvement, should safety researchers prepare for that scenario despite epistemic uncertainty? The lead times for safety work seem long enough that “wait and see” may not be viable.
For falsifiability: we should know within 18-24 months. If RL-based systems (agents, reasoners) don’t show substantial capability gains despite continued investment, that would validate skepticism. If they do, it would suggest there were efficiency improvements we couldn’t see from outside.
I’m genuinely uncertain here and would value a better sense of whether the information asymmetry around RL research specifically changes how we should weigh the available evidence.
I guess there could have recently been a major breakthrough in RL at any of the major AI companies that the public doesn’t know about yet. Or there could be one soon that we wouldn’t know about right away. But why think that is the case? And why think that is more likely at this particular point in time than at any other time within the last 10 years or so?
Can you explain what “LLMs as reward models” and “verifiable domains for self-play” mean and why these would make RL dramatically more compute-efficient? I’m guessing that “LLMs as reward models” means that the representational power of LLMs is far greater than that of RL agents in the past. But hasn’t RLHF been used on LLMs since before the first version of ChatGPT? So wouldn’t our idea of how quickly LLMs learn or improve using RL from the past 3 years or so already account for LLMs as reward models?
By “verifiable domains for self-play”, do you mean we have benchmarks or environments that are automatically gradable and can provide a reward signal without a human manually taking any action? If so, again, that seems like something that should already be accounted for in the last 3 years or so of data.
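To make sure we’re picturing the same thing, here’s a toy example of what I understand an automatically gradable reward to be. This is purely my own illustration, not something from the post or from any paper:

```python
import re

def math_answer_reward(model_output: str, correct_answer: str) -> float:
    """Toy 'verifiable domain' reward: grade a math answer with no human in the loop.

    The model is asked to end its response with 'Answer: <number>'.
    """
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", model_output)
    if match is None:
        return 0.0  # no parseable final answer
    return 1.0 if float(match.group(1)) == float(correct_answer) else 0.0

# The reward signal comes from checking the answer, not from a human rater.
print(math_answer_reward("Let me think... Answer: 42", "42"))  # 1.0
print(math_answer_reward("Answer: 41", "42"))                  # 0.0
```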
If what you’re saying is that LLMs as reward models or verifiable domains for self-play could contribute to research or innovation in RL such that a major breakthrough in RL compute efficiency is more likely, I don’t follow the reasoning there.
You also mentioned “unprecedented compute for experiments”, which could perhaps be a factor that contributes to the likelihood of such a breakthrough, but who knows. Why couldn’t you test an idea for more compute-efficient RL on a small number of GPUs first and see if you get early results? Why would having a lot more GPUs help? With a lot of GPUs, you could test more ideas in parallel, but is the limiting factor really the ability to test ideas or is it coming up with new ideas in the first place?
Yarrow, these are fantastic, sharp questions. Your “already accounted for” point is the strongest counter-argument I’ve encountered.
You’re correct in your interpretation of the terms. And your core challenge—if LLM reward models and verifiable domains have existed for ~3 years, shouldn’t their impact already be visible?—is exactly what I’m grappling with.
Let me try to articulate my hypothesis more precisely:
The Phase 1 vs Phase 2 distinction:
I wonder if we’re potentially conflating two different uses of RL that might have very different efficiency profiles:
1. Phase 1 (Alignment/Style): This is the RLHF that created ChatGPT—steering a pretrained model to be helpful/harmless. This has been done for ~3 years and is probably what’s reflected in public benchmark data.
2. Phase 2 (Capability Gains): This is using RL to make models fundamentally more capable at tasks through extended reasoning or self-play (e.g., o1, AlphaGo-style approaches).
My uncertainty is: could “Phase 2” RL have very different efficiency characteristics than “Phase 1”?
Recent academic evidence:
Some very recent papers seem to directly address this question:
• A paper by Khatri et al., “The Art of Scaling Reinforcement Learning Compute for LLMs” (arXiv: 2510.13786), appears to show that simple RL methods do hit hard performance ceilings (validating your skepticism), but that scaling RL is a complex “art.” It suggests a specific recipe (ScaleRL) can achieve predictable scaling. This hints the bottleneck might be “know-how” rather than a fundamental limit.
• Another paper by Tan et al., “Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning” (arXiv: 2509.25300), on scaling RL for math found that performance is more bound by data quality (like from verifiable domains) than just compute, and that larger models are more compute- and sample-efficient at these tasks.
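To make the “hard ceiling vs. predictable scaling” idea concrete, here is a rough sketch of what fitting a saturating compute-performance curve looks like. The functional form, parameter names, and numbers below are my own assumptions for illustration; they are not taken from either paper:

```python
import numpy as np
from scipy.optimize import curve_fit

def saturating_performance(log10_compute, ceiling, midpoint, steepness):
    """Toy saturating curve in log-compute: performance rises with RL compute
    but approaches a ceiling rather than growing without bound."""
    return ceiling / (1.0 + np.exp(-steepness * (log10_compute - midpoint)))

# Made-up (log10 compute, pass-rate) observations, for illustration only.
log10_compute = np.array([19.0, 19.5, 20.0, 20.5, 21.0, 21.5])
pass_rate = np.array([0.22, 0.31, 0.40, 0.47, 0.52, 0.55])

params, _ = curve_fit(saturating_performance, log10_compute, pass_rate,
                      p0=[0.6, 20.0, 1.0])
ceiling, midpoint, steepness = params

# The fitted ceiling is the estimated asymptote for this particular recipe;
# a better recipe would show up as a higher ceiling, or as a curve that bends
# upward at less compute, which is the kind of comparison this work makes.
print(f"estimated ceiling: {ceiling:.2f}")
print(f"predicted pass rate at 1e22 FLOP: "
      f"{saturating_performance(22.0, *params):.2f}")
```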
Why this seems relevant:
This research suggests “Phase 1” RL (simple, public methods) and “Phase 2” RL (complex recipes, high-quality data, large models) might have quite different scaling properties.
This makes me wonder if the scaling properties from prior RL research might not fully capture what’s possible in this new regime: very large models + high-quality verifiable domains + substantial compute + the right training recipe. Prior research isn’t irrelevant, but perhaps extrapolation from it is unreliable when the conditions are changing this much?
If labs have found (or are close to finding) these “secret recipes” for scalable RL, that could explain continued capital investment from well-informed actors despite public data showing plateaus.
The action-relevant dilemma:
Even granting the epistemic uncertainty, there seems to be a strategic question: Given long lead times for safety research, should researchers hedge by preparing for RL efficiency improvements, even if we can’t confidently predict them?
The asymmetry: if we wait for public evidence before starting safety work, and RL does become substantially more efficient (because a lab finds the right “recipe”), we’ll have even less lead time. But if we prepare unnecessarily, we’ve misallocated resources.
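One way to see the asymmetry is as a crude expected-cost comparison. The numbers below are placeholders I’m making up purely to show the structure of the argument, not estimates of anything:

```python
# Crude expected-cost framing of the "prepare vs. wait" choice.
# All numbers are made-up placeholders to show the structure, not estimates.
p_breakthrough = 0.2      # chance RL efficiency improves dramatically soon
cost_wasted_prep = 1.0    # cost of preparing for a scenario that never arrives
cost_unprepared = 20.0    # cost of facing the scenario with little lead time

expected_cost_if_prepare = (1 - p_breakthrough) * cost_wasted_prep
expected_cost_if_wait = p_breakthrough * cost_unprepared

print(f"prepare: {expected_cost_if_prepare:.1f}, wait: {expected_cost_if_wait:.1f}")
# With these placeholders, preparing is cheaper in expectation even though the
# breakthrough is unlikely; the point is the asymmetry in costs, not the numbers.
```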
I don’t have a clean answer to what probability threshold for a potential breakthrough justifies heightened precautionary work. But the epistemic uncertainty itself—combined with some papers suggesting the scaling regime might be fundamentally different than assumed—makes me worry that we’re evaluating the efficiency of propellers while jet engines are being invented in private.
Does this change your analysis at all, or do you think the burden of proof still requires more than theoretical papers about potential scaling regimes?
Thank you for your kindness. I appreciate it. :)
Do the two papers you mentioned give specific quantitative information about how much LLM performance increases as the compute used for RL scales? And is it a substantially more efficient scaling than what Toby Ord assumes in the post above?
In terms of AI safety research, this is getting into a very broad, abstract, general, philosophical point, but, personally, I’m fairly skeptical of the idea that anybody can do AI safety research today that will apply to much more powerful, much more general AI systems in the future. I guess if you think the more powerful, more general AI systems of the future will just be bigger versions of the type of systems we have today, then it makes sense why you’d think AI safety research would be useful now. But I think there are good reasons for doubting that, and LLM scaling running out of steam is just one of those good reasons.
To take a historical example, the Machine Intelligence Research Institute (MIRI) had some very specific ideas about AI safety and alignment dating back to before the deep learning revolution that started around 2012. I recall having an exchange with Eliezer Yudkowsky, who co-founded MIRI and does research there, on Facebook sometime around 2015-2017 where he expressed doubt that deep learning was the way to get to AGI and said his best bet was that symbolic AI was the most promising approach. At some point, he must have changed his mind, but I can’t find any writing he’s done or any talk or interview where he explains when and why his thinking changed.
In any case, one criticism — which I agree with — that has been made of Yudkowsky’s and MIRI’s current ideas about AI safety and alignment is that these ideas have not been updated in the last 13 years, and remain the same ideas that Yudkowsky and MIRI were advocating before the deep learning revolution. And there are strong reasons to doubt they still apply to frontier AI systems, if they ever did. What we would expect from Yudkowsky and MIRI at this point is either an updating of their ideas about safety and alignment, or an explanation of why their ideas developed with symbolic AI in mind should still apply, without modification, to deep learning-based systems. It’s hard to understand why this point hasn’t been addressed, particularly since people have been bringing it up for years. It comes across, in the words of one critic, as a sign of thinkers who are “persistently unable to update their priors.”
What I just said about MIRI’s views on AI safety and alignment could be applied to AI safety more generally. Ideas developed on the assumption that current techniques, architectures, designs, or paradigms will scale all the way to AGI could turn out to be completely useless and irrelevant if it turns out that more powerful and more general AI systems will be built using entirely novel ideas that we can’t anticipate yet. You used an aviation analogy. Let me try my own. Research on AI safety that assumes LLMs will scale to AGI and is therefore based on studying the properties peculiar to LLMs might turn out to be a waste of time if technology goes in another direction, just as aviation safety research that assumed airships would be the technology underlying air travel, and that focused on the properties of hydrogen and helium gas, has no relevance to a world where air travel is powered by airplanes that are heavier than air.
It’s relevant to bring up at this point that a survey of AI experts found that 76% of them think that it’s unlikely or very unlikely that current AI techniques, such as LLMs, will scale to AGI. There are many reasons to agree with the majority of experts on this question, some of which I briefly listed in a post here.
Because I don’t see scaling up LLMs as a viable path to AGI, I personally don’t see much value in AI safety research that assumes that it is a viable path. (To be clear, AI safety research that is about things like how LLM-based chatbots can safely respond to users who express suicidal ideation, and not be prompted into saying something harmful or dangerous, could potentially be very valuable, but that’s about present-day use cases of LLMs and not about AGI or global catastrophic risk, which is what we’ve been talking about.) In general, I’m very sympathetic to a precautionary, “better safe than sorry” approach, but, to me, AI safety or alignment research can’t even be justified on those grounds. The chance of LLMs scaling up to AGI seems so remote.
It’s also unlike the remote chance of an asteroid strike, where we have hard science that can be used to calculate that probability rigorously. It’s more like the remote chance that the Large Hadron Collider (LHC) would create a black hole, which can only be assigned a probability above zero because of fundamental epistemic uncertainty, i.e., based on the chance that we’ve gotten the laws of physics wrong. I don’t know if I can quite put my finger on why I don’t like a form of argument in favour of practical measures to mitigate existential risk based on fundamental epistemic uncertainty. I can point out that it would seem to have some very bizarre implications.
For example, what probability do we assign to the possibility that Christian fundamentalism is correct? If we assign a probability above zero, then this leads us literally to Pascal’s wager, because the utility of heaven is infinite, the disutility of hell is infinite, and the cost of complying with the Christian fundamentalist requirements for going to heaven is not only finite but relatively modest. Reductio ad absurdum?
By contrast, we know for sure dangerous asteroids are out there, we know they’ve hit Earth before, and we have rigorous techniques for observing them, tracking them, and predicting their trajectories. When NASA says there’s a 1 in 10,000 chance of an asteroid hitting Earth, that’s an entirely different kind of probability than if a Bayesian-utilitarian guesses there’s a 1 in 10,000 chance that Christian fundamentalism is correct, that the LHC will create a black hole, or that LLMs will scale to AGI within two decades.
One way I can try to articulate my dissatisfaction with the argument that we should do AI safety research anyway, just in case, is to point out there’s no self-evident or completely neutral or agnostic perspective from which to work on AGI safety. For example, what if the first AGIs we build would otherwise have been safe, aligned, and friendly, but by applying our alignment techniques developed from AI safety research, we actually make them incredibly dangerous and cause a global catastrophe? How do we know which kind of action is actually precautionary?
I could also make the point that, in some very real and practical sense, all AI research is a tradeoff between other kinds of AI research that could have been done instead. So, maybe instead of focusing on LLMs, it’s wiser to focus on alternative ideas like energy-based models, program synthesis, neuromorphic AI, or fundamental RL research. I think the approach of trying to squeeze Bayesian blood from a stone of uncertainty by making subjective guesses of probabilities can only take you so far, and pretty quickly the limitations become apparent.
To fully make myself clear and put my cards completely on the table, I don’t find effective altruism’s treatment of the topic of near-term AGI to be particularly intellectually rigorous or persuasive, and I suspect at least some people in EA who currently think very near-term AGI is very likely will experience a wave of doubt when the AI investment bubble pops sometime within the next few years. There is no external event, no evidence, and no argument that can compel someone to update their views if they’re inclined enough to resist updating, but I suspect there are some people in EA who will interpret the AI bubble popping as new information and will take it as an opportunity to think carefully about their views on near-term AGI.
But if you think that very near-term AGI is very likely, and if you think LLMs very likely will scale to AGI, then this implies an entirely different idea about what should be done, practically, in the area of AI safety research today, and if you’re sticking to those assumptions, then I’m the wrong person to ask about what should be done.
Yarrow, thank you for this sharp and clarifying discussion.
You have completely convinced me that my earlier arguments from “investment as a signal” or “LHC/Pascal’s Wager” were unrigorous, and I concede those points.
I think I can now articulate my one, non-speculative crux.
The “so what” of Toby Ord’s (excellent) analysis is that it provides a perfect, rigorous, “hindsight” view of the last paradigm—what I’ve been calling “Phase 1” RL for alignment.
My core uncertainty isn’t speculative “what-if” hope. It’s that the empirical ground is shifting.
The very recent papers we discussed (Khatri et al. on the “art” of scaling, and Tan et al. on math reasoning) are, for me, the first public, rigorous evidence for a “Phase 2” capability paradigm.
• They provide a causal mechanism for why the old, simple scaling data may be an unreliable predictor.
• They show this “Phase 2” regime is different: it’s not a simple power law but a complex, recipe-dependent “know-how” problem (Khatri), and it has different efficiency dynamics (Tan).
This, for me, is the action-relevant dilemma.
We are no longer in a state of “pure speculation”. We are in a state of grounded, empirical uncertainty, where the public research is just now documenting a new, more complex scaling regime that the private labs may well have been pursuing for some time.
Given that the lead time for any serious safety work is measured in years, and the nature of the breakthrough is a proprietary, secret “recipe,” the “wait for public proof” strategy seems non-robust.
That’s the core of my concern. I’m now much clearer on the crux of the argument, and I can’t thank you enough for pushing me to be more rigorous. This has been incredibly helpful, and I’ll leave it there.
Hello, Matt. Let me just say I really appreciate your friendly, supportive, and positive approach to this conversation. It’s very nice. Discussions on the EA Forum can get pretty sour sometimes, and I’m probably not entirely blameless in that myself.
You don’t have to reply if you don’t want to, but I wanted to follow up in case you did.
Can you explain what you mean about the data efficiency of the new RL techniques in the papers you mentioned? You say it’s more complex, but that doesn’t help me understand.
By the way, did you use an LLM like Claude or ChatGPT to help write your comment? It has some of the hallmarks of LLM writing to me. I’m just saying this to help you — you may not realize how much LLMs’ writing style sticks out like a sore thumb (depending on how you use them), and it will likely discourage people from engaging with you if they detect it. I keep encouraging people to trust themselves as writers and to trust their own voice, and reassuring them that the imperfections of their writing don’t make us, the readers, like it less; they make us like it more.