I guess there could have recently been a major breakthrough in RL at any of the major AI companies that the public doesn't know about yet. Or there could be one soon that we wouldn't know about right away. But why think that is the case? And why think that is more likely at this particular point in time than at any other time within the last 10 years or so?
Can you explain what "LLMs as reward models" and "verifiable domains for self-play" mean and why these would make RL dramatically more compute efficient? I'm guessing that "LLMs as reward models" means that the representational power of LLMs is far greater than that of RL agents in the past. But hasn't RLHF been used on LLMs since before the first version of ChatGPT? So wouldn't our idea of how quickly LLMs learn or improve using RL, drawn from the past 3 years or so, already account for LLMs as reward models?
By "verifiable domains for self-play", do you mean we have benchmarks or environments that are automatically gradable and can provide a reward signal without a human manually taking any action? If so, again, that seems like something that should already be accounted for in the last 3 years or so of data.
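For concreteness, here is a minimal sketch of the distinction between the two terms as I understand them. The function names and the scoring format are hypothetical stand-ins for illustration, not anyone's actual implementation: a "verifiable domain" reward can be computed automatically, while an "LLM as reward model" reward asks another model to judge.

```python
# Toy illustration of two kinds of reward signal. All names here are
# hypothetical stand-ins, not a real library or lab pipeline.

def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """A 'verifiable domain' reward: automatically gradable, no human
    and no learned judge needed. Here, exact-match grading of a short
    answer; real graders might also use unit tests or proof checkers."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def llm_judge_reward(prompt: str, response: str, judge) -> float:
    """An 'LLM as reward model' reward: a (possibly larger) LLM scores
    the response. `judge` is a stand-in for whatever scoring model is
    used, assumed here to return a numeric string from 0 to 10."""
    score_text = judge(f"Rate 0-10 how helpful this is.\n"
                       f"Prompt: {prompt}\nResponse: {response}")
    return float(score_text) / 10.0
```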
If what you're saying is that LLMs as reward models or verifiable domains for self-play could contribute to research or innovation in RL such that a major breakthrough in RL compute efficiency is more likely, I don't follow the reasoning there.
You also mentioned "unprecedented compute for experiments", which could perhaps be a factor that contributes to the likelihood of such a breakthrough, but who knows. Why couldn't you test an idea for more compute-efficient RL on a small number of GPUs first and see if you get early results? Why would having a lot more GPUs help? With a lot of GPUs, you could test more ideas in parallel, but is the limiting factor really the ability to test ideas, or is it coming up with new ideas in the first place?
Yarrow, these are fantastic, sharp questions. Your "already accounted for" point is the strongest counter-argument I've encountered.
You're correct in your interpretation of the terms. And your core challenge (if LLM reward models and verifiable domains have existed for ~3 years, shouldn't their impact already be visible?) is exactly what I'm grappling with.
Let me try to articulate my hypothesis more precisely:
The Phase 1 vs Phase 2 distinction:
I wonder if we're conflating two different uses of RL that might have very different efficiency profiles:
1. Phase 1 (Alignment/Style): This is the RLHF that created ChatGPT, steering a pretrained model to be helpful/harmless. This has been done for ~3 years and is probably what's reflected in public benchmark data.
2. Phase 2 (Capability Gains): This is using RL to make models fundamentally more capable at tasks through extended reasoning or self-play (e.g., o1, AlphaGo-style approaches).
My uncertainty is: could "Phase 2" RL have very different efficiency characteristics than "Phase 1"?
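To make the "Phase 2" idea concrete, here is a toy sketch of the kind of training loop I have in mind, loosely in the style of GRPO-like RL on verifiable rewards. Everything here (`policy.sample`, `verifier`, `policy.reinforce`) is a hypothetical stand-in, not any lab's actual recipe:

```python
# Toy "Phase 2" RL step: sample several reasoning traces per problem,
# grade them automatically, and reinforce the above-average ones.
# All objects are hypothetical stand-ins for illustration only.

def phase2_step(policy, verifier, problem, k: int = 8):
    traces = [policy.sample(problem) for _ in range(k)]   # k reasoning attempts
    rewards = [verifier(problem, t) for t in traces]      # auto-graded, e.g. 0 or 1
    baseline = sum(rewards) / k                           # group-mean baseline
    for trace, reward in zip(traces, rewards):
        advantage = reward - baseline                     # relative credit
        policy.reinforce(trace, advantage)                # push toward better traces
```

The contrast with "Phase 1" is that the reward comes from an automatic verifier on the task itself, rather than from a learned preference model over style.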
Recent academic evidence:
Some very recent papers seem to directly address this question:
⢠A paper by Khatri et al., âThe Art of Scaling Reinforcement Learning Compute for LLMsâ (arXiv: 2510.13786), appears to show that simple RL methods do hit hard performance ceilings (validating your skepticism), but that scaling RL is a complex âart.â It suggests a specific recipe (ScaleRL) can achieve predictable scaling. This hints the bottleneck might be âknow-howâ rather than a fundamental limit.
⢠Another paper by Tan et al., âScaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoningâ (arXiv: 2509.25300), on scaling RL for math found that performance is more bound by data quality (like from verifiable domains) than just compute, and that larger models are more compute- and sample-efficient at these tasks.
Why this seems relevant:
This research suggests "Phase 1" RL (simple, public methods) and "Phase 2" RL (complex recipes, high-quality data, large models) might have quite different scaling properties.
This makes me wonder if the scaling properties from prior RL research might not fully capture what's possible in this new regime: very large models + high-quality verifiable domains + substantial compute + the right training recipe. Prior research isn't irrelevant, but perhaps extrapolation from it is unreliable when the conditions are changing this much?
If labs have found (or are close to finding) these "secret recipes" for scalable RL, that could explain continued capital investment from well-informed actors despite public data showing plateaus.
The action-relevant dilemma:
Even granting the epistemic uncertainty, there seems to be a strategic question: given long lead times for safety research, should researchers hedge by preparing for RL efficiency improvements, even if we can't confidently predict them?
The asymmetry: if we wait for public evidence before starting safety work, and RL does become substantially more efficient (because a lab finds the right "recipe"), we'll have even less lead time. But if we prepare unnecessarily, we've misallocated resources.
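One way to make that asymmetry explicit, with p, L, and W as loose stand-ins rather than estimates:

```latex
% p = probability of an RL efficiency breakthrough,
% L = cost of having too little safety lead time if it happens,
% W = cost of wasted preparation if it does not.
% Hedging is favoured whenever
\[
  p \cdot L \;>\; (1 - p) \cdot W
  \quad\Longleftrightarrow\quad
  p \;>\; \frac{W}{L + W}.
\]
```

If L is much larger than W, even a modest p clears the threshold, which is the intuition behind hedging. Of course, this just relocates the disagreement into the choice of p, L, and W.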
I don't have a clean answer to what probability threshold for a potential breakthrough justifies heightened precautionary work. But the epistemic uncertainty itself, combined with some papers suggesting the scaling regime might be fundamentally different than assumed, makes me worry that we're evaluating the efficiency of propellers while jet engines are being invented in private.
Does this change your analysis at all, or do you think the burden of proof still requires more than theoretical papers about potential scaling regimes?
Thank you for your kindness. I appreciate it. :)

Do the two papers you mentioned give specific quantitative information about how much LLM performance increases as the compute used for RL scales? And is that scaling substantially more efficient than what Toby Ord assumes in the post above?
In terms of AI safety research, this is getting into a very broad, abstract, philosophical point, but, personally, I'm fairly skeptical that anybody can do AI safety research today that will apply to much more powerful, much more general AI systems in the future. I guess if you think the more powerful, more general AI systems of the future will just be bigger versions of the type of systems we have today, then it makes sense why you'd think AI safety research would be useful now. But I think there are good reasons for doubting that, and LLM scaling running out of steam is just one of those reasons.
To take a historical example, the Machine Intelligence Research Institute (MIRI) had some very specific ideas about AI safety and alignment dating back to before the deep learning revolution that started around 2012. ~~I recall having an exchange with Eliezer Yudkowsky, who co-founded MIRI and does research there, on Facebook sometime around 2015-2017 where he expressed doubt that deep learning was the way to get to AGI and said his best bet was that symbolic AI was the most promising approach. At some point, he must have changed his mind, but I can't find any writing he's done or any talk or interview where he explains when and why his thinking changed.~~

[Edited on 2026-01-18 at 20:55 UTC to add: I misremembered some important details about my exchanges on Facebook with Eliezer Yudkowsky and another person at MIRI, Rob Bensinger, about deep learning and other AI paradigms around 2016-2018. Take my struck-through recollections above as unreliable memory. I went through the trouble of digging up some old Facebook comments and detailed what I found here.]
In any case, one criticism, which I agree with, that has been made of Yudkowsky's and MIRI's current ideas about AI safety and alignment is that these ideas have not been updated in the last 13 years, and remain the same ideas that Yudkowsky and MIRI were advocating before the deep learning revolution. And there are strong reasons to doubt they still apply to frontier AI systems, if they ever did. What we would expect from Yudkowsky and MIRI at this point is either an updating of their ideas about safety and alignment, or an explanation of why their ideas developed with symbolic AI in mind should still apply, without modification, to deep learning-based systems. It's hard to understand why this point hasn't been addressed, particularly since people have been bringing it up for years. It comes across, in the words of one critic, as a sign of thinkers who are "persistently unable to update their priors."
What I just said about MIRI's views on AI safety and alignment could be applied to AI safety more generally. Ideas developed on the assumption that current techniques, architectures, designs, or paradigms will scale all the way to AGI could turn out to be completely useless and irrelevant if more powerful and more general AI systems end up being built on entirely novel ideas that we can't anticipate yet. You used an aviation analogy; let me try my own. Research on AI safety that assumes LLMs will scale to AGI, and is therefore based on studying the properties peculiar to LLMs, might turn out to be a waste of time if the technology goes in another direction, just as aviation safety research that assumed airships would underlie air travel, and so focused on the properties of hydrogen and helium gas, has no relevance to a world where air travel is powered by airplanes that are heavier than air.
It's relevant to bring up at this point that a survey of AI experts found that 76% of them think it's unlikely or very unlikely that current AI techniques, such as LLMs, will scale to AGI. There are many reasons to agree with the majority of experts on this question, some of which I briefly listed in a post here.
Because I don't see scaling up LLMs as a viable path to AGI, I personally don't see much value in AI safety research that assumes it is. (To be clear, AI safety research about things like how LLM-based chatbots can safely respond to users who express suicidal ideation, and avoid being prompted into saying something harmful or dangerous, could potentially be very valuable, but that's about present-day use cases of LLMs, not about AGI or global catastrophic risk, which is what we've been talking about.) In general, I'm very sympathetic to a precautionary, "better safe than sorry" approach, but, to me, AI safety or alignment research can't even be justified on those grounds. The chance of LLMs scaling up to AGI seems so remote.
It's also unlike the remote chance of an asteroid strike, where we have hard science that can calculate the probability rigorously. It's more like the remote chance that the Large Hadron Collider (LHC) would create a black hole, which can only be assigned a probability above zero because of fundamental epistemic uncertainty, i.e., the chance that we've gotten the laws of physics wrong. I don't know if I can quite put my finger on why I don't like arguments for practical measures to mitigate existential risk that rest on fundamental epistemic uncertainty. But I can point out that this form of argument would seem to lead to some very bizarre implications.
For example, what probability do we assign to the possibility that Christian fundamentalism is correct? If we assign a probability above zero, then this leads us literally to Pascal's wager, because the utility of heaven is infinite, the disutility of hell is infinite, and the cost of complying with the Christian fundamentalist requirements for going to heaven is not only finite but relatively modest. Reductio ad absurdum?
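Spelled out, the wager's arithmetic (with c standing in for the finite cost of compliance):

```latex
% For any prior p > 0 on fundamentalism being correct, infinite stakes
% swamp the finite cost c of compliance:
\[
  \mathbb{E}[\text{comply}] = p \cdot (+\infty) - (1 - p) \cdot c = +\infty,
  \qquad
  \mathbb{E}[\text{don't}] = p \cdot (-\infty) + (1 - p) \cdot 0 = -\infty .
\]
```

The conclusion follows for any nonzero p, no matter how small, which is exactly what makes this style of argument suspect to me.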
By contrast, we know for sure dangerous asteroids are out there, we know they've hit Earth before, and we have rigorous techniques for observing them, tracking them, and predicting their trajectories. When NASA says there's a 1 in 10,000 chance of an asteroid hitting Earth, that's an entirely different kind of probability than when a Bayesian utilitarian guesses there's a 1 in 10,000 chance that Christian fundamentalism is correct, that the LHC will create a black hole, or that LLMs will scale to AGI within two decades.
One way I can try to articulate my dissatisfaction with the argument that we should do AI safety research anyway, just in case, is to point out that there's no self-evident, completely neutral, or agnostic perspective from which to work on AGI safety. For example, what if the first AGIs we build would otherwise have been safe, aligned, and friendly, but by applying alignment techniques developed from AI safety research, we actually make them incredibly dangerous and cause a global catastrophe? How do we know which kind of action is actually precautionary?
I could also make the point that, in some very real and practical sense, all AI research is a tradeoff against other kinds of AI research that could have been done instead. So maybe, instead of focusing on LLMs, it's wiser to focus on alternative ideas like energy-based models, program synthesis, neuromorphic AI, or fundamental RL research. The approach of trying to squeeze Bayesian blood from the stone of uncertainty by making subjective guesses at probabilities can only take you so far, and the limitations become apparent pretty quickly.
To make myself fully clear and put my cards completely on the table, I don't find effective altruism's treatment of near-term AGI to be particularly intellectually rigorous or persuasive, and I suspect at least some people in EA who currently think very near-term AGI is very likely will experience a wave of doubt when the AI investment bubble pops sometime within the next few years. No external event, evidence, or argument can compel someone to update their views if they're inclined enough to resist updating, but I suspect some people in EA will interpret the AI bubble popping as new information and take it as an opportunity to think carefully about their views on near-term AGI.
But if you think that very near-term AGI is very likely, and that LLMs will very likely scale to AGI, then that implies an entirely different idea of what should be done, practically, in AI safety research today, and if you're sticking to those assumptions, then I'm the wrong person to ask about what should be done.
Yarrow, thank you for this sharp and clarifying discussion.
You have completely convinced me that my earlier arguments from "investment as a signal" and "LHC/Pascal's wager" were unrigorous, and I concede those points.
I think I can now articulate my one, non-speculative crux.
The "so what" of Toby Ord's (excellent) analysis is that it provides a perfect, rigorous, hindsight view of the last paradigm: what I've been calling "Phase 1" RL for alignment.
My core uncertainty isn't speculative "what-if" hope. It's that the empirical ground is shifting.
The very recent papers we discussed (Khatri et al. on the "art" of scaling, and Tan et al. on math reasoning) are, for me, the first public, rigorous evidence for a "Phase 2" capability paradigm.
⢠They provide a causal mechanism for why the old, simple scaling data may be an unreliable predictor.
⢠They show this âPhase 2â regime is different: itâs not a simple power law but a complex, recipe-dependent âknow-howâ problem (Khatri), and it has different efficiency dynamics (Tan).
This, for me, is the action-relevant dilemma.
We are no longer in a state of "pure speculation". We are in a state of grounded, empirical uncertainty, where public research is just now documenting a new, more complex scaling regime that the private labs have been pursuing in secret.
Given that the lead time for any serious safety work is measured in years, and that the nature of the breakthrough is a proprietary, secret "recipe," the "wait for public proof" strategy seems non-robust.
That's the core of my concern. I'm now much clearer on the crux of the argument, and I can't thank you enough for pushing me to be more rigorous. This has been incredibly helpful, and I'll leave it there.
Hello, Matt. Let me just say I really appreciate your friendly, supportive, and positive approach to this conversation. It's very nice. Discussions on the EA Forum can get pretty sour sometimes, and I'm probably not entirely blameless in that myself.
You don't have to reply if you don't want to, but I just wanted to follow up in case you did.
Can you explain what you mean about the data efficiency of the new RL techniques in the papers you mentioned? You say it's more complex, but that doesn't help me understand.
By the way, did you use an LLM like Claude or ChatGPT to help write your comment? It has some of the hallmarks of LLM writing to me. I'm just saying this to help you: you may not realize how much LLMs' writing style sticks out like a sore thumb (depending on how you use them), and it will likely discourage people from engaging with you if they detect it. I keep encouraging people to trust themselves as writers and trust their own voice, and reassuring them that the imperfections of their writing don't make us, the readers, like it less; they make us like it more.