I guess there could have recently been a major breakthrough in RL at any of the major AI companies that the public doesn't know about yet. Or there could be one soon that we wouldn't know about right away. But why think that is the case? And why think that is more likely at this particular point in time than at any other time within the last 10 years or so?
Can you explain what "LLMs as reward models" and "verifiable domains for self-play" mean and why these would make RL dramatically more compute efficient? I'm guessing that "LLMs as reward models" means that the representational power of LLMs is far greater than that of RL agents in the past. But hasn't RLHF been used on LLMs since before the first version of ChatGPT? So wouldn't our idea of how quickly LLMs learn or improve using RL, drawn from the past 3 years or so, already account for LLMs as reward models?
By "verifiable domains for self-play", do you mean we have benchmarks or environments that are automatically gradable and can provide a reward signal without a human manually taking any action? If so, again, that seems like something that should already be accounted for in the last 3 years or so of data.
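For concreteness, here is a minimal sketch of the distinction between the two terms as I understand them. The function names and the scoring format are hypothetical stand-ins for illustration, not anyone's actual implementation: a "verifiable domain" reward can be computed automatically, while an "LLM as reward model" reward asks another model to judge.

```python
# Toy illustration of two kinds of reward signal. All names here are
# hypothetical stand-ins, not a real library or lab pipeline.

def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """A 'verifiable domain' reward: automatically gradable, no human
    and no learned judge needed. Here, exact-match grading of a short
    answer; real graders might also use unit tests or proof checkers."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def llm_judge_reward(prompt: str, response: str, judge) -> float:
    """An 'LLM as reward model' reward: a (possibly larger) LLM scores
    the response. `judge` is a stand-in for whatever scoring model is
    used, assumed here to return a numeric string from 0 to 10."""
    score_text = judge(f"Rate 0-10 how helpful this is.\n"
                       f"Prompt: {prompt}\nResponse: {response}")
    return float(score_text) / 10.0
```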
If what you're saying is that LLMs as reward models or verifiable domains for self-play could contribute to research or innovation in RL such that a major breakthrough in RL compute efficiency is more likely, I don't follow the reasoning there.
You also mentioned "unprecedented compute for experiments", which could perhaps be a factor that contributes to the likelihood of such a breakthrough, but who knows. Why couldn't you test an idea for more compute-efficient RL on a small number of GPUs first and see if you get early results? Why would having a lot more GPUs help? With a lot of GPUs, you could test more ideas in parallel, but is the limiting factor really the ability to test ideas, or is it coming up with new ideas in the first place?
Yarrow, these are fantastic, sharp questions. Your "already accounted for" point is the strongest counter-argument I've encountered.
You're correct in your interpretation of the terms. And your core challenge (if LLM reward models and verifiable domains have existed for ~3 years, shouldn't their impact already be visible?) is exactly what I'm grappling with.
Let me try to articulate my hypothesis more precisely:
The Phase 1 vs Phase 2 distinction:
I wonder if we're conflating two different uses of RL that might have very different efficiency profiles:
1. Phase 1 (Alignment/Style): This is the RLHF that created ChatGPT, steering a pretrained model to be helpful/harmless. This has been done for ~3 years and is probably what's reflected in public benchmark data.
2. Phase 2 (Capability Gains): This is using RL to make models fundamentally more capable at tasks through extended reasoning or self-play (e.g., o1, AlphaGo-style approaches).
My uncertainty is: could "Phase 2" RL have very different efficiency characteristics than "Phase 1"?
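To make the "Phase 2" idea concrete, here is a toy sketch of the kind of training loop I have in mind, loosely in the style of GRPO-like RL on verifiable rewards. Everything here (`policy.sample`, `verifier`, `policy.reinforce`) is a hypothetical stand-in, not any lab's actual recipe:

```python
# Toy "Phase 2" RL step: sample several reasoning traces per problem,
# grade them automatically, and reinforce the above-average ones.
# All objects are hypothetical stand-ins for illustration only.

def phase2_step(policy, verifier, problem, k: int = 8):
    traces = [policy.sample(problem) for _ in range(k)]   # k reasoning attempts
    rewards = [verifier(problem, t) for t in traces]      # auto-graded, e.g. 0 or 1
    baseline = sum(rewards) / k                           # group-mean baseline
    for trace, reward in zip(traces, rewards):
        advantage = reward - baseline                     # relative credit
        policy.reinforce(trace, advantage)                # push toward better traces
```

The contrast with "Phase 1" is that the reward comes from an automatic verifier on the task itself, rather than from a learned preference model over style.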
Recent academic evidence:
Some very recent papers seem to directly address this question:
⢠A paper by Khatri et al., âThe Art of Scaling Reinforcement Learning Compute for LLMsâ (arXiv: 2510.13786), appears to show that simple RL methods do hit hard performance ceilings (validating your skepticism), but that scaling RL is a complex âart.â It suggests a specific recipe (ScaleRL) can achieve predictable scaling. This hints the bottleneck might be âknow-howâ rather than a fundamental limit.
⢠Another paper by Tan et al., âScaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoningâ (arXiv: 2509.25300), on scaling RL for math found that performance is more bound by data quality (like from verifiable domains) than just compute, and that larger models are more compute- and sample-efficient at these tasks.
Why this seems relevant:
This research suggests "Phase 1" RL (simple, public methods) and "Phase 2" RL (complex recipes, high-quality data, large models) might have quite different scaling properties.
This makes me wonder if the scaling properties from prior RL research might not fully capture what's possible in this new regime: very large models + high-quality verifiable domains + substantial compute + the right training recipe. Prior research isn't irrelevant, but perhaps extrapolation from it is unreliable when the conditions are changing this much?
If labs have found (or are close to finding) these "secret recipes" for scalable RL, that could explain continued capital investment from well-informed actors despite public data showing plateaus.
The action-relevant dilemma:
Even granting the epistemic uncertainty, there seems to be a strategic question: given long lead times for safety research, should researchers hedge by preparing for RL efficiency improvements, even if we can't confidently predict them?
The asymmetry: if we wait for public evidence before starting safety work, and RL does become substantially more efficient (because a lab finds the right "recipe"), we'll have even less lead time. But if we prepare unnecessarily, we've misallocated resources.
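One way to make that asymmetry explicit, with p, L, and W as loose stand-ins rather than estimates:

```latex
% p = probability of an RL efficiency breakthrough,
% L = cost of having too little safety lead time if it happens,
% W = cost of wasted preparation if it does not.
% Hedging is favoured whenever
\[
  p \cdot L \;>\; (1 - p) \cdot W
  \quad\Longleftrightarrow\quad
  p \;>\; \frac{W}{L + W}.
\]
```

If L is much larger than W, even a modest p clears the threshold, which is the intuition behind hedging. Of course, this just relocates the disagreement into the choice of p, L, and W.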
I don't have a clean answer to what probability threshold for a potential breakthrough justifies heightened precautionary work. But the epistemic uncertainty itself, combined with some papers suggesting the scaling regime might be fundamentally different than assumed, makes me worry that we're evaluating the efficiency of propellers while jet engines are being invented in private.
Does this change your analysis at all, or do you think the burden of proof still requires more than theoretical papers about potential scaling regimes?
Thank you for your kindness. I appreciate it. :)

Do the two papers you mentioned give specific quantitative information about how much LLM performance increases as the compute used for RL scales? And is that scaling substantially more efficient than what Toby Ord assumes in the post above?
In terms of AI safety research, this is getting into a very broad, abstract, philosophical point, but, personally, I'm fairly skeptical that anybody can do AI safety research today that will apply to much more powerful, much more general AI systems in the future. I guess if you think the more powerful, more general AI systems of the future will just be bigger versions of the type of systems we have today, then it makes sense why you'd think AI safety research would be useful now. But I think there are good reasons for doubting that, and LLM scaling running out of steam is just one of those reasons.
To take a historical example, the Machine Intelligence Research Institute (MIRI) had some very specific ideas about AI safety and alignment dating back to before the deep learning revolution that started around 2012. ~~I recall having an exchange with Eliezer Yudkowsky, who co-founded MIRI and does research there, on Facebook sometime around 2015-2017 where he expressed doubt that deep learning was the way to get to AGI and said his best bet was that symbolic AI was the most promising approach. At some point, he must have changed his mind, but I can't find any writing he's done or any talk or interview where he explains when and why his thinking changed.~~

[Edited on 2026-01-18 at 20:55 UTC to add: I misremembered some important details about my exchanges on Facebook with Eliezer Yudkowsky and another person at MIRI, Rob Bensinger, about deep learning and other AI paradigms around 2016-2018. Take my struck-through recollections above as unreliable memory. I went through the trouble of digging up some old Facebook comments and detailed what I found here.]
In any case, one criticism, which I agree with, that has been made of Yudkowsky's and MIRI's current ideas about AI safety and alignment is that these ideas have not been updated in the last 13 years, and remain the same ideas that Yudkowsky and MIRI were advocating before the deep learning revolution. And there are strong reasons to doubt they still apply to frontier AI systems, if they ever did. What we would expect from Yudkowsky and MIRI at this point is either an updating of their ideas about safety and alignment, or an explanation of why their ideas developed with symbolic AI in mind should still apply, without modification, to deep learning-based systems. It's hard to understand why this point hasn't been addressed, particularly since people have been bringing it up for years. It comes across, in the words of one critic, as a sign of thinkers who are "persistently unable to update their priors."
What I just said about MIRI's views on AI safety and alignment could be applied to AI safety more generally. Ideas developed on the assumption that current techniques, architectures, designs, or paradigms will scale all the way to AGI could turn out to be completely useless and irrelevant if more powerful and more general AI systems end up being built on entirely novel ideas that we can't anticipate yet. You used an aviation analogy; let me try my own. Research on AI safety that assumes LLMs will scale to AGI, and is therefore based on studying the properties peculiar to LLMs, might turn out to be a waste of time if the technology goes in another direction, just as aviation safety research that assumed airships would underlie air travel, and so focused on the properties of hydrogen and helium gas, has no relevance to a world where air travel is powered by airplanes that are heavier than air.
It's relevant to bring up at this point that a survey of AI experts found that 76% of them think it's unlikely or very unlikely that current AI techniques, such as LLMs, will scale to AGI. There are many reasons to agree with the majority of experts on this question, some of which I briefly listed in a post here.
Because I don't see scaling up LLMs as a viable path to AGI, I personally don't see much value in AI safety research that assumes it is. (To be clear, AI safety research about things like how LLM-based chatbots can safely respond to users who express suicidal ideation, and avoid being prompted into saying something harmful or dangerous, could potentially be very valuable, but that's about present-day use cases of LLMs, not about AGI or global catastrophic risk, which is what we've been talking about.) In general, I'm very sympathetic to a precautionary, "better safe than sorry" approach, but, to me, AI safety or alignment research can't even be justified on those grounds. The chance of LLMs scaling up to AGI seems so remote.
It's also unlike the remote chance of an asteroid strike, where we have hard science that can calculate the probability rigorously. It's more like the remote chance that the Large Hadron Collider (LHC) would create a black hole, which can only be assigned a probability above zero because of fundamental epistemic uncertainty, i.e., the chance that we've gotten the laws of physics wrong. I don't know if I can quite put my finger on why I don't like arguments for practical measures to mitigate existential risk that rest on fundamental epistemic uncertainty. But I can point out that this form of argument would seem to lead to some very bizarre implications.
For example, what probability do we assign to the possibility that Christian fundamentalism is correct? If we assign a probability above zero, then this leads us literally to Pascal's wager, because the utility of heaven is infinite, the disutility of hell is infinite, and the cost of complying with the Christian fundamentalist requirements for going to heaven is not only finite but relatively modest. Reductio ad absurdum?
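Spelled out, the wager's arithmetic (with c standing in for the finite cost of compliance):

```latex
% For any prior p > 0 on fundamentalism being correct, infinite stakes
% swamp the finite cost c of compliance:
\[
  \mathbb{E}[\text{comply}] = p \cdot (+\infty) - (1 - p) \cdot c = +\infty,
  \qquad
  \mathbb{E}[\text{don't}] = p \cdot (-\infty) + (1 - p) \cdot 0 = -\infty .
\]
```

The conclusion follows for any nonzero p, no matter how small, which is exactly what makes this style of argument suspect to me.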
By contrast, we know for sure dangerous asteroids are out there, we know they've hit Earth before, and we have rigorous techniques for observing them, tracking them, and predicting their trajectories. When NASA says there's a 1 in 10,000 chance of an asteroid hitting Earth, that's an entirely different kind of probability than when a Bayesian utilitarian guesses there's a 1 in 10,000 chance that Christian fundamentalism is correct, that the LHC will create a black hole, or that LLMs will scale to AGI within two decades.
One way I can try to articulate my dissatisfaction with the argument that we should do AI safety research anyway, just in case, is to point out that there's no self-evident, completely neutral, or agnostic perspective from which to work on AGI safety. For example, what if the first AGIs we build would otherwise have been safe, aligned, and friendly, but by applying alignment techniques developed from AI safety research, we actually make them incredibly dangerous and cause a global catastrophe? How do we know which kind of action is actually precautionary?
I could also make the point that, in some very real and practical sense, all AI research is a tradeoff against other kinds of AI research that could have been done instead. So maybe, instead of focusing on LLMs, it's wiser to focus on alternative ideas like energy-based models, program synthesis, neuromorphic AI, or fundamental RL research. The approach of trying to squeeze Bayesian blood from the stone of uncertainty by making subjective guesses at probabilities can only take you so far, and the limitations become apparent pretty quickly.
To make myself fully clear and put my cards completely on the table, I don't find effective altruism's treatment of near-term AGI to be particularly intellectually rigorous or persuasive, and I suspect at least some people in EA who currently think very near-term AGI is very likely will experience a wave of doubt when the AI investment bubble pops sometime within the next few years. No external event, evidence, or argument can compel someone to update their views if they're inclined enough to resist updating, but I suspect some people in EA will interpret the AI bubble popping as new information and take it as an opportunity to think carefully about their views on near-term AGI.
But if you think that very near-term AGI is very likely, and that LLMs will very likely scale to AGI, then that implies an entirely different idea of what should be done, practically, in AI safety research today, and if you're sticking to those assumptions, then I'm the wrong person to ask about what should be done.
Yarrow, thank you for this sharp and clarifying discussion.
You have completely convinced me that my earlier arguments from "investment as a signal" and "LHC/Pascal's wager" were unrigorous, and I concede those points.
I think I can now articulate my one, non-speculative crux.
The "so what" of Toby Ord's (excellent) analysis is that it provides a perfect, rigorous, hindsight view of the last paradigm: what I've been calling "Phase 1" RL for alignment.
My core uncertainty isn't speculative "what-if" hope. It's that the empirical ground is shifting.
The very recent papers we discussed (Khatri et al. on the "art" of scaling, and Tan et al. on math reasoning) are, for me, the first public, rigorous evidence for a "Phase 2" capability paradigm.
⢠They provide a causal mechanism for why the old, simple scaling data may be an unreliable predictor.
⢠They show this âPhase 2â regime is different: itâs not a simple power law but a complex, recipe-dependent âknow-howâ problem (Khatri), and it has different efficiency dynamics (Tan).
This, for me, is the action-relevant dilemma.
We are no longer in a state of "pure speculation". We are in a state of grounded, empirical uncertainty, where public research is just now documenting a new, more complex scaling regime that the private labs have been pursuing in secret.
Given that the lead time for any serious safety work is measured in years, and that the nature of the breakthrough is a proprietary, secret "recipe," the "wait for public proof" strategy seems non-robust.
That's the core of my concern. I'm now much clearer on the crux of the argument, and I can't thank you enough for pushing me to be more rigorous. This has been incredibly helpful, and I'll leave it there.
Hello, Matt. Let me just say I really appreciate your friendly, supportive, and positive approach to this conversation. It's very nice. Discussions on the EA Forum can get pretty sour sometimes, and I'm probably not entirely blameless in that myself.
You don't have to reply if you don't want to, but I just wanted to follow up in case you did.
Can you explain what you mean about the data efficiency of the new RL techniques in the papers you mentioned? You say it's more complex, but that doesn't help me understand.
By the way, did you use an LLM like Claude or ChatGPT to help write your comment? It has some of the hallmarks of LLM writing to me. I'm just saying this to help you: you may not realize how much LLMs' writing style sticks out like a sore thumb (depending on how you use them), and it will likely discourage people from engaging with you if they detect it. I keep encouraging people to trust themselves as writers and trust their own voice, and reassuring them that the imperfections of their writing don't make us, the readers, like it less; they make us like it more.