I think the post makes a useful snapshot point: for some public “reasoning model vs base” comparisons, a large fraction of measured uplift shows up as increased inference-time deliberation.
But I don’t think that supports the stronger, extrapolative claim that “RL is extremely inefficient for frontier models” or that “recent gains are mostly inference” in a way that should generalize.
Two missing gears:
Inference-time compute is not just a permanent tax: it can become training signal via an inference → data → filtering/verification → distillation loop. So “mostly inference” today can still translate into “training gains” over time, and the persistence of the burden is an empirical question.
Attribution is underdetermined from public model comparisons: these decompositions confound base model changes, post-training changes, and eval settings. Without fixed-budget comparisons and multi-generation ablations, it’s hard to justify strong global conclusions from a narrow snapshot.
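To make the first gear concrete, here is a minimal toy sketch of the inference → data → filtering/verification → distillation loop. Everything in it is a hypothetical stand-in (the `sample_solutions` and `verify` functions are placeholders, not any lab's actual pipeline); the point is just the shape: spend inference compute generating candidate traces, keep only the verified ones, and feed those back as supervised training data.

```python
import random

random.seed(0)

def sample_solutions(problem, n=8):
    """Stand-in for sampling n inference-time attempts at a problem.
    Here a 'problem' is an integer and an 'attempt' is a noisy guess."""
    return [problem + random.randint(-2, 2) for _ in range(n)]

def verify(problem, answer):
    """Toy verifier. In RLVR-style setups this would be a unit test,
    proof checker, or exact-match grader -- something cheap and reliable."""
    return answer == problem

def build_distillation_set(problems):
    """Inference -> data -> filtering: keep one verified trace per problem."""
    dataset = []
    for p in problems:
        for ans in sample_solutions(p):
            if verify(p, ans):
                dataset.append((p, ans))
                break
    return dataset

problems = list(range(20))
data = build_distillation_set(problems)
# `data` would then go into supervised fine-tuning (the distillation step),
# converting this round's inference-time compute into next round's training
# signal -- which is why "mostly inference today" need not stay a fixed tax.
```

Whether this loop actually compounds across generations (versus saturating once the easy verified traces are harvested) is exactly the empirical question the comment flags.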
So I’d endorse a modest takeaway (“inference scaling is currently important”), but I’m skeptical of treating observed RLVR curves and one-generation decompositions as a stable law.