Ah, that’s a really interesting way of looking at it: that you can trade training compute for inference compute, only bringing forward capabilities that would have come about anyway by simply training larger models. I hadn’t quite got this message from your post.
My understanding of Francois Chollet’s position (he’s where I first heard the comparison of logarithmic inference-time scaling to brute-force search, before I saw Toby’s thread) is that RL on chain of thought has unlocked genuinely new capabilities that would have been impossible simply by scaling traditional LLMs. (Or maybe it has to be chain of thought combined with tree search; whatever the magic ingredient is, he has acknowledged that o3 has it.)
Of course this could just be his way of explaining why the o3 ARC results don’t prove his earlier positions wrong. People don’t like to admit when they’re wrong! But this view still seems plausible to me; it contradicts the ‘trading off’ narrative, and I’d be extremely interested to know which picture is correct. I’ll have to read that paper!
But I guess maybe it doesn’t matter a lot in practice, in terms of the impact that reasoning models are capable of having.
Glad it’s useful! I categorise RL on chain of thought as a type of post-training, rather than test-time compute. (Sometimes people lump them together as both ‘inference scaling’, but I think that’s confusing.) I agree RL opens up novel capabilities you can’t get just from next-token prediction on the internet.