This was a thought-provoking and quite scary summary of what reasoning models might mean.
I think this sentence may have a mistake though:
“you can have GPT-o1 think 100-times longer than normal, and get linear increases in accuracy on coding problems.”
Doesn’t the graph show that the accuracy gains are only logarithmic? The x-axis is a log scale.
This logarithmic relationship between performance and test-time compute is characteristic of brute-force search, and maybe is the one part of this story that means the consequences won’t be quite so explosive? Or have I misunderstood?
For test-time compute, accuracy on the benchmark increases roughly linearly with the logarithm of compute – in other words, you need exponential increases in compute to get linear increases in accuracy. It’s similar to the pretraining scaling law.
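To make that concrete (a stylised form with illustrative placeholder constants $a$ and $b$, not values fitted to the graph): if accuracy follows

$$\text{accuracy}(C) \approx a + b \cdot \log_{10} C,$$

then multiplying test-time compute $C$ by 100 adds only about $2b$ percentage points, and each further gain of the same size costs another factor of ten in compute.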
I agree test-time compute isn’t especially explosive – it mainly serves to “pull forward” more advanced capabilities by 1-2 years.
More broadly, you can swap training for inference: https://epoch.ai/blog/trading-off-compute-in-training-and-inference
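As a toy illustration of that trade-off (my own stylised model, not the exact relationship estimated in the Epoch post): suppose

$$\text{accuracy} \approx a + b \log_{10} C_{\text{train}} + c \log_{10} C_{\text{infer}}.$$

Then a model trained with $\Delta$ fewer orders of magnitude of compute can match the larger model by spending roughly $10^{(b/c)\Delta}$ times more at inference – which is why extra inference compute mostly buys you capabilities earlier, rather than capabilities you couldn’t have reached by training more.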
On brute force, I mainly took Toby’s thread to be saying that it isn’t clear we have enough information to know how effective test-time compute is relative to brute-force search.
Ah, that’s a really interesting way of looking at it: trading training compute for inference compute only brings forward capabilities that would have come about anyway from simply training larger models. I hadn’t quite got that message from your post.
My understanding of Francois Chollet’s position (he’s where I first heard the comparison of logarithmic inference-time scaling to brute-force search, before I saw Toby’s thread) is that RL on chain of thought has unlocked genuinely new capabilities that would have been impossible simply by scaling traditional LLMs. (Or maybe it has to be chain of thought combined with tree search; whatever the magic ingredient is, he has acknowledged that o3 has it.)
Of course this could just be his way of explaining why the o3 ARC results don’t prove his earlier positions wrong. People don’t like to admit when they’re wrong! But this view still seems plausible to me, it contradicts the ‘trading off’ narrative, and I’d be extremely interested to know which picture is correct. I’ll have to read that paper!
But I guess maybe it doesn’t matter a lot in practice, in terms of the impact that reasoning models are capable of having.
Glad it’s useful! I categorise RL on chain of thought as a type of post-training, rather than test-time compute. (Sometimes people lump them together as both ‘inference scaling’, but I think that’s confusing.) I agree RL opens up novel capabilities you can’t get just from next-token prediction on the internet.
Agreed, “linear increases” seems to be an incorrect reading of the graph.