This was a thought-provoking and quite scary summary of what reasoning models might mean.
I think this sentence may have a mistake though:
“you can have GPT-o1 think 100-times longer than normal, and get linear increases in accuracy on coding problems.”
Doesn’t the graph show that the accuracy gains are only logarithmic? The x-axis is a log scale.
This logarithmic relationship between performance and test-time compute is characteristic of brute-force search, and maybe is the one part of this story that means the consequences won’t be quite so explosive? Or have I misunderstood?
For test-time compute, accuracy on the benchmark increases roughly linearly with the logarithm of compute – in other words, you need exponential increases in compute to get linear increases in accuracy. It’s similar to the pretraining scaling law.
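To make that concrete (a stylised form with illustrative placeholder constants $a$ and $b$, not values fitted to the graph): if accuracy follows

$$\text{accuracy}(C) \approx a + b \cdot \log_{10} C,$$

then multiplying test-time compute $C$ by 100 adds only about $2b$ percentage points, and each further gain of the same size costs another factor of ten in compute.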
I agree test-time compute isn’t especially explosive – it mainly serves to “pull forward” more advanced capabilities by 1-2 years.
More broadly, you can swap training for inference: https://epoch.ai/blog/trading-off-compute-in-training-and-inference
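As a toy illustration of that trade-off (my own stylised model, not the exact relationship estimated in the Epoch post): suppose

$$\text{accuracy} \approx a + b \log_{10} C_{\text{train}} + c \log_{10} C_{\text{infer}}.$$

Then a model trained with $\Delta$ fewer orders of magnitude of compute can match the larger model by spending roughly $10^{(b/c)\Delta}$ times more at inference – which is why extra inference compute mostly buys you capabilities earlier, rather than capabilities you couldn’t have reached by training more.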
On brute force, I mainly took Toby’s thread to be saying that it isn’t clear we have enough information to know how effective test-time compute is relative to brute-force search.
Ah, that’s a really interesting way of looking at it: trading training compute for inference compute only brings forward capabilities that would have come about anyway from simply training larger models. I hadn’t quite got that message from your post.
My understanding of Francois Chollet’s position (he’s where I first heard the comparison of logarithmic inference-time scaling to brute-force search, before I saw Toby’s thread) is that RL on chain of thought has unlocked genuinely new capabilities that would have been impossible simply by scaling traditional LLMs. (Or maybe it has to be chain of thought combined with tree search; whatever the magic ingredient is, he has acknowledged that o3 has it.)
Of course this could just be his way of explaining why the o3 ARC results don’t prove his earlier positions wrong. People don’t like to admit when they’re wrong! But this view still seems plausible to me, it contradicts the ‘trading off’ narrative, and I’d be extremely interested to know which picture is correct. I’ll have to read that paper!
But I guess maybe it doesn’t matter a lot in practice, in terms of the impact that reasoning models are capable of having.
Glad it’s useful! I categorise RL on chain of thought as a type of post-training, rather than test-time compute. (Sometimes people lump them together as both ‘inference scaling’, but I think that’s confusing.) I agree RL opens up novel capabilities you can’t get just from next-token prediction on the internet.
Agreed, “linear increases” seems to be an incorrect reading of the graph.