For test-time compute, accuracy on the benchmark only grows with the logarithm of compute – you need exponential increases in compute to get linear increases in accuracy. It’s similar to the pretraining scaling law.
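To show the shape of that curve, here’s a toy sketch – the constants are illustrative, not fitted to any real benchmark:

```python
import math

# Toy log-linear scaling: each 10x increase in test-time compute adds a
# fixed number of accuracy points. Constants are illustrative, not fitted.
BASE_ACCURACY = 40.0   # accuracy (%) at 1 unit of test-time compute
POINTS_PER_10X = 8.0   # accuracy points gained per 10x more compute

def accuracy(test_time_compute: float) -> float:
    """Benchmark accuracy (%) as a function of test-time compute, capped at 100."""
    return min(100.0, BASE_ACCURACY + POINTS_PER_10X * math.log10(test_time_compute))

for compute in [1, 10, 100, 1_000, 10_000]:
    print(f"{compute:>6}x compute -> {accuracy(compute):.1f}% accuracy")
```

Each row costs 10x more compute than the last but adds the same fixed number of points, which is why scaling test-time compute alone gets expensive quickly.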
I agree test-time compute isn’t especially explosive – it mainly serves to “pull forward” more advanced capabilities by 1-2 years.
More broadly, you can trade off training compute against inference compute: https://epoch.ai/blog/trading-off-compute-in-training-and-inference
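As a toy illustration of that trade-off (my made-up numbers, not Epoch’s model): you can hit the same performance target either by training a bigger model and sampling once per query, or by training a smaller model and spending more inference compute per query (e.g. best-of-n sampling). Which is cheaper depends on how many queries you serve:

```python
# Two hypothetical ways to reach the same performance target.
# Strategy A: more training compute, cheap inference (1 sample per query).
# Strategy B: 10x less training compute, 20x more inference per query.
# All FLOP figures are illustrative.

TRAIN_A, INFER_A = 1e24, 1e15   # training FLOP / inference FLOP per query
TRAIN_B, INFER_B = 1e23, 2e16

def total_flop(train: float, infer_per_query: float, num_queries: float) -> float:
    """Total compute = one-off training cost + inference cost over deployment."""
    return train + infer_per_query * num_queries

for n_queries in [1e6, 1e8, 1e10]:
    a = total_flop(TRAIN_A, INFER_A, n_queries)
    b = total_flop(TRAIN_B, INFER_B, n_queries)
    cheaper = "B (less training)" if b < a else "A (more training)"
    print(f"{n_queries:.0e} queries: A={a:.2e} FLOP, B={b:.2e} FLOP -> cheaper: {cheaper}")
```

With few deployment queries the inference-heavy strategy wins; at large deployment scale the extra training compute pays for itself, so the swap has limits.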
On brute force, I mainly took Toby’s thread to be saying we don’t have enough information to know how effective test-time compute is compared to brute force.
Glad it’s useful! I categorise RL on chain of thought as a type of post-training, rather than test-time compute. (Sometimes people lump them together as both ‘inference scaling’, but I think that’s confusing.) I agree RL opens up novel capabilities you can’t get just from next-token prediction on the internet.