Hi Toby, thanks for the comment.
I have read about some of the work on tackling the ARC dataset, and I am not at all confident that the approaches which perform well have anything to do with generalisable reasoning. The underlying problem remains: there is no validation that the benchmark measures what it claims to measure. I don’t know what methods o3 used to solve it, and until I do I don’t believe OpenAI’s marketing claim that its success must reflect generalisable reasoning.
As to why we’d see inference-time scaling if chain-of-thought were little more than post-hoc rationalization: this is still an open question, but the gains seem to be partly driven by increased compute time and token counts alone. I don’t have the full answer here, but the evidence we do have strongly cautions against simply assuming these models are doing what we’d describe as ‘genuine reasoning’.
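
To make that point concrete, here is a toy simulation (entirely hypothetical numbers, and it assumes the extra tokens are spent on something like best-of-N sampling with a verifier, which is only one of several possible mechanisms): accuracy climbs smoothly with the compute budget even though each individual attempt is just a weighted coin flip, with nothing resembling reasoning involved.

```python
import random

def solve_once(p_correct: float) -> bool:
    """One attempt by a 'model' with a fixed chance of guessing right.
    No reasoning happens here; success is pure chance."""
    return random.random() < p_correct

def best_of_n(p_correct: float, n_samples: int, trials: int = 10_000) -> float:
    """Accuracy when we spend n_samples worth of compute per problem
    and count the problem solved if any one sample succeeds."""
    solved = sum(
        any(solve_once(p_correct) for _ in range(n_samples))
        for _ in range(trials)
    )
    return solved / trials

# Accuracy rises smoothly with compute despite every sample being blind.
# p_correct=0.05 is a made-up per-attempt success rate, purely illustrative.
for n in (1, 2, 4, 8, 16, 32):
    print(f"samples={n:3d}  accuracy={best_of_n(0.05, n):.3f}")
```

The only point of the sketch is that a smooth inference-time scaling curve is equally compatible with brute-force search over samples, so the curve alone can’t tell us which of the two we are looking at.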