SummaryBot comments on Benchmark Performance is a Poor Measure of Generalisable AI Reasoning Capabilities