Having followed a lot of AI benchmarks over the years, my main heuristic takeaway regarding expert-parity claims is "prepare to be disappointed once you dig in", alongside "but they were still useful in advancing understanding and progress" — cf. SemiAnalysis' Benchmarks are bad but we need to keep using them anyways section for an outside-of-EA perspective. I'm also less bullish on long-range, poor-feedback-loop superforecasting more generally, for reasons along the lines of superforecaster Eli Lifland's takes (esp. #2 and #4), Dan Luu's appendix notes and his comparisons to the actually-accurate futurists his review found, nostalgebraist on Metaculus badness, etc., which collectively reduce my enthusiasm for automating this.