One problem is putting everything on a common scale when historical improvements are so sensitive to the distribution of tasks. A human with a computer running C, compared to a human with only log tables, is a billion times faster at multiplying numbers but less than twice as fast at writing a novel. So your distribution of tasks has to be broad enough to capture the capabilities you care about, but it must also be possible to measure a baseline score at a low tech level, and the benchmark has to allow a wide range of possible scores. This would make the benchmark extremely difficult to construct in practice.
I think that’s right, but modern AI benchmarks seem to have much the same issue. A human with a modern Claude instance might be able to write code 100x faster than without one, but is probably less than 2x as fast at choosing a birthday present for a friend.
Ideally you want to integrate over… something to do with the set of all tasks. But it’s hard to say what that something would be, let alone how you’re going to meaningfully integrate it.
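To make the sensitivity concrete, here is a rough sketch. It assumes, purely for illustration, that you aggregate per-task speedups with a weighted geometric mean; nothing above says that is the right aggregate. The function name, task labels and weights are made up, and the speedups are just the rough figures from the log-tables example (with 2x standing in for "less than twice as fast").

```python
import math


def aggregate_speedup(speedups, weights):
    """Weighted geometric mean of per-task speedups.

    speedups: task name -> speedup factor (must be > 0)
    weights:  task name -> non-negative weight (need not sum to 1)
    """
    total = sum(weights.values())
    log_mean = sum(w * math.log(speedups[t]) for t, w in weights.items()) / total
    return math.exp(log_mean)


# Illustrative figures only: 1e9 for multiplication, 2.0 as an upper
# bound for novel writing, both taken from the example in the text.
speedups = {"multiply_numbers": 1e9, "write_novel": 2.0}

# Two hypothetical task distributions (assumed, not measured).
arithmetic_heavy = {"multiply_numbers": 0.9, "write_novel": 0.1}
writing_heavy = {"multiply_numbers": 0.1, "write_novel": 0.9}

score_arithmetic = aggregate_speedup(speedups, arithmetic_heavy)
score_writing = aggregate_speedup(speedups, writing_heavy)

print(f"arithmetic-heavy mix: {score_arithmetic:.3g}x")
print(f"writing-heavy mix:    {score_writing:.3g}x")
```

Same technology, same per-task numbers, and the headline score moves by many orders of magnitude depending only on the weights. Swapping in a different aggregate changes the numbers but not the basic problem.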