One of the problems with AI benchmarks is that they can’t effectively be backcast more than a couple of years. This prompted me to wonder if a more empirical benchmark might be something like ‘Ability of a human in conjunction with the best technology available at time t’.
For now at least, humans still need to be in the loop, so this should in principle be at least as good as coding benchmarks for gauging where we are now. When/if humans become irrelevant, it should still work - ‘AI capability + basically nothing’ = ‘AI capability’. And looking back, it gives a much bigger reference class for forecasting future trends, allowing us to compare, e.g.:
Human
Human + paper & pen
Human + log tables + paper & pen
Human + calculator + log tables + paper & pen
Human + computer with C + …
Human + computer with Python + …
Human + ML libraries + …
Human + GPT 1 + …
etc.
Thoughts?
One problem is putting everything on a common scale when historical improvements are so sensitive to the distribution of tasks. A human with a computer with C, compared to a human with just log tables, is a billion times faster at multiplying numbers but less than twice as fast at writing a novel. So the distribution of tasks has to be broad enough to capture the capabilities you care about, yet it must also be possible to measure a baseline score at a low tech level and still leave a wide range of possible scores above it. This would make the benchmark extremely difficult to construct in practice.
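To make the scale-sensitivity concrete, here is a minimal sketch in Python; the task list, speedup factors, and weightings are all invented for illustration, not measurements.

```python
import math

# Hypothetical per-task speedups of "human + computer with C" over
# "human + log tables"; numbers are made up for illustration.
speedups = {
    "multiply 10-digit numbers": 1e9,
    "sort a customer list": 1e6,
    "draft a business letter": 1.5,
    "write a novel": 1.2,
}

def arithmetic_mean(speedups, weights):
    # Weighted arithmetic mean of the per-task speedup factors.
    return sum(weights[t] * s for t, s in speedups.items())

def geometric_mean(speedups, weights):
    # Weighted geometric mean: exp of the weighted mean of log-speedups.
    return math.exp(sum(weights[t] * math.log(s) for t, s in speedups.items()))

# Two equally defensible task distributions.
uniform = {t: 1 / len(speedups) for t in speedups}
mostly_writing = {
    "multiply 10-digit numbers": 0.05,
    "sort a customer list": 0.05,
    "draft a business letter": 0.45,
    "write a novel": 0.45,
}

for name, w in [("uniform", uniform), ("mostly writing", mostly_writing)]:
    print(f"{name:>14}: arithmetic={arithmetic_mean(speedups, w):.3g}, "
          f"geometric={geometric_mean(speedups, w):.3g}")
```

With uniform weights the arithmetic mean comes out around 2.5e8 and the geometric mean around 6,500; shift the weight toward the writing tasks and the geometric mean drops to roughly 7. Same per-task data, wildly different headline ‘improvement’, which is the common-scale problem in miniature.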
I think that’s right, but modern AI benchmarks seem to have much the same issue. A human with a modern Claude instance might be able to write code 100x faster than without one, but is probably less than 2x as fast at choosing a birthday present for a friend.
Ideally you want to integrate over… something like a weighting on the set of all tasks. But it’s hard to say what that weighting would be, let alone how you’d meaningfully integrate against it.
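For concreteness, the object being reached for is presumably something like the schematic below; this is just my notation, and the weighting w and the per-task score s are exactly the pieces that are hard to pin down.

```latex
% Schematic only: composite capability at tech level t as an integral of
% per-task performance against some weighting over the set of all tasks.
\[
  \mathrm{Capability}(t) \;=\; \int_{\mathcal{T}} w(\tau)\,
    s\bigl(\text{human} + \text{tech}_t,\ \tau\bigr)\, d\tau,
  \qquad \int_{\mathcal{T}} w(\tau)\, d\tau = 1,
\]
% where \mathcal{T} is the set of all tasks, w is a weighting over tasks,
% and s is a per-task score already put on some common scale. Choosing w
% and the common scale for s is the whole difficulty.
```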