Metaculus is strongly considering organizing an annual AGI benchmarking event. Once a year, we’d run a benchmark or suite of benchmarks against the most generally intelligent AI systems available to us at the time, seeking to assess their generality and the overall shape of their capabilities. We would publicize the event widely among the AI research, policy, and forecasting communities.
Why?
We think this might be a good idea for several reasons:
The event could provide a convening ground for the AI research community, helping it to arrive at a shared understanding of the current state of AGI research, and acting as a focal point for rational discussion on the future of AGI.
An annual benchmarking event has advantages over static, run-any-time benchmarks when it comes to testing generality. Unless one constrains the training data and restricts the hard-coded knowledge used by systems under evaluation, developers may directly optimize for a static benchmark while building their systems, which makes static benchmarks less useful as measures of generality. With the annual format, we are free to change the tasks every year without informing developers of what they will be beforehand, thereby assessing what François Chollet terms developer-aware generalization.
Frequent feedback improves performance in almost any domain; this event could provide a target for AGI forecasting that yields yearly feedback, allowing us to iterate on our approaches and hone our understanding of how to forecast the development of AGI.
The capabilities of an AGI will not be completely boundless, so it’s interesting to ask what its strengths and limitations are likely to be. If designed properly, our benchmarks could give us clues as to what the “shape” of AGI capabilities may turn out to be.
How?
We’re currently working on a plan, and are soliciting ideas and feedback from the community here. To guide the discussion, here are some properties we think the ideal benchmark should have. It would:
Engage a broad, diverse set of AI researchers, and act as a focal point for rational, forecasting-based discussion on the future of AGI.
Measure the generality and adaptability of intelligent systems, not just their performance on a fixed, known-beforehand set of tasks.
Form the basis for AGI forecasting questions with a one-year lifetime.
Generate predictive signal as to the types of capabilities that an AGI system is likely to possess.
Provide a quantitative measure of generality, ranking more general systems above less general ones, rather than giving a binary “general or not” outcome.
Be sensitive to differences in generality even among the weakly general systems available today.
Once we’ve collected the strongest ideas and developed them into a cohesive whole, we will solicit feedback from the AI research community before publishing the final plan. Thanks for your contributions to the discussion – we look forward to reading and engaging with your ideas!
Background reading
Here are a few resources to get you thinking.
Threads:
An idea based on iteratively crowdsourcing adversarial questions
A discussion on AGI benchmarking
Papers:
On the Measure of Intelligence
What we might look for in an AGI benchmark
General intelligence disentangled via a generality metric for natural and artificial intelligence
Thanks for soliciting public feedback on this. Unfortunately, I’m worried that publicizing this could be net negative, though I’m not very confident in this. My worry is that humans are good at making numbers go up and will be driven by highly publicized benchmarks to chase higher scores, so this event would make capabilities advance faster than they otherwise would, which would be bad.
I certainly realize it could be good to be able to resolve Metaculus forecasts more easily, and that it could be helpful to get more insight into capabilities that might otherwise be hidden from the public. Still, the worry above is my weakly held view, and at least three other people working at or associated with Rethink Priorities feel the same (also with weak confidence) but preferred to keep their views anonymous for now.
Thank you for the feedback. This is an important and valid concern. Similar concerns were raised on the discussion thread over at Metaculus, and we’ve replied with some thoughts there. It’s worth mentioning that I don’t think we should move forward with anything until we’ve carefully considered the consequences – probably using forecasting to help with this – and gotten feedback from several disinterested parties.
I’ve thought a little more, at a very high level, about how an event like this might be designed in order to be beneficial overall, and written the idea up here.