Annual AGI Benchmarking Event


Metaculus is strongly considering organizing an annual AGI benchmarking event. Once a year, we’d run a benchmark or suite of benchmarks against the most generally intelligent AI systems available to us at the time, seeking to assess their generality and the overall shape of their capabilities. We would publicize the event widely among the AI research, policy, and forecasting communities.

Why?

We think this might be a good idea for several reasons:

The event could provide a convening ground for the AI research community, helping it to arrive at a shared understanding of the current state of AGI research, and acting as a focal point for rational discussion on the future of AGI.
An annual benchmarking event has advantages over static, run-any-time benchmarks when it comes to testing generality. Unless one constrains the training data and restricts the hard-coded knowledge used by systems under evaluation, developers may directly optimize for a static benchmark while building their systems, which makes static benchmarks less useful as measures of generality. With the annual format, we are free to change the tasks every year without telling developers in advance what they will be, thereby assessing what François Chollet terms developer-aware generalization: the ability to handle situations that neither the system nor its developers have encountered before.
Frequent feedback improves performance in almost any domain; this event could provide a target for AGI forecasting that yields yearly feedback, allowing us to iterate on our approaches and hone our understanding of how to forecast the development of AGI.
The capabilities of an AGI will not be completely boundless, so it’s interesting to ask what its strengths and limitations are likely to be. If designed properly, our benchmarks could give us clues as to what the “shape” of AGI capabilities may turn out to be.

How?

We’re currently working on a plan, and are soliciting ideas and feedback from the community here. To guide the discussion, here are some properties we think the ideal benchmark should have. It would:

Engage a broad, diverse set of AI researchers, and act as a focal point for rational, forecasting-based discussion on the future of AGI.
Measure the generality and adaptability of intelligent systems, not just their performance on a fixed, known-beforehand set of tasks.
Form the basis for AGI forecasting questions with a one-year lifetime.
Generate predictive signal as to the types of capabilities that an AGI system is likely to possess.
Provide a quantitative measure of generality, ranking more general systems above less general ones, rather than giving a binary “general or not” outcome.
Be sensitive to differences in generality even among the weakly general systems available today.
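
To make the "quantitative measure of generality" property concrete, here is a minimal sketch of one way such a score could work. It is purely illustrative and not a proposed Metaculus method: the function name `generality_score`, the task names, the per-task scores, and the choice of a floored geometric mean are all assumptions made for this example. The idea is that a geometric mean over per-task scores rewards breadth, so a system that is mediocre everywhere can rank above one that excels at a few tasks and fails the rest.

```python
import math


def generality_score(task_scores, floor=0.01):
    """Hypothetical generality score: geometric mean of per-task scores.

    task_scores maps a task name to a score in [0, 1]. Scores are clamped
    to [floor, 1] so that a single zero does not send log() to -infinity.
    """
    if not task_scores:
        raise ValueError("need at least one task score")
    logs = [math.log(max(floor, min(1.0, s))) for s in task_scores.values()]
    return math.exp(sum(logs) / len(logs))


def arithmetic_mean(task_scores):
    """Plain average, shown only for comparison with the geometric mean."""
    return sum(task_scores.values()) / len(task_scores)


# Made-up scores on four made-up held-out tasks (assumptions, not real data).
systems = {
    "narrow_specialist": {"language": 1.0, "vision": 0.9, "planning": 0.02, "math": 0.02},
    "weak_generalist": {"language": 0.40, "vision": 0.35, "planning": 0.30, "math": 0.35},
}

# Rank systems from most to least general under this hypothetical metric.
ranking = sorted(systems, key=lambda name: generality_score(systems[name]), reverse=True)

for name in ranking:
    scores = systems[name]
    print(f"{name}: generality={generality_score(scores):.3f}, "
          f"mean={arithmetic_mean(scores):.3f}")
```

In this toy data the specialist has the higher arithmetic mean, yet the weak generalist ranks first on the geometric-mean score. That is the kind of graded, non-binary ordering the property above asks for, and it still separates systems that are only weakly general. Many other aggregation rules would be equally defensible.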

Once we’ve collected the strongest ideas and developed them into a cohesive whole, we will solicit feedback from the AI research community before publishing the final plan. Thanks for your contributions to the discussion – we look forward to reading and engaging with your ideas!

Background reading

Here are a few resources to get you thinking.

Threads:

An idea based on iteratively crowdsourcing adversarial questions

A discussion on AGI benchmarking

Papers:

On the Measure of Intelligence

What we might look for in an AGI benchmark

General intelligence disentangled via a generality metric for natural and artificial intelligence