AI Forecasting Question Database (Forecasting infrastructure, part 3)

This post introduces an open-source database of 76 questions about AI progress, together with detailed resolution conditions, categorisations and several spreadsheets compiling outside views and data, as well as learning points about how to write good AI forecasting questions. It is the third part in a series of blog posts which motivate and introduce pieces of infrastructure intended to improve our ability to forecast novel and uncertain domains like AI.

Background and motivation

Through our work on AI forecasting over the past year, we've tried to write many questions that track important facets of progress, and gained some experience in how hard this is and how to do it better.

In doing this, we've found that most previous public attempts at writing AI forecasting questions (e.g. this and this) fall prey to several failure modes that worsen the signal of the questions, as did many of the questions we wrote ourselves. Overall, we think operationalisation is an unsolved problem, and this has been the impetus behind the work described in this sequence of posts on forecasting infrastructure.

A great existing resource on this topic is Allan Dafoe's AI Governance Research Agenda, which has an appendix with forecasting desiderata (pages 52-53). This blog post complements that agenda by adding a large number of concrete examples.

We begin by categorising and giving examples of some ways in which technical forecasting questions can fail to capture the important, intended uncertainty. (Note that the examples below are not fully fleshed-out questions, in order to allow for easier reading.) We then describe the question database we're open-sourcing.

Ambiguity

Terms that can have many different meanings, such as “AGI” or “hardcoded knowledge”.

Underspecification

Resolution criteria that neglect to specify how the question should be resolved in some possible scenarios.

Examples

This question resolves positively if an article in a reputable journal finds that commercially-available automated speech recognition is better than human speech recognition (in the sense of having a lower transcription error rate).

If a journal article is published which finds that commercially-available automated speech recognition is better in only some domains (e.g. for HR meetings), but worse in most other domains, it is unclear from the resolution criteria how this question should be resolved.

Spurious resolution

Edge-case resolutions that technically satisfy the description as written, yet fail to capture the intention of the question.

Examples

Positively resolving the question:

Will there be an incident causing the loss of human life in the South China Sea (a highly politically contested sea in the Pacific Ocean) by 2018?

by having a battleship accidentally run over a fishing boat. (This is adapted from an actual question used by the Good Judgment Project.)

Positively resolving the question:

Will an AI lab have been nationalized by 2024?

by the US government nationalising GM for auto-manufacturing reasons, yet GM nonetheless having a self-driving car research division.

Trivial pathways

Most of the variance in the forecasting outcome of the question is driven by an unrelated causal pathway to resolution, which “screens off” the intended pathways.

A question which avoids resolution by trivial pathways is roughly what Allan Dafoe calls an “accurate indicator”:

“We want [AI forecasting questions] to be accurate indicators, as opposed to noisy indicators that are not highly correlated with the important events. Specifically, where E is the occurrence or near occurrence of some important event, and Y is whether the target has been reached, we want P(not Y | not E) ~ 1, and P(Y | E) ~ 1. An indicator may fail to be informative if it can be “gamed” in that there are ways of achieving the indicator without the important event being near. It may be a noisy indicator if it depends on otherwise irrelevant factors, such as whether a target happens to take on symbolic importance as the focus of research.”
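Restating the quoted criterion in symbols (our gloss, not Dafoe's exact notation): writing E for the important event occurring or nearly occurring, and Y for the target being reached, an accurate indicator needs both conditional probabilities below to be close to one. The first failing corresponds to the indicator being “gamed”; the second to it being noisy.

```latex
% Gloss on the quoted accurate-indicator conditions (not Dafoe's exact notation)
\[
  \underbrace{P(\neg Y \mid \neg E) \approx 1}_{\text{hard to ``game'': target rarely hit unless the event is near}}
  \qquad\text{and}\qquad
  \underbrace{P(Y \mid E) \approx 1}_{\text{not noisy: target rarely missed when the event is near}}
\]
```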

Examples

Forecasting

When will there be a superhuman Angry Birds agent using no hardcoded knowledge?

and realizing that there seems to be little active interest in the yearly benchmark competition (with performance even declining over the years). This means that the probability entirely depends on whether anyone with enough money and competence decides to work on the problem, as opposed to which key components make Angry Birds difficult (e.g. physics-based simulation) and how fast progress is in those domains.

Forecasting

How sample-efficient will the best Dota 2 RL agent be in 2020?

by analyzing OpenAI’s decision on whether or not to build a new agent, rather than the underlying difficulty and progress of RL in partially observable, high-dimensional environments.

Forecasting

Will there have been a 2-year interval where the amount of training compute used in the largest AI experiment did not grow 32x, before 2024?

by analyzing how often big experiments are run, rather than what the key drivers of the trend are (e.g. parallelizability and compute economics) and how they will change in the future.
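As a sanity check on the 32x parameter (our arithmetic, not part of the original question text): 32x over two years is five doublings, i.e. an average doubling time of roughly 4.8 months over the interval, so the question effectively asks whether the trend ever slows below that pace for two years, compared with the ~3.4-month doubling time reported in OpenAI's “AI and Compute” analysis.

```latex
% Our arithmetic, not part of the original question text
\[
  32 = 2^{5}
  \quad\Rightarrow\quad
  \text{implied average doubling time} \;=\; \frac{24~\text{months}}{5} \;=\; 4.8~\text{months}
\]
```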

Failed parameter tuning

Any free variables are set to values which do not maximise uncertainty.

Examples

Having the answer to:

When will 100% of jobs be automatable at a performance close to the median employee and cost <10,000x of that employee?

be very different from the answer when using the parameter 99.999%, as certain edge-case jobs (“underwater basket weaving”) might be surprisingly hard to automate, but for uninteresting reasons.

Similarly, asking:

Will global investment in AI R&D be <$100 trillion in 2021?

is not interesting, even though asking about values in the range of ~$30B to ~$1T might have been.

Non-incentive-compatible questions

Questions where the answer that would score highest on some scoring metric is different from the forecaster’s true belief.

Examples

Forecasting “Will the world end in 2024?” as 1% (or whatever else is the minimum for the given platform), because in any world where a higher number would pay off, you wouldn’t be around to cash out the rewards of your calibration.
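To make this concrete, here is a minimal sketch (ours, not from the post) using the Brier score as the scoring metric. Under a proper scoring rule the expected penalty is minimised by reporting your true belief, but if the reward can only ever be collected in worlds where the event does not happen, the platform minimum scores best regardless of what you actually believe; the true_p value below is purely illustrative.

```python
import numpy as np

def expected_brier_penalty(report, true_p):
    """Expected Brier penalty if the score is always collected (lower is better)."""
    return true_p * (report - 1) ** 2 + (1 - true_p) * report ** 2

def collectible_expected_penalty(report, true_p):
    """Expected penalty restricted to worlds where you survive to collect it.
    Only the 'world does not end' branch (outcome = 0) ever pays out,
    so the forecaster's true belief drops out of the calculation entirely."""
    return report ** 2

reports = np.linspace(0.01, 0.99, 99)   # allowed forecasts, 1% to 99%
true_p = 0.10                           # illustrative 'true belief' in the event

honest_optimum = reports[np.argmin(expected_brier_penalty(reports, true_p))]
selfish_optimum = reports[np.argmin(collectible_expected_penalty(reports, true_p))]

print(honest_optimum)   # ~0.10: a proper scoring rule rewards reporting the true belief
print(selfish_optimum)  # 0.01: the platform minimum wins once unpayable worlds drop out
```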

Question database

We’re releasing a database of 76 questions about AI progress, together with detailed resolution conditions, categorisations and several spreadsheets compiling outside views and data.

We make these questions freely available for use by any forecasting project (under an open-source MIT license).

The database has grown organically through our work on Metaculus AI, and several of the questions have associated quantitative forecasts and discussion on that site. The resolution conditions have often been honed and improved through interaction with forecasters. Moreover, where questions stem from elsewhere, such as the BERI open-source question set or the AI Index, we’ve often spent a substantial amount of time improving resolution conditions that were lacking.

Of particular interest might be the column “Robustness”, which tracks our overall estimate of the quality of each question: that is, the extent to which it avoids suffering from the failure modes listed above. For example, the question:

By 2021, will a neural model reach >=70% performance on a high school mathematics exam?

is based on a 2019 DeepMind paper, and is on the face of it liable to several of the failure modes above. Yet its robustness is rated as “high”, since we have specified a detailed resolution condition:

By 2021, will there EITHER…
1. … be a credible report of a neural model with a score of >=70% on the task suite used in the 2019 DeepMind paper…
2. OR be judged by a council of experts that it’s 95% likely such a model could be implemented, were a sufficiently competent lab to try…
3. OR be a neural model with performance on another benchmark judged by a council of experts to be equally impressive, with 95% confidence?

The question can still fail in winner’s curse/Goodharting-style cases, where the best-performing algorithm on a particular benchmark overestimates progress in that domain, simply because selecting for benchmark performance also selects for overfitting to the benchmark as opposed to mastering the underlying challenge. We don’t yet have a good default way of resolving such questions in a robust manner.

How to contribute to the database

We welcome contributions to the database. Airtable (the software where we’re hosting it) doesn’t allow for comments, so if you have a list of edits/additions you’d like to make, please email hello@parallelforecast.com and we can make you an editor.