[Part 2] Amplifying generalist research via forecasting – results from a preliminary exploration

This post covers, in detail, the set-up and results from our exploration in amplifying generalist research using predictions. It is accompanied by a second post with a high-level description of the results, and more detailed models of impact and challenges. For an introduction to the project, see that post.


The rest of this post is structured as follows.

First, we cover the basic set-up of the exploration.

Second, we share some results, in particular focusing on the accuracy and cost-effectiveness of this method of doing research.

Third, we briefly go through some perspectives on what we were trying to accomplish and why that might be impactful, as well as challenges with this approach. These are covered more in-depth in a separate post.

Overall, we are very interested in feedback and comments on where to take this next.

Set-up of the experiment

A note on the experimental design

To begin with, we note that this was not an “experiment” in the sense of designing a rigorous methodology with explicit controls to test a particular, well-defined hypothesis.

Rather, this might be seen as an “exploration” [3]. We tested several different ideas at once, instead of running a separate experiment for each. We also intended to uncover new ideas and inspiration as much as to test existing ones.

Moreover, we proceeded in a startup-like fashion where several decisions were made ad hoc. For example, a comparison group was introduced after the first experiment had been completed; this was not originally planned, but later became evidently useful. This came at the cost of worsening the rigor of the experiment.

We think this trade-off was worth it for our situation. This kind of policy allows us to execute a large number of experiments in a shorter amount of time, quickly pivot away from bad ones, and notice low-hanging mistakes and learning points before scaling up good ones. This is especially helpful as we’re shooting for tail-end outcomes, and are looking for concrete mechanisms to implement in practice (rather than publishing particular results).

We do not see it as a substitute for more rigorous studies, but rather as a complement, which might serve as inspiration for such studies in the future.

To prevent this from biasing the data, all results from the experiment are public, and we try to note when decisions were made post hoc.

Mechanism design

The basic set-up of the project is shown in the following diagram, and described below.

A two-sentence version would be:

Forecasters predicted the conclusions that would be reached by Elizabeth Van Nostrand, a generalist researcher, before she conducted a study on the accuracy of various historical claims. We randomly sampled a subset of research claims for her to evaluate, and since we can set that probability arbitrarily low this method is not bottlenecked by her time.

1. Evaluator extracts claims from the book and submits priors

The evaluator for the experiment was Elizabeth Van Nostrand, an independent generalist researcher known for her “Epistemic spot checks”: a series of posts assessing the trustworthiness of a book by evaluating some of its claims. We chose Elizabeth for the experiment as she has a reputation for reliable generalist research, and there was a significant amount of public data about her past evaluations of claims.

She picked 10 claims from the book The Unbound Prometheus: Technological Change and Industrial Development in Western Europe from 1750 to the Present, as well as a meta-claim about the reliability of the book as a whole.

All claims were assigned an importance rating from 1 to 10, based on their relevance to the thesis of the book as a whole. We were interested in finding out whether this would influence how forecasters allocated effort between questions.

Elizabeth also spent 3 minutes per claim submitting an initial estimate (referred to as a “prior”).

Beliefs were typically encoded as distributions over the range 0% to 100%, representing where Elizabeth expected the mean of her posterior credence in the claim to be after 10 more hours of research. For more explanation, see this footnote [4].

2. Forecasters make predictions

Forecasters predicted what they expected Elizabeth to say after ~45 minutes of research on each claim, and wrote comments explaining their reasoning.

Forecasters’ payments for the experiment were proportional to how much their forecasts outperformed the aggregate in estimating her 45-minute distributions. In addition, forecasters were paid a base sum just for participating. You can see all forecasts and comments here, and an interactive tool for visualising and understanding the scoring scheme here.
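The exact formula is given in the linked scoring-scheme tool; as a minimal hedged sketch (function names and the base/bonus split here are our illustration, not the experiment’s actual code), a payment proportional to outperformance of the aggregate might look like:

```python
import math

def relative_log_score(p_forecaster: float, p_aggregate: float) -> float:
    """How much a forecast outperformed the aggregate, in log-score terms.

    `p_forecaster` and `p_aggregate` are the probability (density) each
    assigned to Elizabeth's eventual 45-minute resolution. Positive means
    the forecaster beat the aggregate.
    """
    return math.log(p_forecaster) - math.log(p_aggregate)

def payment(p_forecaster: float, p_aggregate: float,
            base: float = 5.0, rate: float = 10.0) -> float:
    # Base sum for participating, plus a bonus proportional to how much
    # the forecast outperformed the aggregate (illustrative constants).
    return base + rate * relative_log_score(p_forecaster, p_aggregate)
```

Under this toy scheme, a forecaster who assigned twice the probability of the aggregate to the eventual resolution would earn a bonus of `rate * ln(2)` on top of the base payment.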

A key part of the design was that forecasters did not know which questions Elizabeth would randomly sample to evaluate. Hence they were incentivised to do their best on all questions (weighted by importance). This has the important implication that we could easily extend the number of questions predicted by forecasters: even if Elizabeth can only judge 10 claims, we could have forecasting questions for 100 different claims [5].

Two groups of forecasters participated in the experiment: one based on a mailing list of people interested in participating in forecasting experiments (recruited from effective altruism-adjacent events and other forecasting platforms) [6], and one recruited from Positly, an online platform for crowdworkers. The former group is here called “network-adjacent forecasters” and the latter “online crowdworkers”.

3. The evaluator judges the claims

Elizabeth was given a time budget of 6 hours, within which she randomly sampled claims to research and judge.

At this point, we wanted to use the work done by forecasters to help Elizabeth, while avoiding anchoring and biasing her with their estimates.

To solve this, Elizabeth was initially given a filtered version of the comments section for each claim, which contained all sources and models used, but which had been stripped of any explicit predictions or subjective opinions generalising from the data.

For example, for the question:

Pre-Industrial Britain had a legal climate more favorable to industrialization than continental Europe [5].

one commenter wrote:

Seems more likely to be true than not. The English Civil War and Glorious Revolution both significantly curtailed the arbitrary power of the monarch/gentry and raised the power of merchants in Britain, making it likely that government was more favourable to mercantile interests. Hard to judge the claim about haggling.

And in Elizabeth’s initial briefing this was replaced by:

The English Civil War and Glorious Revolution both significantly curtailed the arbitrary power of the monarch/gentry and raised the power of merchants in Britain [...].

After arriving at a final estimate, Elizabeth was allowed to look at the full forecaster comments and predictions and optionally change her mind. In practice, she didn’t change her mind in any of these cases.

To summarise, the parts involved were:

  • We summarised all of the relevant comments into a list and removed any descriptions that referred to people’s probabilities.

  • We randomly chose 8 of the 10 claims for review by Elizabeth.

  • Elizabeth saw this list, and spent 6 hours evaluating the 8 claims and resolving them with probability distributions of the values she expected to have for them in a future possible evaluation round. She divided this time according to what seemed most useful; for instance, questions with answers that became obvious quickly got a relatively small proportion of this time.

  • Elizabeth got access to all predictions and all comments and was allowed to change her resolutions. She decided not to in every case.

4. The evaluator double-checks the evaluations

After having spent 6 hours researching 8 claims, Elizabeth randomly sampled two of those, each of which she spent an additional 3 hours researching. For the remaining claims, she waited until a week after the experiment, then reread her notes and submitted new resolutions, to see if her way of converting beliefs into numbers was consistent over time. This part was intended to test the consistency and reliability of Elizabeth’s evaluations.

The outcome was that Elizabeth appeared highly consistent and reliable. You can see the data and graphs here. Elizabeth’s full notes explaining her reasoning in the evaluations can be found here.

Results and analysis

You can find all the data and interactive tools for exploring it yourself here.

Online crowdworkers

We were interested in comparing the performance of our pool of forecasters to “generic” participants with no prior interest or experience in forecasting.

Hence, after the conclusion of the original experiment, we reran a slightly modified form of the experiment with a group of forecasters recruited through an online platform that sources high-quality crowdworkers (who perform microtasks like filling out surveys or labeling images for machine learning models).

However, it should be mentioned that these forecasters were operating under a number of disadvantages relative to the other participants, which means we should be careful when interpreting their performance. In particular:

  • They did not know that Elizabeth was the researcher who created the claims and would resolve them, and so they had less information with which to model the person whose judgments would ultimately decide the questions.

  • They did not use any multimodal or custom distributions, which are a way to increase tail-uncertainty and avoid large losses when forecasting with distributions. We expect this was because of the time constraints set by their payment, as well as the general difficulty.

Overall, the experiment with these online crowdworkers produced poor accuracy results at predicting Elizabeth’s resolutions (as is discussed further below).

Accuracy of predictions

This section analyses how well forecasters performed, collectively, in amplifying Elizabeth’s research.

The aggregate prediction was computed as the average of all forecasters’ final predictions. Accuracy was measured using a version of the logarithmic scoring rule.
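As a hedged sketch of these two steps (a binary simplification with toy numbers; the experiment scored full distributions, which is slightly more involved):

```python
import math

def aggregate(probabilities):
    """Simple average aggregate of the forecasters' final predictions."""
    return sum(probabilities) / len(probabilities)

def log_score(p_assigned_to_outcome: float) -> float:
    """Logarithmic scoring rule: the log of the probability (density)
    assigned to the outcome that actually occurred.
    A perfect score is 0; more negative is worse."""
    return math.log(p_assigned_to_outcome)

# Three toy forecasters on a claim that resolves true:
final_predictions = [0.9, 0.7, 0.5]
p_agg = aggregate(final_predictions)  # 0.7
score = log_score(p_agg)
```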

The following graph shows how the aggregate performed on each question:

The opaque bars represent the scores from the crowdworkers, and the translucent bars, which have higher scores throughout, represent the scores from the network-adjacent forecasters. It’s interesting that the ordering is preserved, that is, that the relative question difficulty was the same for both groups. Finally, we don’t see any correlation between question difficulty and the importance weights Elizabeth assigned to the questions.

However, the comparison is confounded by the fact that more effort was spent by the network-adjacent forecasters. The above graph also doesn’t compare performance to Elizabeth’s priors. Hence we also plot the evolution of the aggregate score over prediction number and time (the first data point in the graphs below represents Elizabeth’s prior):

For the last graph, the y-axis shows the score on a logarithmic scale, and the x-axis shows how far along the experiment is. For example, 14 out of 28 days would correspond to 50%. The thick lines show the average score of the aggregate prediction, across all questions, at each time point. The shaded areas show the standard error of the scores, so the graph might be interpreted as a guess of how the two communities would predict a random new question [10].

One of our key takeaways from the experiment is that the simple average aggregation algorithm performed surprisingly well, but only for the network-adjacent forecasters.

One way to see this qualitatively is by observing the graphs below, where we display Elizabeth’s priors, the final aggregate of the network-adjacent forecasters, and the final resolution, for a subset of questions [11].

Question examples

The x-axis [12] refers to Elizabeth’s best estimate of the accuracy of a claim, from 0% to 100% (see section “Mechanism design, 1. Evaluator extracts claims” for more detail).

Another way to understand the performance of the aggregate is to note that the aggregate of network-adjacent forecasters had an average log score of −0.5. To get a rough sense of what that means, it’s the score you’d get by being 70% confident in a binary event and being correct (though note that this binary comparison merely serves to provide intuition; there are technical details making the comparison to a distributional setting a bit tricky).

By comparison, the crowdworkers and Elizabeth’s priors had a very poor log score of around −4. This is roughly the score you’d get if you predict an event to be ~5% likely, and it still happens.
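These rough correspondences can be checked directly. The stated numbers match base-2 logarithms (an assumption on our part; the qualitative point holds in any base):

```python
import math

def log2_score(p_assigned_to_outcome: float) -> float:
    # Log score in bits: 0 is perfect, more negative is worse.
    return math.log2(p_assigned_to_outcome)

confident_and_right = log2_score(0.70)  # about -0.51, close to the aggregate's -0.5
badly_surprised = log2_score(0.05)      # about -4.32, close to the -4 of the priors
```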


High-level observations

This experiment was run to get a sense of whether forecasters could do a competent job of forecasting the work of Elizabeth (i.e. as an “existence proof”). It was not meant to show cost-effectiveness, which could involve many efficiency optimizations not yet undertaken. However, we realized that the network-adjacent forecasting may have been reasonably cost-effective, and think that a cost-effectiveness analysis of this work could provide a baseline for future investigations.

To compute the cost-effectiveness of doing research using amplification, we look at two measures: the information gain from predictors relative to the evaluator, and the cost of predictors relative to the evaluator.

Benefit/cost ratio = (% information gain provided by forecasters relative to the evaluator) / (% cost of forecasters relative to the evaluator)

If a benefit/cost ratio of significantly over 1 can be achieved, then forecasting could be useful to partially augment or replace established evaluators.

Under these circumstances, each unit of resources invested in gaining information from forecasters has higher returns than just asking the evaluator directly.

Some observations about this:

First, note that this does not require forecasters to be as accurate as the evaluator. For example, if they only provide 10% as much value, but at 1% of the opportunity cost, this is still a good return on investment.
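To make that arithmetic concrete, here is the ratio as a tiny calculation (names are ours):

```python
def benefit_cost_ratio(value_gain_pct: float, cost_pct: float) -> float:
    """(% information gain provided by forecasters relative to the evaluator)
    divided by (% cost of forecasters relative to the evaluator)."""
    return value_gain_pct / cost_pct

# Forecasters provide only 10% of the evaluator's value, but at 1% of
# the opportunity cost: still a 10x return on each unit of resources.
ratio = benefit_cost_ratio(10.0, 1.0)
```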

Second, amplification can still be worthwhile even if the benefit/cost ratio is < 1. In particular:

  1. Forecasters can work in parallel and hence answer a much larger number of questions, within a set time frame, than would be feasible for some evaluators.

  2. Pre-work by forecasters might also improve the speed and quality of the evaluator’s work, if she has access to their research [13].

  3. Having a low benefit/cost ratio can still serve as an existence proof that amplification of generalist research is possible, as long as the benefit is high. One might then run further optimised tests which try harder to reduce cost.

The opportunity cost is computed using Guesstimate models linked below, based on survey data collected from participants after the experiment. We attempted to include both the hourly value of time and the value of altruistic externalities. We did not include the time that our own team spent figuring out and organising this work.

For example, the estimated cost ratio for the network-adjacent forecasters in this experiment was 120%, meaning that the cost of obtaining a final aggregate prediction for a question was 20% higher when asking this group of 19 forecasters than when asking Elizabeth directly, all things considered.

The value is computed using the following model (interactive calculation linked below). We assume Elizabeth is an unbiased evaluator, and so the true value of a question is the mean of her resolution distribution. We then treat this point estimate as the true resolution, and compare to it the scores of Elizabeth’s resolution, had it been a prediction, vs. her initial prior; and the final aggregate vs. her initial prior. All scores are weighted by the importance of the question, as assigned by Elizabeth on a 1-10 scale [14].
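A hedged sketch of our reading of that model (the actual computation lives in the Guesstimate models linked below; the names and exact weighting here are our reconstruction, not the experiment’s code):

```python
import math

def weighted_gain(predictions, priors, importances):
    """Importance-weighted log-score improvement of a set of predictions
    over the priors, scored against the resolution mean treated as truth."""
    return sum(w * (math.log(p) - math.log(q))
               for w, p, q in zip(importances, predictions, priors))

def value_ratio(aggregate, resolutions, priors, importances):
    # How much of the gain Elizabeth's own resolutions provide over her
    # priors is captured by the forecasters' final aggregate.
    return (weighted_gain(aggregate, priors, importances)
            / weighted_gain(resolutions, priors, importances))

# One toy question, importance 5: prior 0.5, aggregate 0.7, resolution 0.9.
ratio = value_ratio([0.7], [0.9], [0.5], [5])
```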

Results were as follows.

(Links to models: network-adjacent cost ratio and value ratio, online crowdworker cost ratio and value ratio.)

The negative value ratio for the control group (the online crowdworkers) indicates that they assigned a lower probability to the mean of Elizabeth’s resolution than she herself did when submitting her prior. Hence just accepting the means from those forecasts would have made us worse off, epistemically, than trusting the priors.

This observation is in tension with some of the above graphs, which show a tiny increase in average log score between crowdworkers and Elizabeth’s priors. We are somewhat uncertain about the reason for this, though we think it is as follows: they were worse at capturing the resolution means than the prior, but they were sometimes better at capturing the resolution distribution (likely because averaging them added more uncertainty). And the value ratio only measures the former of those improvements.

Another question to consider when thinking about cost-effectiveness is diminishing returns. The following graph shows how the information gain from additional predictions diminished over time.

The x-axis shows the number of predictions after Elizabeth’s prior (which would be prediction number 0). The y-axis shows how much closer to a perfect score each prediction moved the aggregate, as a percentage of the distance between the previous aggregate and the perfect log score of 0 [15].
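That y-axis transformation can be sketched as follows (names are ours):

```python
def fraction_of_gap_closed(prev_score: float, new_score: float) -> float:
    """How much closer the aggregate moved to a perfect log score of 0,
    as a fraction of the previous distance. Log scores are <= 0, so a
    positive result means the new prediction improved the aggregate."""
    return (new_score - prev_score) / (0.0 - prev_score)

# Moving the aggregate log score from -2.0 to -1.0 closes half the gap:
improvement = fraction_of_gap_closed(-2.0, -1.0)  # 0.5
```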

We observe that for the network-adjacent forecasters, the majority of value came from the first two predictions, while the online crowdworkers never reliably reduced uncertainty. Several hypotheses might explain this, including that:

  • The first predictor on most questions was also one of the best participants in the experiment

  • Most of the value of the predictors came from increasing uncertainty, and after averaging 2-3 distributions we had already gotten most of that effect

  • Later participants were anchored by the clearly visible current aggregate and prior predictions

Future experiments might attempt to test these hypotheses.

Perspectives on impact and challenges

This section summarises some different perspectives on what the current experiment is trying to accomplish and why that might be exciting, as well as some of the challenges it faces. To keep things manageable, we simply give a high-level overview here and discuss each point in more detail in a separate post.

There are several perspectives here, given that the experiment was designed to explore multiple relevant ideas, rather than to test a particular, narrow hypothesis.

As a result, the current design is not optimising very strongly for any of these possible uses, and it is also plausible that its impact and effectiveness will vary widely between uses.

Perspectives on impact

  • Mitigating capacity bottlenecks. The effective altruism and rationality communities face rather large bottlenecks in many areas, such as allocating funding, delegating research, vetting talent and reviewing content. The current setup might provide a means of mitigating some of those: a scalable mechanism for outsourcing intellectual labor.

  • A way for intellectual talent to build and demonstrate their skills. Even if this set-up can’t make new intellectual progress, it might be useful to have a venue where junior researchers can demonstrate their ability to predict the conclusions of senior researchers. This might provide an objective signal of epistemic abilities that does not depend on detailed social knowledge.

  • Exploring new institutions for collaborative intellectual progress. Academia has a vast backlog of promising ideas for institutions to help us think better in groups. Currently we seem bottlenecked by practical implementation and product development.

  • Getting more data on empirical claims made by the Iterated Amplification AI alignment agenda. These ideas inspired the experiment. (However, our aim was more practical and short-term, rather than looking for theoretical insights useful in the long term.)

  • Exploring forecasting with distributions. Little is known about humans forecasting with full distributions rather than point estimates (e.g. “79%”), partly because there hasn’t been easy tooling for such experiments. This experiment gave us some cheap data on this question.

  • Forecasting fuzzy things. A major challenge with forecasting tournaments is the need to concretely specify questions, in order to clearly determine who was right and allocate payouts. The current experiment tries to get the best of both worlds: the incentive properties of forecasting tournaments and the flexibility of generalist research in tackling more nebulous questions.

  • Shooting for unknown unknowns. In addition to being an “experiment”, this project is also an “exploration”. We have an intuition that there are interesting things to be discovered at the intersection of forecasting, mechanism design, and generalist research. But we don’t yet know what they are.

Challenges and future experiments

  • Complexity and unfamiliarity of the experiment. The current experiment had many technical moving parts. This makes it challenging to understand for both participants and potential clients who want to use it in their own organisations.

  • Trust in evaluations. The extent to which these results are meaningful depends on your trust in Elizabeth Van Nostrand’s ability to evaluate questions. We think this is partly an inescapable problem, but we also expect clever mechanisms and more transparency to be able to make large improvements.

  • Correlations between predictions and evaluations. Elizabeth had access to a filtered version of the forecaster comments when she made her evaluations. This introduces a potential source of bias and a “self-fulfilling prophecy” dynamic in the experiments.

  • Difficulty of converting mental models into quantitative distributions. It’s hard to turn nuanced mental models into numbers. We think a solution is to have a “division of labor”, where some people just build models/write comments and others focus on quantifying them. We’re working on incentive schemes that work in this context.

  • Anti-correlation between importance and “outsourceability”. The intellectual questions which are most important to answer might be different from the ones that are easiest to outsource, in a way which leaves very little value on the table in outsourcing.

  • Overhead of question generation. Creating good forecasting questions is hard and time-consuming, and better tooling is needed to support this.

  • Overly competitive scoring rules. Prediction markets and tournaments tend to be zero-sum games, with negative incentives for helping other participants or sharing best practices. To solve this we’re designing and testing improved scoring rules which directly incentivise collaboration.


[1] Examples include: AI alignment, global coordination, macrostrategy and cause prioritisation.

[2] We chose the industrial revolution as a theme since it seems like a historical period with many lessons for improving the world. It was a time of radical change in productivity along with many societal transformations, and might hold lessons for future transformations and our ability to influence those.

[3] Some readers might also prefer the terms “integration experiment” and “sandbox experiment”.

[4] In traditional forecasting tournaments, participants state their beliefs about a binary event (e.g. “Will team X win this basketball tournament?”) using a number between 0% and 100%. This is referred to as a credence, and it captures their uncertainty in a quantitative way. The terminology comes from Bayesian probability theory, where rational agents are modelled as assigning credences to claims and then updating those credences on new information, in a way uniquely determined by Bayes’ rule. However, as humans, we might not always be sure what the right credence for a claim is. If I had unlimited time to think, I might arrive at the right number. (This is captured by the “after 10 more hours of research” clause.) But if I don’t have a lot of time, I have some uncertainty about exactly how uncertain I should be. This is reflected in our use of distributions.

[5] In scaling the number of claims beyond what Elizabeth can evaluate, we would also have to proportionally increase the rewards.

[6] Many of these participants had previous experience with forecasting, and some were “superforecaster-equivalents” in terms of their skill. Others had less experience with forecasting but were competent in quantitative reasoning. For future experiments, we ought to survey participants about their previous experience.

[7] The payments were doubled after we had seen the results, as the initial scoring scheme proved too harsh on forecasters.

[8] The incentive schemes looked somewhat different between groups, mostly owing to the fact that we tried to reduce the complexity necessary to understand the experiment for the online crowdworkers, who to our knowledge had no prior experience with forecasting. They were each paid at a rate of ~$15 an hour, with the opportunity for the top three forecasters to receive a bonus of $35.

[9] Elizabeth did this by copying the claims into a Google doc, numbering them, and then using Google’s random number generator to pick claims. For a future scaled-up version of the experiment, one could use the public randomness beacon as a transparent and reproducible way to sample claims.

[10] In analysing the data we also plotted 95% confidence intervals by multiplying the standard error by 1.96. In that graph the two lines intersect for something like 80%-90% of the x-axis. You can plot and analyse them yourself here.

[11] We only display the first four resolutions (which were randomly chosen in the course of the experiment), to avoid taking up too much space. All resolution graphs can be found here.

[12] The distributions are calculated using Monte Carlo sampling and kernel smoothing, so they are not perfectly smooth. This also led to errors around bounds being outside of the 0 to 100 range.

[13] For this experiment, Elizabeth informally reports that the time saved ranged from 0-60 minutes per question, but she did not keep the kind of notes required to estimate an average.

[14] This is a rough model of calculating this, and we can imagine there being better ways of doing it. Suggestions are welcome.

[15] Using this transformation allows us to visualise the fact that smaller score improvements obtained later in the contest can still be as impressive as earlier ones. For example, moving from 90% confidence to 99% confidence takes roughly as much evidence as moving from 50% to 90% confidence. Phrased in terms of odds ratios, both updates involve evidence of strength roughly 10:1.
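The odds-ratio claim in this footnote can be checked directly with a couple of lines:

```python
def odds(p: float) -> float:
    return p / (1.0 - p)

def evidence_strength(p_before: float, p_after: float) -> float:
    """The Bayes factor (odds ratio) needed to move a credence
    from p_before to p_after."""
    return odds(p_after) / odds(p_before)

late_update = evidence_strength(0.90, 0.99)   # 99/9 = 11, roughly 10:1
early_update = evidence_strength(0.50, 0.90)  # 9/1 = 9, also roughly 10:1
```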

Participate in future experiments or run your own

Foretold.io was built as an open platform to enable more experimentation with prediction-related ideas. We have also made the data and analysis calculations from this experiment publicly available.

If you’d like to:

  • Run your own experiments on other questions

  • Do additional analysis on this experimental data

  • Use an amplification set-up within your organisation

we’d be happy to consider providing advice, operational support, and funding for forecasters. Just comment here or reach out to this email.

If you’d like to participate as a forecaster in future prediction experiments, you can sign up here.


Funding for this project was provided by the Berkeley Existential Risk Initiative and the EA Long-term Future Fund.

We thank Beth Barnes and Owain Evans for helpful discussion.

We are also very thankful to all the participants.