Evidence on good forecasting practices from the Good Judgment Project: an accompanying blog post

This is a linkpost re­pro­duc­ing this AI Im­pacts blog post. This post is it­self a longer and richer ver­sion of the con­cise sum­mary given in this AI Im­pacts page. Read­ers who don’t have time to read the post are en­couraged to read the page in­stead or at least first.

Figure 0: The “four main de­ter­mi­nants of fore­cast­ing ac­cu­racy.” 1

Ex­pe­rience and data from the Good Judg­ment Pro­ject (GJP) provide im­por­tant ev­i­dence about how to make ac­cu­rate pre­dic­tions. For a con­cise sum­mary of the ev­i­dence and what we learn from it, see this page. For a re­view of Su­perfore­cast­ing, the pop­u­lar book writ­ten on the sub­ject, see this blog.

This post ex­plores the ev­i­dence in more de­tail, draw­ing from the book, the aca­demic liter­a­ture, the older Ex­pert Poli­ti­cal Judg­ment book, and an in­ter­view with a su­perfore­caster. Read­ers are wel­come to skip around to parts that in­ter­est them:

1. The experiment

IARPA ran a fore­cast­ing tour­na­ment from 2011 to 2015, in which five teams plus a con­trol group gave prob­a­bil­is­tic an­swers to hun­dreds of ques­tions. The ques­tions were gen­er­ally about po­ten­tial geopoli­ti­cal events more than a month but less than a year in the fu­ture, e.g. “Will there be a vi­o­lent in­ci­dent in the South China Sea in 2013 that kills at least one per­son?” The ques­tions were care­fully cho­sen so that a rea­son­able an­swer would be some­where be­tween 10% and 90%.2 The fore­casts were scored us­ing the origi­nal Brier score—more on that in Sec­tion 2.3

The win­ning team was the GJP, run by Philip Tet­lock & Bar­bara Mel­lers. They re­cruited thou­sands of on­line vol­un­teers to an­swer IARPA’s ques­tions. Th­ese vol­un­teers tended to be males (83%) and US cit­i­zens (74%). Their av­er­age age was forty. 64% of re­spon­dents held a bach­e­lor’s de­gree, and 57% had post­grad­u­ate train­ing.4

GJP made their official predictions by aggregating and extremizing the predictions of their volunteers.5 They identified the top 2% of predictors in their pool of volunteers each year, dubbing them “superforecasters,” and put them on teams in the next year so they could collaborate on special forums. They also experimented with a prediction market, and they ran an RCT to test the effect of a one-hour training module on forecasting ability. The module included content about probabilistic reasoning, using the outside view, avoiding biases, and more. Attempts were made to find out which parts of the training were most helpful; see Section 4.
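To make the aggregate-then-extremize idea concrete, here is a minimal sketch. It is not GJP's actual algorithm: the weights, the function names, and the particular transform p^a / (p^a + (1-p)^a) with exponent a are all illustrative assumptions.

```python
def weighted_average(probs, weights):
    """Weighted mean of individual probability forecasts for one binary question."""
    if len(probs) != len(weights) or not probs:
        raise ValueError("probs and weights must be non-empty and the same length")
    return sum(p * w for p, w in zip(probs, weights)) / sum(weights)


def extremize(p, a):
    """Push p away from 0.5 when a > 1 (an assumed transform, not GJP's exact one)."""
    return p**a / (p**a + (1 - p) ** a)


# Hypothetical forecasts; larger weights stand in for better track records.
forecasts = [0.60, 0.70, 0.80]
weights = [1.0, 2.0, 3.0]

pooled = weighted_average(forecasts, weights)
final = extremize(pooled, a=2.0)  # ends up further from 0.5 than the pooled value
print(round(pooled, 3), round(final, 3))
```

With a = 1 the transform leaves the pooled forecast unchanged, so the exponent controls how aggressive the extremizing step is.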

2. The re­sults & their in­tu­itive meaning

Here are some of the key re­sults:

“In year 1 GJP beat the offi­cial con­trol group by 60%. In year 2, we beat the con­trol group by 78%. GJP also beat its uni­ver­sity-af­fili­ated com­peti­tors, in­clud­ing the Univer­sity of Michi­gan and MIT, by hefty mar­gins, from 30% to 70%.”6

“The Good Judg­ment Pro­ject out­performed a pre­dic­tion mar­ket in­side the in­tel­li­gence com­mu­nity, which was pop­u­lated with pro­fes­sional an­a­lysts who had clas­sified in­for­ma­tion, by 25 or 30 per­cent, which was about the mar­gin by which the su­perfore­cast­ers were out­perform­ing our own pre­dic­tion mar­ket in the ex­ter­nal world.”7

“Teams of ordinary forecasters beat the wisdom of the crowd by about 10%. Prediction markets beat ordinary teams by about 20%. And [teams of superforecasters] beat prediction markets by 15% to 30%.”8 “On average, teams were 23% more accurate than individuals.”9

What does Tet­lock mean when he says that one group did X% bet­ter than an­other? By ex­am­in­ing Table 4 (in Sec­tion 4) it seems that he means X% lower Brier score. What is the Brier score? For more de­tails, see the Wikipe­dia ar­ti­cle; ba­si­cally, it mea­sures the av­er­age squared dis­tance from the truth. This is why it’s bet­ter to have a lower Brier score—it means you were on av­er­age closer to the truth.10

Here is a bar graph of all the fore­cast­ers in Year 2, sorted by Brier score:11

For this set of questions, guessing randomly (assigning even odds to all possibilities) would yield a Brier score of 0.53. So most forecasters did significantly better than that. Some people (those on the far left of this chart, the superforecasters) did much better than the average. For example, in year 2, the superforecaster Doug Lorch did best with 0.14. This was more than 60% better than the control group.12 Importantly, being a superforecaster in one year correlated strongly with being a superforecaster the next year; there was some regression to the mean, but roughly 70% of the superforecasters maintained their status from one year to the next.13

OK, but what does all this mean, in in­tu­itive terms? Here are three ways to get a sense of how good these scores re­ally are:

Way One: Let’s calculate some examples of prediction patterns that would give you Brier scores like those mentioned above. Suppose you make a bunch of predictions with 80% confidence and you are correct 80% of the time. Then your Brier score would be 0.32, roughly middle of the pack in this tournament. If instead it was 93% confidence correct 93% of the time, your Brier score would be 0.130, very close to the best superforecasters and to GJP’s aggregated forecasts.14 In these examples, you are perfectly calibrated, which helps your score; more realistically you would be imperfectly calibrated and thus would need to be right even more often to get those scores.
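The arithmetic behind these examples can be checked with a short sketch. It assumes binary questions and the original Brier score (range 0 to 2); the function names are my own, chosen for illustration.

```python
def brier_original(forecasts, outcomes):
    """Mean original Brier score for binary questions.

    Each forecast is the probability assigned to the event happening;
    each outcome is 1 if it happened and 0 if it did not. Summing the
    squared error over both possible outcomes gives 2 * (p - o)^2.
    """
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equal-length, non-empty inputs")
    return sum(2 * (p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def expected_brier_if_calibrated(p):
    """Expected original Brier score when forecasts of p come true a fraction p of the time."""
    return 2 * p * (1 - p)


# Ten forecasts at 80%, of which eight come true.
print(round(brier_original([0.8] * 10, [1] * 8 + [0] * 2), 4))  # 0.32
print(round(expected_brier_if_calibrated(0.80), 4))             # 0.32
print(round(expected_brier_if_calibrated(0.93), 4))             # 0.1302
print(round(brier_original([0.5], [1]), 4))                     # 0.5
```

The last line shows the baseline for always guessing 50% on a binary question.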

Way Two: “An al­ter­na­tive mea­sure of fore­cast ac­cu­racy is the pro­por­tion of days on which fore­cast­ers’ es­ti­mates were on the cor­rect side of 50%. … For all ques­tions in the sam­ple, a chance score was 47%. The mean pro­por­tion of days with cor­rect es­ti­mates was 75%…”15 Ac­cord­ing to this chart, the su­perfore­cast­ers were on the right side of 50% al­most all the time:16

Way Three: “Across all four years of the tour­na­ment, su­perfore­cast­ers look­ing out three hun­dred days were more ac­cu­rate than reg­u­lar fore­cast­ers look­ing out one hun­dred days.”17 (Bear in mind, this wouldn’t nec­es­sar­ily hold for a differ­ent genre of ques­tions. For ex­am­ple, in­for­ma­tion about the weather de­cays in days, while in­for­ma­tion about the cli­mate lasts for decades or more.)

3. Cor­re­lates of good judgment

The data from this tour­na­ment is use­ful in two ways: It helps us de­cide whose pre­dic­tions to trust, and it helps us make bet­ter pre­dic­tions our­selves. This sec­tion will fo­cus on which kinds of peo­ple and prac­tices best cor­re­late with suc­cess—in­for­ma­tion which is rele­vant to both goals. Sec­tion 4 will cover the train­ing ex­per­i­ment, which helps to ad­dress cau­sa­tion vs. cor­re­la­tion wor­ries.

Feast your eyes on this:18

This shows the cor­re­la­tions be­tween var­i­ous things.19 The left­most column is the most im­por­tant; it shows how each vari­able cor­re­lates with (stan­dard­ized) Brier score. (Re­call that Brier scores mea­sure in­ac­cu­racy, so nega­tive cor­re­la­tions are good.)

It’s worth men­tion­ing that while in­tel­li­gence cor­re­lated with ac­cu­racy, it didn’t steal the show.20 The same goes for time spent de­liber­at­ing.21 The au­thors sum­ma­rize the re­sults as fol­lows: “The best fore­cast­ers scored higher on both in­tel­li­gence and poli­ti­cal knowl­edge than the already well-above-av­er­age group of fore­cast­ers. The best fore­cast­ers had more open-minded cog­ni­tive styles. They benefited from bet­ter work­ing en­vi­ron­ments with prob­a­bil­ity train­ing and col­lab­o­ra­tive teams. And while mak­ing pre­dic­tions, they spent more time de­liber­at­ing and up­dat­ing their fore­casts.”22

That big chart de­picts all the cor­re­la­tions in­di­vi­d­u­ally. Can we use them to con­struct a model to take in all of these vari­ables and spit out a pre­dic­tion for what your Brier score will be? Yes we can:

Figure 3. Struc­tural equa­tion model with stan­dard­ized co­effi­cients.

This model has a multiple correlation of 0.64.23 Earlier, we noted that superforecasters typically remained superforecasters (i.e. in the top 2%), proving that their success wasn’t mostly due to luck. Across all the forecasters, the correlation between performance in one year and performance in the next year is 0.65.24 So we have two good ways to predict how accurate someone will be: Look at their past performance, and look at how well they score on the structural model above.

I spec­u­late that these cor­re­la­tions un­der­es­ti­mate the true pre­dictabil­ity of ac­cu­racy, be­cause the fore­cast­ers were all un­paid on­line vol­un­teers, and many of them pre­sum­ably had ran­dom things come up in their life that got in the way of mak­ing good pre­dic­tions—per­haps they have a kid, or get sick, or move to a new job and so stop read­ing the news for a month, and their ac­cu­racy de­clines.25 Yet still 70% of the su­perfore­cast­ers in one year re­mained su­perfore­cast­ers in the next.

Fi­nally, what about su­perfore­cast­ers in par­tic­u­lar? Is there any­thing to say about what it takes to be in the top 2%?

Tet­lock de­votes much of his book to this. It is hard to tell how much his recom­men­da­tions come from data anal­y­sis and how much are just his own syn­the­sis of the in­ter­views he’s con­ducted with su­perfore­cast­ers. Here is his “Por­trait of the modal su­perfore­caster.”26

Philo­sophic out­look:

  • Cau­tious: Noth­ing is cer­tain.

  • Hum­ble: Real­ity is in­finitely com­plex.

  • Non­de­ter­minis­tic: What­ever hap­pens is not meant to be and does not have to hap­pen.

Abil­ities & think­ing styles:

  • Ac­tively open-minded: Beliefs are hy­pothe­ses to be tested, not trea­sures to be pro­tected.

  • In­tel­li­gent and knowl­edge­able, with a “Need for Cog­ni­tion”: In­tel­lec­tu­ally cu­ri­ous, en­joy puz­zles and men­tal challenges.

  • Reflec­tive: In­tro­spec­tive and self-crit­i­cal.

  • Numer­ate: Com­fortable with num­bers.

Meth­ods of fore­cast­ing:

  • Prag­matic: Not wed­ded to any idea or agenda.

  • An­a­lyt­i­cal: Ca­pable of step­ping back from the tip-of-your-nose per­spec­tive and con­sid­er­ing other views.

  • Dragon­fly-eyed: Value di­verse views and syn­the­size them into their own.

  • Prob­a­bil­is­tic: Judge us­ing many grades of maybe.

  • Thought­ful up­daters: When facts change, they change their minds.

  • Good in­tu­itive psy­chol­o­gists: Aware of the value of check­ing think­ing for cog­ni­tive and emo­tional bi­ases.

Work ethic:

  • Growth mind­set: Believe it’s pos­si­ble to get bet­ter.

  • Grit: Deter­mined to keep at it how­ever long it takes.

Ad­di­tion­ally, there is ex­per­i­men­tal ev­i­dence that su­perfore­cast­ers are less prone to stan­dard cog­ni­tive sci­ence bi­ases than or­di­nary peo­ple.27 This is par­tic­u­larly ex­cit­ing be­cause—we can hope—the same sorts of train­ing that help peo­ple be­come su­perfore­cast­ers might also help over­come bi­ases.

Fi­nally, Tet­lock says that “The strongest pre­dic­tor of ris­ing into the ranks of su­perfore­cast­ers is per­pet­ual beta, the de­gree to which one is com­mit­ted to be­lief up­dat­ing and self-im­prove­ment. It is roughly three times as pow­er­ful a pre­dic­tor as its clos­est ri­val, in­tel­li­gence.”28 Un­for­tu­nately, I couldn’t find any sources or data on this, nor an op­er­a­tional defi­ni­tion of “per­pet­ual beta,” so we don’t know how he mea­sured it.29

4. The train­ing and Tet­lock’s commandments

This sec­tion dis­cusses the sur­pris­ing effect of the train­ing mod­ule on ac­cu­racy, and finishes with Tet­lock’s train­ing-mod­ule-based recom­men­da­tions for how to be­come a bet­ter fore­caster.30

The train­ing mod­ule, which was ran­domly given to some par­ti­ci­pants but not oth­ers, took about an hour to read.31 The au­thors de­scribe the con­tent as fol­lows:

“Train­ing in year 1 con­sisted of two differ­ent mod­ules: prob­a­bil­is­tic rea­son­ing train­ing and sce­nario train­ing. Sce­nario-train­ing was a four-step pro­cess: 1) de­vel­op­ing co­her­ent and log­i­cal prob­a­bil­ities un­der the prob­a­bil­ity sum rule; 2) ex­plor­ing and challeng­ing as­sump­tions; 3) iden­ti­fy­ing the key causal drivers; 4) con­sid­er­ing the best and worst case sce­nar­ios and de­vel­op­ing a sen­si­ble 95% con­fi­dence in­ter­val of pos­si­ble out­comes; and 5) avoid over-cor­rec­tion bi­ases. … Prob­a­bil­is­tic rea­son­ing train­ing con­sisted of les­sons that de­tailed the differ­ence be­tween cal­ibra­tion and re­s­olu­tion, us­ing com­par­i­son classes and base rates (Kah­ne­man & Tver­sky, 1973; Tver­sky & Kah­ne­man, 1981), av­er­ag­ing and us­ing crowd wis­dom prin­ci­ples (Surow­iecki, 2005), find­ing and uti­liz­ing pre­dic­tive math­e­mat­i­cal and statis­ti­cal mod­els (Arkes, 1981; Kah­ne­man & Tver­sky, 1982), cau­tiously us­ing time-se­ries and his­tor­i­cal data, and be­ing self-aware of the typ­i­cal cog­ni­tive bi­ases com­mon through­out the pop­u­la­tion.”32

In later years, they merged the two modules into one and updated it based on their observations of the best forecasters. The updated training module is organized around an acronym, CHAMPS-KNOW:33

Im­pres­sively, this train­ing had a last­ing pos­i­tive effect on ac­cu­racy in all four years:

One might worry that training improves accuracy by motivating the trainees to take their jobs more seriously. Indeed it seems that the trained forecasters made more predictions per question than the control group, though they didn’t make more predictions overall. Nevertheless, the training appears to have had a direct effect on accuracy in addition to this indirect effect.34

Mov­ing on, let’s talk about the ad­vice Tet­lock gives to his au­di­ence in Su­perfore­cast­ing, ad­vice which is based on, though not iden­ti­cal to, the CHAMPS-KNOW train­ing. The book has a few para­graphs of ex­pla­na­tion for each com­mand­ment, a tran­script of which is here; in this post I’ll give my own ab­bre­vi­ated ex­pla­na­tions:


(1) Triage: Don’t waste time on ques­tions that are “clock­like” where a rule of thumb can get you pretty close to the cor­rect an­swer, or “cloudlike” where even fancy mod­els can’t beat a dart-throw­ing chimp.

(2) Break seem­ingly in­tractable prob­lems into tractable sub-prob­lems: This is how Fermi es­ti­ma­tion works. One re­lated piece of ad­vice is “be wary of ac­ci­den­tally sub­sti­tut­ing an easy ques­tion for a hard one,” e.g. sub­sti­tut­ing “Would Is­rael be will­ing to as­sas­si­nate Yasser Arafat?” for “Will at least one of the tests for polo­nium in Arafat’s body turn up pos­i­tive?”

(3) Strike the right bal­ance be­tween in­side and out­side views: In par­tic­u­lar, first an­chor with the out­side view and then ad­just us­ing the in­side view. (More on this in Sec­tion 5)

(4) Strike the right bal­ance be­tween un­der- and over­re­act­ing to ev­i­dence: “Su­perfore­cast­ers aren’t perfect Bayesian pre­dic­tors but they are much bet­ter than most of us.”35 Usu­ally do many small up­dates, but oc­ca­sion­ally do big up­dates when the situ­a­tion calls for it. Take care not to fall for things that seem like good ev­i­dence but aren’t; re­mem­ber to think about P(E|H)/​P(E|~H); re­mem­ber to avoid the base-rate fal­lacy.

(5) Look for the clash­ing causal forces at work in each prob­lem: This is the “drag­on­fly eye per­spec­tive,” which is where you at­tempt to do a sort of men­tal wis­dom of the crowds: Have tons of differ­ent causal mod­els and ag­gre­gate their judg­ments. Use “Devil’s ad­vo­cate” rea­son­ing. If you think that P, try hard to con­vince your­self that not-P. You should find your­self say­ing “On the one hand… on the other hand… on the third hand…” a lot.

(6) Strive to dis­t­in­guish as many de­grees of doubt as the prob­lem per­mits but no more: Some peo­ple crit­i­cize the use of ex­act prob­a­bil­ities (67%! 21%!) as merely a way to pre­tend you know more than you do. There might be an­other post on the sub­ject of why cre­dences are bet­ter than hedge words like “maybe” and “prob­a­bly” and “sig­nifi­cant chance;” for now, I’ll sim­ply men­tion that when the au­thors rounded the su­perfore­caster’s fore­casts to the near­est 0.05, their ac­cu­racy dropped.36 Su­perfore­cast­ers re­ally were mak­ing use of all 101 num­bers from 0.00 to 1.00!

(7) Strike the right bal­ance be­tween un­der- and over­con­fi­dence, be­tween pru­dence and de­ci­sive­ness.

(8) Look for the er­rors be­hind your mis­takes but be­ware of rearview-mir­ror hind­sight bi­ases.

(9) Bring out the best in oth­ers and let oth­ers bring out the best in you: The book spent a whole chap­ter on this, us­ing the Wehrma­cht as an ex­tended case study on good team or­ga­ni­za­tion.37 One per­va­sive guid­ing prin­ci­ple is “Don’t tell peo­ple how to do things; tell them what you want ac­com­plished, and they’ll sur­prise you with their in­ge­nu­ity in do­ing it.” The other per­va­sive guid­ing prin­ci­ple is “Cul­ti­vate a cul­ture in which peo­ple—even sub­or­di­nates—are en­couraged to dis­sent and give coun­ter­ar­gu­ments.”38

(10) Master the er­ror-bal­anc­ing bi­cy­cle: This one should have been called prac­tice, prac­tice, prac­tice. Tet­lock says that read­ing the news and gen­er­at­ing prob­a­bil­ities isn’t enough; you need to ac­tu­ally score your pre­dic­tions so that you know how wrong you were.

(11) Don’t treat com­mand­ments as com­mand­ments: Tet­lock’s point here is sim­ply that you should use your judg­ment about whether to fol­low a com­mand­ment or not; some­times they should be over­rid­den.
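Commandment (4) mentions the ratio P(E|H)/P(E|~H). As a minimal sketch of how that likelihood ratio drives an update, the following applies Bayes’ rule in odds form. The function name and the example numbers are hypothetical, not taken from the book.

```python
def bayes_update(prior, p_e_given_h, p_e_given_not_h):
    """Posterior P(H|E) from a prior P(H) and the two likelihoods, via odds form."""
    if not 0 < prior < 1:
        raise ValueError("prior must be strictly between 0 and 1")
    if p_e_given_not_h <= 0:
        raise ValueError("P(E|~H) must be positive")
    prior_odds = prior / (1 - prior)
    likelihood_ratio = p_e_given_h / p_e_given_not_h
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


# Evidence twice as likely under H as under not-H: a moderate update.
print(round(bayes_update(0.50, 0.8, 0.4), 3))  # 0.667

# Evidence equally likely either way: vivid, perhaps, but no update at all.
print(round(bayes_update(0.10, 0.9, 0.9), 3))  # 0.1
```

The second call is the "seems like good evidence but isn't" case: the evidence is very probable under the hypothesis, yet the ratio is 1, so the base rate stands.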

It’s worth men­tion­ing at this point that the ad­vice is given at the end of the book, as a sort of sum­mary, and may make less sense to some­one who hasn’t read the book. In par­tic­u­lar, Chap­ter 5 gives a less for­mal but more helpful recipe for mak­ing pre­dic­tions, with ac­com­pa­ny­ing ex­am­ples. See the end of this blog post for a sum­mary of this recipe.

5. On the Out­side View & Les­sons for AI Impacts

The pre­vi­ous sec­tion sum­ma­rized Tet­lock’s ad­vice for how to make bet­ter fore­casts; my own sum­mary of the les­sons I think we should learn is more con­cise and com­pre­hen­sive and can be found at this page. This sec­tion goes into de­tail about one par­tic­u­lar, more con­tro­ver­sial mat­ter: The im­por­tance of the “out­side view,” also known as refer­ence class fore­cast­ing. This re­search pro­vides us with strong ev­i­dence in fa­vor of this method of mak­ing pre­dic­tions; how­ever, the situ­a­tion is com­pli­cated by Tet­lock’s in­sis­tence that other meth­ods are use­ful as well. This sec­tion dis­cusses the ev­i­dence and at­tempts to in­ter­pret it.

The GJP asked peo­ple who took the train­ing to self-re­port which of the CHAMPS-KNOW prin­ci­ples they were us­ing when they ex­plained why they made a fore­cast; 69% of fore­cast ex­pla­na­tions re­ceived tags this way. The only prin­ci­ple sig­nifi­cantly pos­i­tively cor­re­lated with suc­cess­ful fore­casts was C: Com­par­i­son classes.39 The au­thors take this as ev­i­dence that the out­side view is par­tic­u­larly im­por­tant. Anec­do­tally, the su­perfore­caster I in­ter­viewed agreed that refer­ence class fore­cast­ing was per­haps the most im­por­tant piece of the train­ing. (He also cred­ited the train­ing in gen­eral with helping him reach the ranks of the su­perfore­cast­ers.)

More­over, Tet­lock did an ear­lier, much smaller fore­cast­ing tour­na­ment from 1987-2003, in which ex­perts of var­i­ous kinds made the fore­casts.40 The re­sults were as­tound­ing: Many of the ex­perts did worse than ran­dom chance, and all of them did worse than sim­ple al­gorithms:

Figure 3.2, pulled from Expert Political Judgment, is a gorgeous depiction of some of the main results.41 Tetlock used something very much like a Brier score in this tournament, but he broke it into two components: “Discrimination” and “Calibration.” This graph plots the various experts and algorithms on the axes of discrimination and calibration. Notice in the top right corner the “Formal models” box. I don’t know much about the model used but apparently it was significantly better than all of the humans. This, combined with the fact that simple case-specific trend extrapolations also beat all the humans, is strong evidence for the importance of the outside view.

So we should always use the out­side view, right? Well, it’s a bit more com­pli­cated than that. Tet­lock’s ad­vice is to start with the out­side view, and then ad­just us­ing the in­side view.42 He even goes so far as to say that hedge­hog­gery and sto­ry­tel­ling can be valuable when used prop­erly.

First, what is hedge­hog­gery? Re­call how the hu­man ex­perts fall on a rough spec­trum in Figure 3.2, with “hedge­hogs” get­ting the low­est scores and “foxes” get­ting the high­est scores. What makes some­one a hedge­hog or a fox? Their an­swers to these ques­tions.43 Tet­lock char­ac­ter­izes the dis­tinc­tion as fol­lows:

Low scor­ers look like hedge­hogs: thinkers who “know one big thing,” ag­gres­sively ex­tend the ex­plana­tory reach of that one big thing into new do­mains, dis­play bristly im­pa­tience with those who “do not get it,” and ex­press con­sid­er­able con­fi­dence that they are already pretty profi­cient fore­cast­ers, at least in the long term. High scor­ers look like foxes: thinkers who know many small things (tricks of their trade), are skep­ti­cal of grand schemes, see ex­pla­na­tion and pre­dic­tion not as de­duc­tive ex­er­cises but rather as ex­er­cises in flex­ible “ad ho­cery” that re­quire stitch­ing to­gether di­verse sources of in­for­ma­tion, and are rather diffi­dent about their own fore­cast­ing prowess, and … rather du­bi­ous that the cloudlike sub­ject of poli­tics can be the ob­ject of a clock­like sci­ence.44

Next, what is sto­ry­tel­ling? Us­ing your do­main knowl­edge, you think through a de­tailed sce­nario of how the fu­ture might go, and you tweak it to make it more plau­si­ble, and then you as­sign a cre­dence based on how plau­si­ble it seems. By it­self this method is un­promis­ing.45

De­spite this, Tet­lock thinks that sto­ry­tel­ling and hedge­hog­gery are valuable if han­dled cor­rectly. On hedge­hogs, Tet­lock says that hedge­hogs provide a valuable ser­vice by do­ing the deep think­ing nec­es­sary to build de­tailed causal mod­els and raise in­ter­est­ing ques­tions; these mod­els and ques­tions can then be slurped up by foxy su­perfore­cast­ers, eval­u­ated, and ag­gre­gated to make good pre­dic­tions.46 The su­perfore­caster Bill Flack is quoted in agree­ment.47 As for sto­ry­tel­ling, see these slides from Tet­lock’s edge.org sem­i­nar:

As the sec­ond slide in­di­cates, the idea is that we can some­times “fight fire with fire” by us­ing some sto­ries to counter other sto­ries. In par­tic­u­lar, Tet­lock says there has been suc­cess us­ing sto­ries about the past—about ways that the world could have gone, but didn’t—to “re­con­nect us to our past states of ig­no­rance.”48 The su­perfore­caster I in­ter­viewed said that it is com­mon prac­tice now on su­perfore­caster fo­rums to have a des­ig­nated “red team” with the ex­plicit mis­sion of find­ing counter-ar­gu­ments to what­ever the con­sen­sus seems to be. This, I take it, is an ex­am­ple of mo­ti­vated rea­son­ing be­ing put to good use.

More­over, ar­guably the out­side view sim­ply isn’t use­ful for some ques­tions.49 Peo­ple say this about lots of things—e.g. “The world is chang­ing so fast, so the cur­rent situ­a­tion in Syria is un­prece­dented and his­tor­i­cal av­er­ages will be use­less!”—and are proven wrong; for ex­am­ple, this re­search seems to in­di­cate that the out­side view is far more use­ful in geopoli­tics than peo­ple think. Nev­er­the­less, maybe it is true for some of the things we wish to pre­dict about ad­vanced AI. After all, a ma­jor limi­ta­tion of this data is that the ques­tions were mainly on geopoli­ti­cal events only a few years in the fu­ture at most. (Geopoli­ti­cal events seem to be some­what pre­dictable up to two years out but much more difficult to pre­dict five, ten, twenty years out.)50 So this re­search does not di­rectly tell us any­thing about the pre­dictabil­ity of the events AI Im­pacts is in­ter­ested in, nor about the use­ful­ness of refer­ence-class fore­cast­ing for those do­mains.51

That said, the fore­cast­ing best prac­tices dis­cov­ered by this re­search seem like gen­eral truth-find­ing skills rather than cheap hacks only use­ful in geopoli­tics or only use­ful for near-term pre­dic­tions. After all, geopoli­ti­cal ques­tions are them­selves a fairly di­verse bunch, yet ac­cu­racy on some was highly cor­re­lated with ac­cu­racy on oth­ers.52 So de­spite these limi­ta­tions I think we should do our best to imi­tate these best-prac­tices, and that means us­ing the out­side view far more than we would nat­u­rally be in­clined.

One final thing worth remembering is that the GJP’s aggregated judgments did at least as well as the best superforecasters.53 Presumably at least one of the forecasters in the tournament was using the outside view a lot; after all, half of them were trained in reference-class forecasting. So I think we can conclude that straightforwardly using the outside view as often as possible wouldn’t get you better scores than the GJP, though it might get you close for all we know. Anecdotally, it seems that when the superforecasters use the outside view they often aggregate between different reference-class forecasts.54 The wisdom of the crowds is powerful; this is consistent with the wider literature on the cognitive superiority of groups, and the literature on ensemble methods in AI.55

Tet­lock de­scribes how su­perfore­cast­ers go about mak­ing their pre­dic­tions.56 Here is an at­tempt at a sum­mary:

  1. Some­times a ques­tion can be an­swered more rigor­ously if it is first “Fermi-ized,” i.e. bro­ken down into sub-ques­tions for which more rigor­ous meth­ods can be ap­plied.

  2. Next, use the out­side view on the sub-ques­tions (and/​or the main ques­tion, if pos­si­ble). You may then ad­just your es­ti­mates us­ing other con­sid­er­a­tions (‘the in­side view’), but do this cau­tiously.

  3. Seek out other per­spec­tives, both on the sub-ques­tions and on how to Fermi-ize the main ques­tion. You can also gen­er­ate other per­spec­tives your­self.

  4. Re­peat steps 1 – 3 un­til you hit diminish­ing re­turns.

  5. Your fi­nal pre­dic­tion should be based on an ag­gre­ga­tion of var­i­ous mod­els, refer­ence classes, other ex­perts, etc.
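Steps 2 and 5 of the recipe above can be sketched in a few lines. Everything here is a hypothetical illustration: the reference classes and their counts are made up, a plain average is only one possible way to aggregate, and the cap on the inside-view adjustment is my own stand-in for "do this cautiously."

```python
def base_rate(event_count, total_cases):
    """Outside-view estimate: how often the event occurred in a reference class."""
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    return event_count / total_cases


def adjust_cautiously(outside, inside_shift, max_shift=0.10):
    """Apply an inside-view shift, capped at max_shift and clipped to [0, 1]."""
    shift = max(-max_shift, min(max_shift, inside_shift))
    return min(1.0, max(0.0, outside + shift))


# Made-up reference classes: (cases where the event occurred, total cases).
reference_classes = {
    "similar disputes since 1990": (3, 20),
    "all disputes in the region": (12, 60),
    "disputes involving the same actors": (1, 4),
}

rates = [base_rate(k, n) for k, n in reference_classes.values()]
outside_view = sum(rates) / len(rates)  # simple average across reference classes
forecast = adjust_cautiously(outside_view, inside_shift=0.5)  # large shift gets capped
print(round(outside_view, 3), round(forecast, 3))
```

The cap keeps a persuasive inside-view story from moving the forecast far from the base rates.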


  1. This graph can be found here, the GJP’s list of aca­demic liter­a­ture on this topic. The graph illus­trates ap­prox­i­mate rel­a­tive effects. It will be dis­cussed more in Sec­tion 2.

  2. This is from my con­ver­sa­tion with the su­perfore­caster.

  3. They did this so that they could in­clude oc­ca­sional non-bi­nary ques­tions. They show here that their re­sults are ro­bust to us­ing a log­a­r­ith­mic scor­ing rule in­stead.

  4. Th­ese statis­tics come from this study. The dataset ex­cludes in­di­vi­d­u­als who signed up but failed to reg­ister at least 25 pre­dic­tions in a given year.

  5. The ag­gre­ga­tion al­gorithm was elitist, mean­ing that it weighted more heav­ily fore­cast­ers with good track-records who had up­dated their fore­casts more of­ten. This de­scrip­tion of elitism comes from the web­page. In these slides Tet­lock de­scribes the elitism differ­ently: He says it gives weight to higher-IQ, more open-minded fore­cast­ers. The ex­trem­iz­ing step pushes the ag­gre­gated judg­ment closer to 1 or 0, to make it more con­fi­dent. The de­gree to which they ex­trem­ize de­pends on how di­verse and so­phis­ti­cated the pool of fore­cast­ers is. The aca­demic pa­pers on this topic can be found here and here. Whether ex­trem­iz­ing is a good idea is con­tro­ver­sial; ac­cord­ing to one ex­pert I in­ter­viewed, more re­cent data sug­gests that the suc­cesses of the ex­trem­iz­ing al­gorithm dur­ing the fore­cast­ing tour­na­ment were a fluke. After all, a pri­ori one would ex­pect ex­trem­iz­ing to lead to small im­prove­ments in ac­cu­racy most of the time, but big losses in ac­cu­racy some of the time.

  6. Superforecasting p18. On page 69: “Teams had to beat the combined forecast—the “wisdom of the crowd”—of the control group, and by margins we all saw as intimidating. In the first year, IARPA wanted teams to beat that standard by 20%—and it wanted that margin of victory to grow to 50% by the fourth year.” In light of this, it is especially impressive that individual superforecasters in the first two years beat the wisdom-of-the-crowds-of-the-control-group by ~60% and that the GJP beat it by 78%. (p72)

  7. Tran­script of this sem­i­nar.

  8. Su­perfore­cast­ing p207

  9. Su­perfore­cast­ing p201

  10. The best pos­si­ble Brier score is 0; the Brier score achieved by guess­ing ran­domly de­pends on which ver­sion of the score you use and how many pos­si­ble out­comes each pre­dic­tion chooses be­tween. For bi­nary pre­dic­tions, which con­sti­tuted the bulk of IARPA’s ques­tions, the origi­nal ver­sion of the Brier score is effec­tively twice the squared dis­tance from the truth, so always guess­ing 50% would yield a score of 0.5.

  11. This is from this study. The data cov­ers the first two years of the tour­na­ment.

  12. Su­perfore­cast­ing p93

  13. Su­perfore­cast­ing p104

  14. To calculate this, I assumed binary questions and plugged the probability, p, into this formula: P(event_doesn’t_happen)(0-p)^2 + P(event_happens)(1-p)^2 = (1-p)(0-p)^2 + (p)(1-p)^2, which simplifies to p(1-p). I then doubled it, since we are using the original Brier score that ranges between 0-2 instead of 0-1. I can’t find stats on GJP’s Brier score, but recall that in year 2 it was 78% better than the control group, and Doug Lorch’s 0.14 was 60% better than the control group. (Superforecasting p93)

  15. This is from the same study, as are the two figures.

  16. The cor­re­la­tion be­tween av­er­age Brier score and how of­ten you were on the right side of 50% was 0.89 (same study), so I think it’s safe to as­sume the su­perfore­cast­ers were some­where on the right side of the peak in Figure 2. (I as­sume they mean be­ing on the right side of 50% cor­re­lates with lower Brier scores; the al­ter­na­tive is crazy.) The high pro­por­tion of guesses on the right side of 50% is a puz­zling fact—doesn’t it sug­gest that they were poorly cal­ibrated, and that they could im­prove their scores by ex­trem­iz­ing their judg­ments? I think what’s go­ing on here is that the ma­jor­ity of fore­casts made on most ques­tions by su­perfore­cast­ers were highly (>90%) con­fi­dent, and also al­most always cor­rect.

  17. Su­perfore­cast­ing p94, em­pha­sis mine. Later, in the edge.org sem­i­nar, Tet­lock says “In some other ROC curves—re­ceiver op­er­a­tor char­ac­ter­is­tic curves, from sig­nal de­tec­tion the­ory—that Mark Steyvers at UCSD con­structed—su­perfore­cast­ers could as­sign prob­a­bil­ities 400 days out about as well as reg­u­lar peo­ple could about eighty days out.” The quote is ac­com­panied by a graph; un­for­tu­nately, it’s hard to in­ter­pret.

  18. This table is from the same study.

  19. “Ravens” is an IQ test, “Numer­acy” is a math­e­mat­i­cal ap­ti­tude test.

  20. That said, as Carl Shul­man pointed out, the fore­cast­ers in this sam­ple were prob­a­bly above-av­er­age IQ, so the cor­re­la­tion be­tween IQ and ac­cu­racy in this sam­ple is al­most cer­tainly smaller than the “true” cor­re­la­tion in the pop­u­la­tion at large. See e.g. re­stric­tion of range and the Thorndike Cor­rec­tion.

  21. “De­liber­a­tion time, which was only mea­sured in Year 2, was trans­formed by a log­a­r­ith­mic func­tion (to re­duce tail effects) and av­er­aged over ques­tions. The av­er­age length of de­liber­a­tion time was 3.60 min, and the av­er­age num­ber of ques­tions tried through­out the 2-year pe­riod was 121 out of 199 (61% of all ques­tions). Cor­re­la­tions be­tween stan­dard­ized Brier score ac­cu­racy and effort were statis­ti­cally sig­nifi­cant for be­lief up­dat­ing, … and de­liber­a­tion time, … but not for num­ber of fore­cast­ing ques­tions at­tempted.” (study) Anec­do­tally, I spoke to a su­perfore­caster who said that the best of the best typ­i­cally put a lot of time into it; he spends maybe fif­teen min­utes each day mak­ing pre­dic­tions but sev­eral hours per day read­ing news, listen­ing to rele­vant pod­casts, etc.

  22. This is from the same study.

  23. “Nonethe­less, as we saw in the struc­tural model, and con­firm here, the best model uses dis­po­si­tional, situ­a­tional, and be­hav­ioral vari­ables. The com­bi­na­tion pro­duced a mul­ti­ple cor­re­la­tion of .64.” (study) Yel­low ovals are la­tent dis­po­si­tional vari­ables, yel­low rec­t­an­gles are ob­served dis­po­si­tional vari­ables, pink rec­t­an­gles are ex­per­i­men­tally ma­nipu­lated situ­a­tional vari­ables, and green rec­t­an­gles are ob­served be­hav­ioral vari­ables. If this di­a­gram fol­lows con­ven­tion, sin­gle-headed ar­rows rep­re­sent hy­poth­e­sized cau­sa­tion, whereas the dou­ble-headed ar­row rep­re­sents a cor­re­la­tion with­out any claim be­ing made about cau­sa­tion.

  24. Su­perfore­cast­ing p104

  25. Of course, these things can hap­pen in the real world too—maybe our AI timelines fore­cast­ers will get sick and stop mak­ing good fore­casts. What I’m sug­gest­ing is that this data is in­her­ently nois­ier than data from a group of full-time staff whose job it is to pre­dict things would be. More­over, when these things hap­pen in the real world, we can see that they are hap­pen­ing and ad­just our model ac­cord­ingly, e.g. “Bob’s re­ally busy with kids this month, so let’s not lean as heav­ily on his fore­casts as we usu­ally do.”

  26. Su­perfore­cast­ing p191

  27. From edge.org: Mel­lers: “We have given them lots of Kah­ne­man and Tver­sky-like prob­lems to see if they fall prey to the same sorts of bi­ases and er­rors. The an­swer is sort of, some of them do, but not as many. It’s not nearly as fre­quent as you see with the rest of us or­di­nary mor­tals. The other thing that’s in­ter­est­ing is they don’t make the kinds of mis­takes that reg­u­lar peo­ple make in­stead of the right an­swer. They do some­thing that’s a lit­tle bit more thought­ful. They in­te­grate base rates with case-spe­cific in­for­ma­tion a lit­tle bit more.”
    Tet­lock: “They’re closer to Bayesi­ans.”
    Mel­lers: “Right. They’re a lit­tle less sen­si­tive to fram­ing effects. The refer­ence point doesn’t have quite the enor­mous role that it does with most peo­ple.”

  28. Su­perfore­cast­ing p192

  29. More­over, a quick search through Google Scholar and library.unc.edu turned up noth­ing of in­ter­est. I reached out to Tet­lock to ask ques­tions but he hasn’t re­sponded yet.

  30. “The guidelines sketched here dis­till key themes in this book and in train­ing sys­tems that have been ex­per­i­men­tally demon­strated to boost ac­cu­racy in real-world fore­cast­ing tour­na­ments.” (277)

  31. This is from this study. Rele­vant quote: “Although the train­ing lasted less than one hour, it con­sis­tently im­proved ac­cu­racy (Brier scores) by 6 to 11% over the con­trol con­di­tion.”

  32. Same study.

  33. Same study.

  34. See sec­tions 3.3, 3.5, and 3.6 of this study.

  35. Su­perfore­cast­ing p281

  36. This is from Fried­man et al (2018), available here.

  37. Scott Alexan­der: “Later in the chap­ter, he ad­mits that his choice of ex­am­ples might raise some eye­brows, but says that he did it on pur­pose to teach us to think crit­i­cally and over­come cog­ni­tive dis­so­nance be­tween our moral pre­con­cep­tions and our fac­tual be­liefs. I hope he has tenure.”

  38. See e.g. page 284 of Su­perfore­cast­ing, and the en­tirety of chap­ter 9.

  39. This is from this paper. One worry I have about it is that another principle, P, was strongly associated with inaccuracy, but the authors explain this away by saying that “Post-mortem analyses,” the P’s, are naturally done mostly after bad forecasts. This makes me wonder if a similar explanation could be given for the success of the C’s: questions for which a good reference class exists are easier than others.

  40. The re­sults and con­clu­sions from this tour­na­ment can be found in the re­sult­ing book, Ex­pert Poli­ti­cal Judg­ment: How good is it? How can we know? See p242 for a de­scrip­tion of the method­ol­ogy and dates.

  41. Page 77.

  42. Su­perfore­cast­ing p120

  43. For the data on how these questions were weighted in determining foxiness, see Expert Political Judgment p74.

  44. Ex­pert Poli­ti­cal Judg­ment p75

  45. There are sev­eral rea­sons to worry about this method. For one, it’s not what foxes do, and foxes score bet­ter than hedge­hogs. Tet­lock also says it’s not what su­perfore­cast­ers do. More in­sight­fully, Tet­lock says we are bi­ased to as­sign more prob­a­bil­ity to more vivid and in­ter­est­ing sto­ries, and as a re­sult it’s easy for your prob­a­bil­ities to sum to much more than 1. Anec­dote: I was an­swer­ing a se­ries of “Prob­a­bil­ity of ex­tinc­tion due to cause X” ques­tions on Me­tac­u­lus, and I soon re­al­ized that my num­bers were go­ing to add up to more than 100%, so I had to ad­just them all down sys­tem­at­i­cally to make room for the last few kinds of dis­aster on the list. If I hadn’t been as­sign­ing ex­plicit prob­a­bil­ities, I wouldn’t have no­ticed the er­ror. And if I hadn’t gone through the whole list of pos­si­bil­ities, I would have come away with an un­jus­tifi­ably high cre­dence in the few I had con­sid­ered.

  46. Su­perfore­cast­ing p266. This is rem­i­nis­cent of Yud­kowsky’s per­spec­tive on what is es­sen­tially this same de­bate.

  47. Su­perfore­cast­ing p271.

  48. Same sem­i­nar.

  49. For ex­am­ple, see Yud­kowsky: “Where two sides dis­agree, this can lead to refer­ence class ten­nis—both par­ties get stuck in­sist­ing that their own “out­side view” is the cor­rect one, based on di­verg­ing in­tu­itions about what similar­i­ties are rele­vant. If it isn’t clear what the set of “similar his­tor­i­cal cases” is, or what con­clu­sions we should draw from those cases, then we’re forced to use an in­side view—think­ing about the causal pro­cess to dis­t­in­guish rele­vant similar­i­ties from ir­rele­vant ones. You shouldn’t avoid out­side-view-style rea­son­ing in cases where it looks likely to work, like when plan­ning your Christ­mas shop­ping. But in many con­texts, the out­side view sim­ply can’t com­pete with a good the­ory.”

  50. Tet­lock ad­mits that “there is no ev­i­dence that geopoli­ti­cal or eco­nomic fore­cast­ers can pre­dict any­thing ten years out be­yond the ex­cru­ci­at­ingly ob­vi­ous… Th­ese limits on pre­dictabil­ity are the pre­dictable re­sults of the but­terfly dy­nam­ics of non­lin­ear sys­tems. In my EPJ re­search, the ac­cu­racy of ex­pert pre­dic­tions de­clined to­ward chance five years out.” (Su­perfore­cast­ing p243) I highly recom­mend the graphic on that page, by the way, also available here: “Thoughts for the 2001 Qua­dren­nial Defense Re­view.”

  51. The su­perfore­caster I in­ter­viewed spec­u­lated that pre­dict­ing things like the con­tinued drop in price of com­put­ing hard­ware or so­lar pan­els is fairly easy, but that pre­dict­ing the ap­pear­ance of new tech­nolo­gies is very difficult. Tet­lock has ideas for how to han­dle longer-term, neb­u­lous ques­tions. He calls it “Bayesian Ques­tion Clus­ter­ing.” (Su­perfore­cast­ing 263) The idea is to take the ques­tion you re­ally want to an­swer and look for more pre­cise ques­tions that are ev­i­den­tially rele­vant to the ques­tion you care about. Tet­lock in­tends to test the effec­tive­ness of this idea in fu­ture re­search.

  52. “There are sev­eral ways to look for in­di­vi­d­ual con­sis­tency across ques­tions. We sorted ques­tions on the ba­sis of re­sponse for­mat (bi­nary, mult­i­no­mial, con­di­tional, or­dered), re­gion (Eur­zone, Latin Amer­ica, China, etc.), and du­ra­tion of ques­tion (short, medium, and long). We com­puted ac­cu­racy scores for each in­di­vi­d­ual on each vari­able within each set (e.g., bi­nary, mult­i­no­mial, con­di­tional, and or­dered) and then con­structed cor­re­la­tion ma­tri­ces. For all three ques­tion types, cor­re­la­tions were pos­i­tive… Then we con­ducted fac­tor analy­ses. For each ques­tion type, a large pro­por­tion of the var­i­ance was cap­tured by a sin­gle fac­tor, con­sis­tent with the hy­poth­e­sis that one un­der­ly­ing di­men­sion was nec­es­sary to cap­ture cor­re­la­tions among re­sponse for­mats, re­gions, and ques­tion du­ra­tion.” (from this study)

  53. I haven’t found this said ex­plic­itly, but I in­fer this from Doug Lorch, the best su­perfore­caster in Year 2, beat­ing the con­trol group by at least 60% when the GJP beat the con­trol group by 78%. (Su­perfore­cast­ing 93, 18) That said, page 72 seems to say that in Year 2 ex­actly one per­son—Doug Lorch—man­aged to beat the ag­gre­ga­tion al­gorithm. This is al­most a con­tra­dic­tion; I’m not sure what to make of it. At any rate, it seems that the ag­gre­ga­tion al­gorithm pretty re­li­ably does bet­ter than the su­perfore­cast­ers in gen­eral, even if oc­ca­sion­ally one of them beats it.

  54. This is on page 304. Another ex­am­ple on 313.

  55. For more on these, see this page.

  56. This is my sum­mary of Tet­lock’s ad­vice in Chap­ter 5: “Ul­ti­mately, it’s not the num­ber crunch­ing power that counts. It’s how you use it. … You’ve Fermi-ized the ques­tion, con­sulted the out­side view, and now, fi­nally, you can con­sult the in­side view … So you have an out­side view and an in­side view. Now they have to be merged. …”
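
The arithmetic described in footnote 14 can be sketched in a few lines of Python. This is only an illustration under the same assumptions the footnote makes: a binary question, a forecaster who states probability p, and an event that actually happens with probability p. The function name `expected_brier` is my own and does not come from the GJP papers or the book.

```python
def expected_brier(p: float) -> float:
    """Expected original (0-2 scale) Brier score on a binary question.

    Assumption (as in footnote 14): the forecaster says p and the event
    really does happen with probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    miss = (1 - p) * (0 - p) ** 2  # event doesn't happen: outcome 0, forecast p
    hit = p * (1 - p) ** 2         # event happens: outcome 1, forecast p
    return 2 * (miss + hit)        # doubled to match the 0-2 scale


# Print the expected score for a few confidence levels.
for p in (0.5, 0.75, 0.9, 0.95):
    print(f"p = {p:.2f} -> expected Brier = {expected_brier(p):.4f}")
```

Under these assumptions the expression simplifies to 2p(1-p), so the expected score is 0.5 at p = 0.5 and falls toward 0 as p approaches 0 or 1.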