Fireside Chat with Philip Tetlock

Philip Tet­lock is an ex­pert on fore­cast­ing. He’s spent decades study­ing how peo­ple make pre­dic­tions — from poli­ti­cal pun­dits to CIA an­a­lysts. In this con­ver­sa­tion with Nathan Labenz, he dis­cusses a wide range of top­ics, in­clud­ing pre­dic­tion al­gorithms, the long-term fu­ture, and his Good Judg­ment Pro­ject (which iden­ti­fied com­mon traits of the most skil­led fore­cast­ers and fore­cast­ing teams).

Below is a tran­script of the EA Global con­ver­sa­tion be­tween Tet­lock and Nathan Labenz, which we have ed­ited lightly for clar­ity. You can also watch this talk on YouTube or read its tran­script on effec­tivealtru­ism.org.


Labenz: It’s ex­cit­ing to have you here. A lot of au­di­ence mem­bers will be fa­mil­iar with your work and its im­pact on the EA com­mu­nity. But about half are at­tend­ing EA Global for the first time, and may not be as fa­mil­iar with your re­search.

We’re go­ing to cor­rect that right now. To start off, let’s go back in time, break down your re­search into a few chap­ters, and talk about each one to provide a short in­tro­duc­tion. Then, we will turn to­ward top­ics that are [timely] and of prac­ti­cal in­ter­est to the EA com­mu­nity.

Your first break­through book was called Ex­pert Poli­ti­cal Judg­ment. In it, you sum­ma­rized a sys­tem­atic ap­proach to eval­u­at­ing ex­pert poli­ti­cal judg­ment. Maybe you could start by shar­ing your prin­ci­pal con­clu­sions in do­ing that re­search.

Tet­lock: Sure. The book’s full ti­tle is Ex­pert Poli­ti­cal Judg­ment: How Good Is It? How Can We Know? The lat­ter two parts are crit­i­cal.

I be­came in­volved in fore­cast­ing tour­na­ments a long time ago. It’s hard to talk about it with­out re­veal­ing how old I am. I had just got­ten tenure at [the Univer­sity of Cal­ifor­nia,] Berkeley. It was 1984 and I was look­ing for some­thing mean­ingful to do with my life. There was a large, ac­rimo­nious de­bate go­ing on at the time about where the Soviet Union was head­ing. Its old gen­er­a­tion of lead­er­ship was dy­ing off: Brezh­nev, An­dropov, and Ch­er­nenko. Gor­bachev was just about to as­cend to the role of Gen­eral Sec­re­tary in the Polit­buro.

It’s hard to recapture states of ignorance from the past. The past is another land. But trust me on this: The liberals, by and large, were very concerned that the Reagan Administration was going to escalate the conflict with the Soviet Union to the point of a nuclear confrontation, or a nuclear war, for that matter. There was a nuclear freeze movement and a great deal of apocalyptic talk. Conservatives’ views were in line with those of Jeane Kirkpatrick, ambassador to the United Nations under Reagan, who believed the Soviet Union was an infallibly self-reproducing totalitarian system that wasn’t going to change.

Nei­ther side proved to be ac­cu­rate about Gor­bachev. Liber­als were con­cerned that Rea­gan was driv­ing the Soviets into neo-Stal­inist re­trench­ment. Con­ser­va­tives thought, “We’ll be able to con­tain and de­ter [the Soviets], but they’re not go­ing to change.” No­body ex­pected that Gor­bachev would in­sti­tute re­forms as rad­i­cal as the ones that he did — re­forms that ul­ti­mately cul­mi­nated in the dis­in­te­gra­tion of the Soviet Union six years later. No­body was even [close to be­ing] right. Nonethe­less, vir­tu­ally ev­ery­body felt that, in some fash­ion, they were. Con­ser­va­tives re­for­mu­lated the events. They said, “We won the cold war. We drove [the Soviets] to trans­form.” Liber­als said, “It would have hap­pened any­way. Their econ­omy was crum­bling and they rec­og­nized that, as Gor­bachev once said to She­vard­nadze, ‘We can’t go on liv­ing this way.’”

Each side was well-positioned to explain, with almost 20/20 hindsight, what virtually everybody had been unable to see prospectively. It was this asymmetry between hindsight and foresight that led me to believe it would be interesting to track expert political judgment over time.

The first people we studied were Sovietologists. I had some colleagues at UC Berkeley and elsewhere who helped with that work. There was also some foundation support. We’ve gradually been able to cobble together increasingly large forecasting tournaments. Drawing on big collections of data from 1988 and 1992, the book came out in 2005. The upshot was that subject-matter experts don’t know as much as they think they do about the future. They are miscalibrated. It’s difficult to generalize, but when [experts] say they are 85% confident [in a prediction], it happens roughly 65% or 70% of the time. Some of them are considerably more overconfident than that.
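[To make that calibration gap concrete, the following is a minimal, illustrative sketch (not code from the book, and the numbers are made up) of how one might check calibration: bin probability forecasts by stated confidence and compare each bin’s average confidence with the observed frequency of the predicted outcomes.]

```python
# Minimal, illustrative calibration check: bin probability forecasts by stated
# confidence and compare each bin's average confidence with its hit rate.
from collections import defaultdict

def calibration_table(forecasts, outcomes, bin_width=0.1):
    """forecasts: stated probabilities in [0, 1]; outcomes: 1 if the predicted
    event happened, 0 otherwise. Returns rows of (bin_center, mean_confidence,
    hit_rate, n_forecasts)."""
    n_bins = int(round(1 / bin_width))
    bins = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        bins[min(int(p / bin_width), n_bins - 1)].append((p, y))
    table = []
    for b in sorted(bins):
        pairs = bins[b]
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        hit_rate = sum(y for _, y in pairs) / len(pairs)
        table.append((round((b + 0.5) * bin_width, 2),
                      round(mean_p, 3), round(hit_rate, 3), len(pairs)))
    return table

# An expert who says "85%" for events that occur about 65% of the time is overconfident:
print(calibration_table([0.85] * 20, [1] * 13 + [0] * 7))
# -> [(0.85, 0.85, 0.65, 20)]
```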

I think this led to the misconception that I believe subject-matter experts are useless. I don’t believe that. But there are some categories of questions in which subject-matter experts don’t know as much as they think they do, and other categories in which they have a hard time doing better than minimalist benchmarks. Examples include simple extrapolation algorithms and — to name an analogy that particularly bugs some of them — the dart-tossing chimpanzee [i.e., the notion that chimpanzees who make predictions by randomly throwing darts at a set of options fare as well or better than experts in some areas].

Labenz: [To re­cap,] prin­ci­pal find­ings of that re­search were that peo­ple are over­con­fi­dent, and that there’s much less to be gained from ex­per­tise, in terms of the abil­ity to make ac­cu­rate fore­casts, than ex­perts would have thought.

Tet­lock: Yes. I re­mem­ber one of the in­spira­tions for this work. When [No­bel Prize-win­ning be­hav­ioral economist] Daniel Kah­ne­man came to Berkeley in 1988, and I de­scribed the pro­ject to him at a Chi­nese restau­rant, he had a clever line. He said that the av­er­age ex­pert in my stud­ies would have a hard time out­perform­ing an at­ten­tive reader of the New York Times.

That is, more or less, an ac­cu­rate sum­mary of the data. Ed­u­cated dilet­tantes who are con­ver­sant in pub­lic af­fairs are not ap­pre­cia­bly less ac­cu­rate than peo­ple who are im­mersed in the sub­ject do­main.

That re­ally both­ered a lot of peo­ple. Still, I think that find­ing doesn’t give us li­cense to ig­nore sub­ject-mat­ter ex­perts. I’d be glad to talk in more de­tail later about what I think sub­ject-mat­ter ex­perts are good for.

Labenz: From there, you take that startling ob­ser­va­tion that the at­ten­tive reader of the _New York Times_ can at least com­pete with ex­perts and you get in­volved in a fore­cast­ing tour­na­ment put on by IARPA [In­tel­li­gence Ad­vanced Re­search Pro­jects Ac­tivity]. Some­time around 2011, I joined your team. That ul­ti­mately led to the book Su­perfore­cast­ing.

Tell us about the struc­tures you cre­ated so that peo­ple like me could sign onto a web­site [and par­ti­ci­pate as fore­cast­ers]. By the way, you did a great job on the web­site.

Tet­lock: I had noth­ing to do with that [the web­site]. There are so many crit­i­cal things I had noth­ing to do with — it’s a very long list. [Tet­lock laughs.]

Labenz: Put­ting a team to­gether was cer­tainly a huge part of it. You cre­ated a struc­ture with differ­ent ex­per­i­men­tal con­di­tions: some peo­ple work by them­selves, some are in teams, some have differ­ent train­ing. Could you walk us through the Good Judg­ment Pro­ject team that ul­ti­mately fueled the book Su­perfore­cast­ing?

Tet­lock: The pro­ject was very much due to the work of some­one who is linked to the EA com­mu­nity: Ja­son Ma­theny of IARPA, a re­search and de­vel­op­ment branch of the in­tel­li­gence com­mu­nity that funded these fore­cast­ing tour­na­ments on a scale that was far more lav­ish than any­thing to which I’d been ac­cus­tomed.

I divide the work into two lines. The earlier line of work is about cursing the darkness. The later line of work is more about lighting candles. The earlier line of work documented that subject-matter experts fall prey to many of the classic cognitive biases in the judgment decision-making literature; they are overconfident, as we just discussed. They also don’t change their minds as much as they should in response to contradictory evidence, and they are susceptible to hindsight bias. Sometimes their probability judgments are incoherent. There are a lot of cognitive biases that were documented in Expert Political Judgment. That’s why I [describe the work as] “cursing the darkness.”

The su­perfore­cast­ing pro­ject was much more of a challenge. How smart can peo­ple be­come if you throw ev­ery­thing you know at the prob­lem — the whole kitchen sink? It wasn’t a sub­tle psy­cholog­i­cal ex­per­i­ment in which you try to pro­duce a big effect with a lit­tle nudge. It was a very con­certed effort to make peo­ple smarter and faster at as­sign­ing prob­a­bil­ity judg­ments to achieve a tan­gible goal. That is when the fore­cast­ing tour­na­ment that IARPA spon­sored took place. There were a num­ber of big re­search teams that com­peted, and we were one of them. It was a “light­ing the can­dles” pro­ject, and we threw ev­ery­thing we could at it to help peo­ple perform at their op­ti­mum level.

There were four cat­e­gories of strate­gies that worked pretty well, which we fine-tuned over time:

1. Select­ing the right types of fore­cast­ers, with the right cog­ni­tive abil­ity and cog­ni­tive style pro­files.
2. Pro­vid­ing var­i­ous forms of prob­a­bil­is­tic rea­son­ing train­ing and de­bi­as­ing ex­er­cises.
3. Fa­cil­i­tat­ing bet­ter team­work.
4. Us­ing bet­ter ag­gre­ga­tion al­gorithms.

Each of those cat­e­gories played a ma­jor role.

Labenz: Let’s talk briefly about each one. First, who are the right peo­ple? That is, what are the cog­ni­tive pro­files of the most suc­cess­ful fore­cast­ers?

Tet­lock: I wish I had a slide that could show you Raven’s Pro­gres­sive Ma­tri­ces. It is a dev­il­ishly tricky test that was de­vel­oped in the 1930s and is a clas­sic mea­sure of fluid in­tel­li­gence — your abil­ity to en­gage in com­plex pat­tern recog­ni­tion and hy­poth­e­sis test­ing. It was used by the U.S. Air Force to iden­tify farm boys who had the po­ten­tial to be­come pi­lots, but couldn’t read. There’s no [lin­guis­tic] in­ter­face. You sim­ply look at it and de­ter­mine whether a pat­tern fulfills the re­quire­ments. It has noth­ing to do with poli­tics or lan­guage. And it proved to be an im­por­tant fac­tor; fluid in­tel­li­gence is not to be un­der­es­ti­mated.

That doesn’t mean ev­ery­body has to have an IQ of 150 in or­der to be a su­perfore­caster. They don’t. But it does help to be at least a stan­dard de­vi­a­tion above the norm in fluid in­tel­li­gence.

The other fac­tors had more to do with your style of think­ing and how you think about think­ing; they didn’t have to do with raw crunch­ing power. Ac­tive open-mind­ed­ness — in the ear­lier work we called it “fox­i­ness” — is your will­ing­ness to treat your be­liefs as testable hy­pothe­ses and prob­a­bil­is­tic as­ser­tions, not dog­matic cer­tain­ties. If you’re will­ing to do that, it’s a good sign that you’re in­ter­ested in be­com­ing more gran­u­lar.

There’s an old joke in [the field of] judgment decision-making: Deep down, people can only distinguish between three degrees of certainty: yes, no, and maybe. But to be a really good forecaster in these tournaments, you needed to be granular. You needed to be like a world-class poker player. You needed to know the difference between a 60/40 bet and a 40/60 bet, or a 55/45 bet and a 45/55 bet. Depending upon the domain, you can become very granular. Some domains are more granular than others, but you can almost always do better than just “yes, no, or maybe.” The worst forecasters were more in the “yes, no, maybe” zone. Their judgments are either very close to zero, very close to one, or very close to 0.5. It was all binary [for them].

So, [we sought] people who had reasonable scores in fluid intelligence and who were actively open-minded. The final factor was just a matter of curiosity and a willingness to give things a try. The kinds of questions IARPA was asking us in this tournament were about the Syrian Civil War, and whether Greece was going to leave the Eurozone, and what Russia was going to do in Crimea. There were questions about areas all over the world: the South China Sea, North Korea, Germany. There were questions about Spanish bond yield spreads. It was an incredible, miscellaneous hodgepodge of questions.

Some peo­ple would say to us, “Th­ese are unique events. There’s no way you’re go­ing to be able to put prob­a­bil­ities on such things. You need dis­tri­bu­tions. It’s not go­ing to work.” If you adopt that at­ti­tude, it doesn’t re­ally mat­ter how high your fluid in­tel­li­gence is. You’re not go­ing to be able to get bet­ter at fore­cast­ing, be­cause you’re not go­ing to take it se­ri­ously. You’re not go­ing to try. You have to be will­ing to give it a shot and say, “You know what? I think I’m go­ing to put some men­tal effort into con­vert­ing my vague hunches into prob­a­bil­ity judg­ments. I’m go­ing to keep track of my scores, and I’m go­ing to see whether I grad­u­ally get bet­ter at it.” The peo­ple who per­sisted tended to be­come su­perfore­cast­ers.
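[For readers curious what “keeping track of my scores” involves, the Good Judgment work reports accuracy with Brier scores, the mean squared error between probability forecasts and outcomes; the snippet below is an illustrative sketch rather than the tournament’s actual scoring code.]

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and binary outcomes
    (0 = did not happen, 1 = happened). Lower is better: 0.0 is perfect, and
    always saying 0.5 scores 0.25. Illustrative sketch, not the official scorer."""
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)

# A granular forecaster vs. a "yes, no, maybe" forecaster on the same five events:
outcomes = [1, 0, 1, 1, 0]
print(round(brier_score([0.8, 0.3, 0.7, 0.9, 0.2], outcomes), 3))  # 0.054
print(round(brier_score([0.5, 0.5, 0.5, 0.5, 0.5], outcomes), 3))  # 0.25
```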

Labenz: That is es­sen­tially a Bayesian ap­proach. You don’t need a large set of po­ten­tial out­comes in or­der to be able to make a pre­dic­tion, be­cause you have your pri­ors and, as you said, you are will­ing to change your mind and up­date your be­liefs ac­cord­ingly.

This is a bit of an aside, but what is hap­pen­ing when peo­ple take a col­lec­tion of facts and then spit out a num­ber? Do you have any in­sight, for ex­am­ple, into how peo­ple choose be­tween 0.65 and 0.7 when they’re try­ing to as­sess the like­li­hood of some­thing like Bashar al-As­sad stay­ing in power for a cer­tain amount of time?

Tet­lock: It’s very con­text-spe­cific. Most of the news is pretty gran­u­lar. What’s the prob­a­bil­ity of Trump be­ing re­elected? Maybe it’s be­tween 35% and 55%. But then there’s a gaffe or a scan­dal and [the odds might] go down a bit. Maybe it ap­pears that Bi­den will be [Trump’s op­po­nent] and [Trump could have] a harder time defeat­ing Bi­den than some­one else, so Trump’s chances go down. Var­i­ous things hap­pen and you ad­just.

One of the interesting things about the best forecasters is they’re not excessively volatile. There’s an interesting division in the field of judgment decision-making between those who say people are too slow to change their minds — too cognitively conservative — and those who say people are excessively volatile. That second group points to the stock market. Robert Shiller is famous for making that argument.

Both views are true. Some people are excessively conservative sometimes and, at other times, excessively jumpy. The path to doing well in forecasting tournaments is to avoid both errors. [When we train forecasters,] the second prong of our approach is about helping people engage in what we call an “error balancing process.” We don’t just hammer away at them if they are overconfident. Pushing people down just leads to underconfidence. They need to be aware of the risk of both and appreciate that it’s a balancing act. We try to sensitize people to the conflicting errors to which we’re all susceptible when we try to make sense of a wickedly complex environment.

Labenz: How do you encourage people to work in teams? And how did the best-performing teams work together? What made a great team?

Tet­lock: The best teams were skil­lful at im­pro­vis­ing di­vi­sions of la­bor, but that was more of an ad­minis­tra­tive [ac­com­plish­ment]. Still, it is im­por­tant, and we didn’t give them much help with it. What guidance we did give them, the best teams took se­ri­ously.

I would say the most im­por­tant at­tribute was the ca­pac­ity to dis­agree with­out be­ing dis­agree­able — to learn how to state your dis­agree­ments with other peo­ple in ways that don’t push them into a defen­sive cor­ner. Peo­ple are psy­cholog­i­cally frag­ile. And con­ver­sa­tions are not just in­for­ma­tion ex­changes. Truth-seek­ing is rarely the dom­i­nant goal. In most con­ver­sa­tions, the dom­i­nant goal is mu­tual face-sav­ing. We help each other along. And when one of us stum­bles, be­ing nice means we help pre­serve their so­cial iden­tity and rep­u­ta­tion for be­ing a good part­ner. We pre­serve the so­cial com­pact.

That of­ten means that if you say some­thing stupid, I say, “Yeah, in­ter­est­ing.” I don’t try to un­pack it and say, “Why, ex­actly, do you think that?” Great team mem­bers have the abil­ity to en­gage in rea­son­ably can­did ex­changes, in which peo­ple ex­plore the as­sump­tions un­der­ly­ing their be­liefs and share those per­spec­tives. They have the abil­ity — and the will­ing­ness — to un­der­stand an­other per­son’s point of view so well that the other per­son says, “I couldn’t have sum­ma­rized my po­si­tion any bet­ter than that.”

Every­body in the room would prob­a­bly agree that per­spec­tive-tak­ing is im­por­tant. But I’ll bet many of you wouldn’t en­joy try­ing to en­gage in per­spec­tive-tak­ing for John Bolton, the cur­rent na­tional se­cu­rity ad­vi­sor [this con­ver­sa­tion took place be­fore John Bolton was fired]. You might not want to sum­ma­rize his views on Iran or North Korea in such a way that he would say, “That is a con­cise and ac­cu­rate un­der­stand­ing of my views.” Many of you would say, “I think I’d rather slit my wrists.”

Labenz: It’s the John Bolton in­tel­lec­tual Tur­ing Test. It’s prob­a­bly not su­per-ap­peal­ing to most of us.

Tet­lock: Or imag­ine per­spec­tive-tak­ing for some­one more ex­treme than John Bolton. Try to see the world from Kim Jong Un’s point of view.

Labenz: That sounds challeng­ing, to say the least. So, [good team mem­bers] dis­agree with­out be­ing dis­agree­able, and can ac­cu­rately sum­ma­rize their team­mates’ po­si­tions and pass in­tel­lec­tual Tur­ing Tests.

On top of that, you added a layer of al­gorith­mic op­ti­miza­tion. Will [MacAskill] spoke a lit­tle bit, in his open­ing re­marks, about the challenge of be­com­ing clones of one an­other. If we are all clones of one an­other, then we can’t use our similar be­liefs as re­in­force­ment. And what you found in fore­cast­ing was that when peo­ple are cog­ni­tively di­verse, you were able to syn­the­size their be­liefs and ar­rive at more ac­cu­rate pre­dic­tions. You com­bined them in a clever way. Tell us how you did that.

Tetlock: We had wonderful support from some very insightful statisticians like Lyle Ungar, Emile Servan-Schreiber, and Jon Baron. They deserve the credit for these algorithms.

The core idea is fun­da­men­tally sim­ple. Every­body has heard about “the wis­dom of the crowd” and how the av­er­age of the crowd tends to be more ac­cu­rate than the ma­jor­ity of the in­di­vi­d­u­als who con­tributed to that av­er­age. That is com­mon knowl­edge now.

To take it one step beyond that, you weight the most recent forecasts and the forecasts of the best-performing forecasters more heavily to create a weighted average. That will do better than the unweighted average.
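[As a rough illustration of that second step: the specific weighting scheme below (exponential recency decay plus a skill weight) is an assumption for illustration; the Good Judgment Project’s published algorithms are more elaborate.]

```python
import math

def weighted_average(forecasts, half_life_days=7.0):
    """forecasts: list of (probability, age_in_days, skill_weight) tuples.
    Recency decays exponentially with a chosen half-life, and skill_weight
    could come from a forecaster's track record. Illustrative sketch only."""
    numerator = denominator = 0.0
    for p, age, skill in forecasts:
        w = skill * math.exp(-math.log(2) * age / half_life_days)
        numerator += w * p
        denominator += w
    return numerator / denominator

# A strong, fresh 0.8; an average, fresh 0.6; and a stale 0.4 forecast:
print(round(weighted_average([(0.8, 0, 2.0), (0.6, 1, 1.0), (0.4, 10, 1.0)]), 3))  # ~0.7
```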

The third step [comes into play in cases in which] the weighted av­er­age con­tains more in­for­ma­tion than you re­al­ized. That hap­pens when peo­ple who nor­mally dis­agree with each other sud­denly start agree­ing, and those num­bers are flow­ing into the weighted av­er­age.

It’s not that peo­ple who think similarly are pro­duc­ing the weighted av­er­age. There’s some cog­ni­tive di­ver­sity. The ex­am­ple I’m fond of is from the movie _Zero Dark Thirty_, in which James Gan­dolfini is play­ing former CIA di­rec­tor Leon Panetta. If you haven’t seen the movie, it’s worth watch­ing; it con­tains a re­ally great ex­am­ple of how not to run a meet­ing and how not to use prob­a­bil­ities.

Imag­ine that you’re the CIA di­rec­tor and each of your ad­vi­sors is pro­vid­ing a prob­a­bil­ity es­ti­mate of how likely it is that Osama bin Laden is in a par­tic­u­lar com­pound in the Pak­istani town of Ab­bot­tabad. Each of them says there is a prob­a­bil­ity of 0.7.

What aggregation algorithm should the director use to distill the advice? The question you need to ask first is: How cognitively diverse and independent are the perspectives represented in the room? If there are five people and each of them arrives at 0.7 using a different type of information — cybersecurity, satellite reconnaissance, human intelligence, and so forth — and they’re in a siloed organization in which nobody talks to anybody else, what is the true probability? They’ve independently converged on 0.7. [The statisticians] say the answer is not mathematically deducible from the information given, but it can be statistically estimated if there are enough forecasters making enough judgments on enough topics.

That’s what the statis­ti­ci­ans did. The an­swer in our case was to turn 0.7 into roughly 0.85. That’s ag­gres­sive and you run risks do­ing that. But it proved to be a very good way of win­ning the fore­cast­ing tour­na­ment.
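[The published Good Judgment aggregation work describes an extremizing transform along roughly these lines; the exponent below is chosen for illustration so that 0.7 maps to about 0.85, not a value quoted in the talk.]

```python
def extremize(p, a=2.0):
    """Push an aggregated probability away from 0.5 to compensate for the
    information that simple averaging washes out when forecasters draw on
    independent evidence. The functional form mirrors the extremizing
    described in the published GJP work; the exponent a is illustrative
    and would normally be tuned on historical data."""
    return p ** a / (p ** a + (1 - p) ** a)

# Five siloed advisors independently at 0.7; the extremized aggregate is ~0.85.
print(round(extremize(0.7), 3))  # 0.845
```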

Labenz: In other contexts, I’ve heard you call that extremizing the views of your most trusted advisors. Is that what ultimately produced the very best forecast over time?

Tet­lock: It did. But it did oc­ca­sion­ally crash and burn.

Labenz: What’s an example of a time when extremizing went awry?

Tetlock: There were a number of methodological shortfalls of the IARPA tournaments. The questions had fixed timeframes. With a forecast, if there is less and less time remaining for Bashar al-Assad to fall, or less and less time for Spanish bond yield spreads to hit a certain threshold value, you should gradually move your probability downward. Sometimes the extremized aggregate moved down too quickly and took a hit because something happened at the last minute.

Labenz: I’d like to touch on a few other projects that you’ve been involved with more recently. You were involved in a tournament that pitted humans against machines and human-machine hybrids against each other. And you’re now involved in another IARPA project called FOCUS [Forecasting Counterfactuals in Uncontrolled Settings], which I think is currently going on. So I’d love to hear a little bit about those two, starting with your experience with the humans-versus-machines (and hybrids) tournament.

Tet­lock: I’m not a for­mal com­peti­tor in the hy­brid fore­cast­ing com­pe­ti­tion, so I don’t have too much to say about that, ex­cept that the al­gorithms have a very hard time get­ting trac­tion on the kinds of ques­tions IARPA likes to ask. If you ask al­gorithms to win board games like Go or chess, or you use al­gorithms to screen the loan wor­thi­ness of cus­tomers for banks, al­gorithms wipe the floor with the hu­mans. There’s no ques­tion about that.

When you ask algorithms to make predictions about the current Persian Gulf crisis and whether we are going to be at war with Iran next week, they’re not very helpful. This goes back to the objection that you can’t assign probabilities to some categories of events because they’re too unique. They’re not really quantifiable. And the algorithms do indeed struggle, and the humans are able to do better than the algorithms in that sort of context.

Is that always go­ing to be so? I don’t like to say “always.” I sus­pect not. But right now, given the state of the art, it may be that bank loan officers and deriva­tives traders should be wor­ried about their jobs. But should geopoli­ti­cal an­a­lysts? I’m not so sure. I don’t think so.

Labenz: Tell us a lit­tle bit about the FOCUS pro­ject that you’re now in­volved with as well.

Tetlock: That represents an interesting turn. It is a project that addresses something that has been of great interest to me for decades: the problem of counterfactuals and how difficult it is to learn from history. It is psychologically difficult to grasp how reliant we are on counterfactual assumptions when we express causal opinions that seem factual to us.

After 1991, if you had asked con­ser­va­tives what hap­pened with the demise of the Soviet Union, they would have con­fi­dently said, “We won. Rea­gan won the Cold War.” And liber­als would have said, “The Soviet econ­omy was im­plod­ing, and that’s just the way it hap­pened. If any­thing, Rea­gan slowed it down. He didn’t help.”

Peo­ple act as if those state­ments aren’t coun­ter­fac­tual as­ser­tions. But upon close in­spec­tion, they are. They are based on the as­sump­tion that if Rea­gan had not won the elec­tion against Jimmy Carter and been inau­gu­rated in Jan­uary 1981, and a two-term Carter pres­i­dency and a two-term Mon­dale pres­i­dency had un­folded in­stead, then the Soviet Union’s demise would have hap­pened ex­actly the same way. That is awfully spec­u­la­tive. You re­ally don’t know. Coun­ter­fac­tu­als are the soft un­der­belly of hu­mans’ poli­ti­cal and eco­nomic be­lief sys­tems.

If peo­ple be­come more rigor­ous and thought­ful, they can ideally be­come more ac­cu­rate. They should also be­come bet­ter at ex­tract­ing les­sons from his­tory. That, in turn, should make you a bet­ter con­di­tional fore­caster — if you’re ex­tract­ing the right causal les­sons. But how do you know you’re get­ting bet­ter? You can’t go back in a time ma­chine and as­sess what Rea­gan did or didn’t do.

The so-called “solu­tion” that IARPA has come up with is to rely on simu­lated wor­lds. They have cho­sen the video game Civ­i­liza­tion 5 for the FOCUS pro­gram. How many of you have heard of that? [Many au­di­ence mem­bers raise their hands.] Oh my heav­ens. How in­ter­est­ing. Civ­i­liza­tion 5 has sev­eral func­tional prop­er­ties that make it in­ter­est­ing as a base for IARPA’s work. It has com­plex­ity, path de­pen­dency, and stochas­tic­ity or ran­dom­ness. Those are key fea­tures that make the real world ex­tremely hard to un­der­stand, and they ex­ist in Civ­i­liza­tion 5. What is differ­ent in the world of the game is data availa­bil­ity. You can as­sess the ac­cu­racy of coun­ter­fac­tual judg­ments in the simu­lated world of Civ­i­liza­tion 5 in a way that is ut­terly im­pos­si­ble in the real world. That was the lure for IARPA.

Once again, teams compete, but not to generate accurate forecasts about whether there’s going to be a war in the Persian Gulf. They compete to determine whether different civilizations — the Incas and the Swedes, for example — will go to war. You have to be very careful about transferring real-world knowledge into the Civilization 5 universe. It has its own interesting logic. That’s also part of the challenge: How quickly can teams adjust to the simulated world and learn to make more accurate counterfactual judgments?

We’re beginning to use other simulated worlds, too. We’re doing a little bit of work, for example, with Bob Axelrod on an iterated Prisoner’s Dilemma game. These games are easy to understand without noise. But when there’s noise, all hell breaks loose. So we’re interested in people’s ability to master various simulated worlds. And the leap of faith is that if you get really good at reasoning in these simulated worlds, you’ll be able to do better in the actual world. That will be the ultimate validation: if participants return to the real world and are better at counterfactual reasoning. [Our goal is] to train intelligence analysts by having them go through these protocols so that they are subsequently better at making conditional forecast judgments that can perhaps save lives and money.
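[As a small illustration of why noise makes these games so much harder: the payoff-free setup and the 5% error rate below are assumptions for illustration, not details from the Axelrod or FOCUS work. Two tit-for-tat players cooperate perfectly in a noiseless game, but a single misimplemented move sends them into long echoes of retaliation.]

```python
import random

def noisy_tft_match(rounds=200, noise=0.05, seed=0):
    """Two tit-for-tat players; with probability `noise`, a player's intended
    move is flipped. Returns the fraction of rounds with mutual cooperation.
    Payoff-free sketch to show how implementation noise degrades cooperation."""
    rng = random.Random(seed)
    last_a, last_b = "C", "C"                 # both effectively start by cooperating
    mutual_cooperation = 0
    for _ in range(rounds):
        intent_a, intent_b = last_b, last_a   # tit-for-tat: copy the opponent's last move
        move_a = intent_a if rng.random() > noise else ("D" if intent_a == "C" else "C")
        move_b = intent_b if rng.random() > noise else ("D" if intent_b == "C" else "C")
        mutual_cooperation += (move_a == "C" and move_b == "C")
        last_a, last_b = move_a, move_b
    return mutual_cooperation / rounds

print(noisy_tft_match(noise=0.0))   # 1.0: perfect cooperation in the noiseless game
print(noisy_tft_match(noise=0.05))  # well below 1.0: occasional errors echo into retaliation
```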

Labenz: I think that is in­ter­est­ing and rele­vant to the EA com­mu­nity. One of the move­ment’s early in­tel­lec­tual foun­da­tions was to make an im­pact. And that’s in­her­ently a coun­ter­fac­tual ex­er­cise, right? What would have hap­pened if bed­nets were not dis­tributed, or if chil­dren were not de­wormed, or if cash trans­fers were not handed out? It is a big challenge to come up with rigor­ous an­swers in the real world, where you don’t have ac­cess to con­crete an­swers about what would have hap­pened oth­er­wise.

Tet­lock: I think those ques­tions are eas­ier to an­swer be­cause you can do ran­dom­ized con­trol­led tri­als. There is a con­trol group. And when you have a con­trol group, you don’t need the coun­ter­fac­tu­als any­more. Coun­ter­fac­tual his­tory is, in a sense, imag­i­nary con­trol groups that we con­struct when we’re in a data-poor en­vi­ron­ment.

Labenz: That’s fas­ci­nat­ing. A lot of that work has been done. And as the move­ment has evolved over the last three to five years, there has been a shift. We have run ran­dom­ized con­trol­led tri­als where we can.

But at some point, you run out of things that are eas­ily testable. You have a port­fo­lio of in­cred­ible win­ners, such as bed­nets, that are like the Google and Face­book equiv­a­lents of char­ity. But then you need to find new ar­eas, es­pe­cially as the move­ment gets big­ger and at­tracts more re­sources. Now we have a bit of a blank can­vas and a re­ally huge prob­lem space.

So, I think coun­ter­fac­tual rea­son­ing starts to get to some of the hard­est ques­tions that the EA com­mu­nity is fac­ing. I’d love to talk about a few di­men­sions of that prob­lem. One has to do with am­bi­guity. In the fore­cast­ing tour­na­ments that you’ve de­scribed, there is a clear out­come: ei­ther the bond spreads hit the thresh­old or they don’t, or As­sad is in power or he’s not. But so many things of tremen­dous im­por­tance are much more am­bigu­ous. It’s difficult to clearly state what did hap­pen. How have you thought about tack­ling that challenge? Or, if you haven’t tack­led it, what are les­sons that come out of the more struc­tured challenges that could be ap­plied to much more am­bigu­ous situ­a­tions?

Tet­lock: I’m not quite sure I un­der­stand the ques­tion. But I do think that med­i­tat­ing on his­tor­i­cal coun­ter­fac­tu­als is a use­ful form of con­scious­ness-rais­ing. I’m not talk­ing about so­cial sci­ence fic­tion, like The Man in the High Cas­tle, which I think is a differ­ent form of an in­ter­est­ing thought ex­per­i­ment.

I think his­tor­i­cal coun­ter­fac­tu­als are a very use­ful way of sen­si­tiz­ing you to your ig­no­rance. They are a use­ful cor­rec­tive to over­con­fi­dence. When you un­der­stand that the causal propo­si­tions driv­ing your fore­cast rest on as­sump­tions that are de­bat­able, you tend to be­come bet­ter cal­ibrated. Some­times the coun­ter­fac­tual as­sump­tions are not as de­bat­able be­cause they rest on — if not ran­dom­ized con­trol­led tri­als — very so­phis­ti­cated econo­met­ric tests. In that case, a bit more con­fi­dence may be jus­tified. But be­ing aware of how much cre­dence is rea­son­able to put in the soft un­der­belly of your be­lief sys­tem is a use­ful form of con­scious­ness-rais­ing.

Labenz: Another challenge for the EA com­mu­nity: It can be hard to cal­ibrate your effort when con­sid­er­ing long time hori­zons or low-prob­a­bil­ity events. What will the pop­u­la­tion of the world be in 300 years? That is a ques­tion that is im­por­tant to a lot of peo­ple in the room. They feel that if it’s a thou­sand times what it is now and there is an op­por­tu­nity to im­pact that out­come, then they should care about it a thou­sand times as much. But [there is the is­sue of] low dis­count­ing. When things get ei­ther very rare or very far into the fu­ture, it is tough to think about de­vel­op­ing good judg­ment.

Tet­lock: I was just at Ted Nord­haus’s Break­through In­sti­tute meet­ing at Cavallo Point. There were a lot of peo­ple at that meet­ing who are spe­cial­ists in cli­mate and pop­u­la­tion, and who had mod­els of what the world might be like in 2100, or even 2300. Th­ese were ag­gres­sive, long-range fore­casts way be­yond any­thing we look at. In do­ing poli­ti­cal judg­ment work, our longest fore­cast is for five to 10 years, and in our work with IARPA, the longest fore­casts are 18 to 24 months. Most of them are 12 months ahead or less.

How do you bridge these short-term forecasting exercises with the need to engage in longer-term societal planning? Maybe one of the more radical implications of a lack of accuracy in short-term forecasting is that long-range planning does not make sense. What are you planning for if you can’t forecast very much? That makes some people very upset. The Intergovernmental Panel on Climate Change, for example, has forecasted out to the year 2100. That forecast is one of the key underpinnings of concern about climate change. You have nitpickers who say you can’t really predict anything.

You can pre­dict some longer-term things if there is a very strong sci­ence base for them. And of course, that is the ar­gu­ment made about cli­mate change and, to some de­gree, pop­u­la­tion. How­ever, the spread of es­ti­mates on pop­u­la­tion is sur­pris­ingly wide.

We’ve been de­vel­op­ing a method­ol­ogy we call “Bayesian ques­tion clus­ter­ing,” which is de­signed to bridge the gap be­tween short- and medium-term fore­cast­ing so that peo­ple can get feed­back on the ac­cu­racy of their judg­ments on a hu­man timescale.

[Slide 2]

This [Tetlock refers to the slide] is one of those “big Davos nosebleed abstraction” things. [Tetlock laughs.] Are we on a trajectory for AI to drive a fourth industrial revolution, which will dislocate major white-collar labor markets by 2040 or 2050? I won’t have to worry about it [I’ll likely be retired], but maybe this audience has reason to be concerned. And what would you expect to observe in 2015 or 2016 if we were on that trajectory? You might expect AlphaGo to beat the world Go champion in 2016.

That happened. Does that mean we’re on a trajectory toward this grand scenario [of AI driving a fourth industrial revolution]? No, it doesn’t. But does it increase the probability? Yes, a little bit. And these questions [on the slide] were nominated by subject-matter experts. The idea is that each of these micro-indicators has some diagnosticity vis-a-vis the ultimate outcome: driverless Uber cars picking up people in Las Vegas in 2018 for fares (not just as an experiment); Watson MD beating the world’s best medical diagnosticians in 2018; half of the accounting jobs in the U.S. becoming automated by 2020 (that’s somewhat under the control of Congress, of course); robotics industry spending exceeding $155 billion in 2020. [These milestones are meant to indicate how] fast we may be on the road to that particular scenario. And I think the answer is: The future is coming, but more slowly than some people might have thought in 2015.

Will there be wide­spread au­tonomous robotic war­fare by 2025? Here are sev­eral in­di­ca­tors peo­ple de­vel­oped and the like­li­hood ra­tio as­so­ci­ated with each.

[Slide 3]

There are two indicators that seem to be most indicative. One is Boston Dynamics’ ATLAS robot completing an Army land navigation course off-tether by 2020. That has a likelihood ratio of 2.5, meaning the indicator is judged to be 2.5 times more likely to occur if we are on that scenario trajectory than if we are not; a ratio that departs significantly from 1.0 is informative. Similarly, there is a likelihood ratio of 2.0 [for that scenario trajectory] if an unmanned combat aerial vehicle defeats a human pilot in a simulated dogfight by 2020.

You can think of this as al­most a Ter­mi­na­tor sce­nario and how close we are to a world in which these kinds of mechanisms are, with­out our di­rec­tion, launch­ing at­tacks. A lot of things hav­ing to do with bat­tery life, face recog­ni­tion, and the so­phis­ti­ca­tion of tran­sis­tors would have to hap­pen. There is a whole se­ries of in­di­ca­tors.

I’m not go­ing to claim that this se­ries is cor­rect. It hinges on the ac­cu­racy of the sub­ject-mat­ter ex­perts. But I do think it’s a use­ful ex­er­cise for bridg­ing short- to medium-term fore­cast­ing with longer-term sce­nar­ios and plan­ning. If you want to stage a de­bate be­tween Steve Pinker-style op­ti­mists and doom­sters, and ask what in­fant mor­tal­ity in Africa will look like in the next three, five, or 10 years, [you could try this type of ap­proach].

Labenz: Just to make sure I un­der­stand the like­li­hood ra­tios cor­rectly: Do they come from the cor­re­la­tions among fore­cast­ers?

Tet­lock: They are speci­fied by sub­ject-mat­ter ex­perts.

Labenz: So they are taken as an in­put.

Tet­lock: Yes, it’s es­sen­tially an in­put to the model. And by the way, it comes back to the ques­tion at the be­gin­ning of this con­ver­sa­tion: What are sub­ject-mat­ter ex­perts good for? One of the things they are re­ally good for is gen­er­at­ing in­ter­est­ing ques­tions. I’m not so sure they’re good at gen­er­at­ing the an­swers. But they are good at gen­er­at­ing ques­tions.
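[To see how expert-specified likelihood ratios like the 2.5 and 2.0 above feed into a scenario probability, here is a minimal Bayes-factor sketch; the prior and the treatment of indicators as independent evidence are illustrative assumptions, not details from the slides.]

```python
def update_with_indicators(prior, likelihood_ratios):
    """Combine a prior probability for a scenario with likelihood ratios for
    indicators that have occurred, treating them as independent evidence.
    Illustrative sketch: the prior and the independence assumption are mine."""
    odds = prior / (1 - prior)
    for lr in likelihood_ratios:
        odds *= lr               # each observed indicator multiplies the odds
    return odds / (1 + odds)

# Suppose a 0.2 prior on the scenario and both indicators (LR 2.5 and LR 2.0) occur:
print(round(update_with_indicators(0.2, [2.5, 2.0]), 3))  # 0.2 -> ~0.556
```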

Labenz: Speak­ing of in­ter­est­ing ques­tions, you men­tioned the up­com­ing elec­tion ear­lier. Is there a way to use some of these tech­niques to get a han­dle on how im­por­tant some­thing like the next elec­tion might be? Can we use this kind of con­di­tional fore­cast­ing to tell us what re­ally mat­ters so that we can fo­cus on try­ing to im­pact par­tic­u­lar out­comes?

Tet­lock: I don’t know how many of you have heard of Dun­can Watts. He’s brilli­ant. He just wrote a pa­per for Na­ture on pre­dict­ing his­tory, a text-anal­y­sis pro­ject that goes back at least 100 years. What did peo­ple in 1900 think were re­ally im­por­tant [events] that would be re­mem­bered by peo­ple in 2000 or 2020? And which events did peo­ple in 2000 and 2020 _ac­tu­ally_ think were im­por­tant? When we are im­mersed in the pre­sent, there’s a strong ten­dency for us to think many things are ex­tremely im­por­tant, but peo­ple in 50 or 100 years do not find them im­por­tant.

It raises an in­ter­est­ing ques­tion: Who’s right? It could be that we, in the pre­sent, are highly aware of how con­tin­gency-rich our en­vi­ron­ment is, and how there are dra­mat­i­cally differ­ent pos­si­ble wor­lds lurk­ing. If we have an elec­tion that goes one way or the other, or if the econ­omy goes one way or the other, or if there’s a war in the Per­sian Gulf, will that be huge? Or will it be a foot­note in his­tory 100 years from now, or maybe not even a foot­note?

Dun­can Watts’s ar­ti­cle is lovely. It is a method­olog­i­cally beau­tiful ex­am­ple of the use of text anal­y­sis. It shows these in­ter­est­ing asym­me­tries in how we look at the world and it is a form of con­scious­ness-rais­ing. It helps us calm down a bit and look at our­selves as part of a longer tem­po­ral con­tinuum, which is, of course, a prob­lem a lot of peo­ple have with our work and fore­cast­ing tour­na­ments. They think it in­duces “anal­y­sis paral­y­sis” among ac­tivists, be­cause it’s very hard to build up mo­men­tum around some­thing like cli­mate change.

For ex­am­ple, if I think there’s a 72% chance the IPCC is cor­rect about cli­mate change, am I a cli­mate-change be­liever or am I a de­nier? What kind of crea­ture am I who would say such a thing? A pain in the butt! Some­body who is ob­struct­ing poli­ti­cal dis­course and poli­ti­cal progress.

Dun­can Watts’s pa­per sug­gests that what we think is im­por­tant is prob­a­bly not go­ing to stand the test of time very well. But there are ex­cep­tions: the dis­cov­ery of DNA, the atomic bomb, Hitler. Peo­ple at the time saw those as big events and they were right. But peo­ple are more of­ten wrong than right [in as­sess­ing what is “big” as it is hap­pen­ing].

Labenz: So if you think some­thing in your mo­ment in his­tory is im­por­tant, you’re prob­a­bly wrong. Most things are not that im­por­tant. But some things are.

If I’m try­ing to make a de­ci­sion about how in­volved I want to be in a par­tic­u­lar prob­lem, like the up­com­ing elec­tion, should I go to a team of su­perfore­cast­ers and pose con­di­tional ques­tions? For ex­am­ple: If Trump wins ver­sus some­body else, how likely is a war with Iran? How likely is a nu­clear ex­change? And should I then trust that to guide my ac­tions, or would I be ex­tend­ing the power of su­perfore­cast­ing too far?

Tet­lock: I’m not sure that su­perfore­cast­ing, as it is cur­rently con­figured, is ready for prime-time ap­pli­ca­tions. I think it would need to be ramped up con­sid­er­ably. You would want to have a very ded­i­cated group of good fore­cast­ers who were di­verse ide­olog­i­cally work­ing on that. And I don’t think we have that in­fras­truc­ture at the mo­ment. I think, in prin­ci­ple, it is pos­si­ble, but I wouldn’t recom­mend that you do it within the cur­rent struc­ture.

Labenz: That could be a potential EA project. One of the big preoccupations of this community is identifying the right things to work on. That is a vexing problem in and of itself, even before you get to the question of what can be done.

We don’t have much time re­main­ing, but I want to give you the chance to give some ad­vice or share your re­flec­tions on the EA com­mu­nity. You were on the 80,000 Hours pod­cast, so you are at least some­what aware of the com­mu­nity — a group that is highly mo­ti­vated but wrestling with hard ques­tions about what re­ally mat­ters and how to im­pact out­comes.

Tet­lock: One thing that is a bit dis­con­nected from the work we’ve been talk­ing about on fore­cast­ing tour­na­ments is re­lated to what I call the “sec­ond-gen­er­a­tion tour­na­ment.” The next gen­er­a­tion of tour­na­ments needs to fo­cus on the qual­ity of ques­tions as much as the ac­cu­racy of the an­swers.

Some­thing that has always in­ter­ested me about EA is this: How util­i­tar­ian are you, re­ally? I wrote a pa­per in 2000 called “The Psy­chol­ogy of the Un­think­able: Ta­boo Trade-Offs, For­bid­den Base Rates, and Hereti­cal Coun­ter­fac­tu­als.” It was about the nor­ma­tive bound­aries that all ide­olog­i­cal groups, ac­cord­ing to my model, place on the think­able. There are some things we just don’t want to think about. And effec­tive al­tru­ism im­plies that you’re will­ing to go wher­ever the util­i­tar­ian cost-benefit calcu­lus takes you. That would be an in­ter­est­ing ex­cep­tion to my model. So I’m cu­ri­ous about that.

I’ve had con­ver­sa­tions with some of you about taboos. What are the taboo cog­ni­tions [in the EA com­mu­nity]?

Au­di­ence mem­ber: Nick Bostrom wrote an ar­ti­cle about the haz­ards of in­for­ma­tion and how some knowl­edge could be more harm­ful than helpful.

Tet­lock: Yes. And Cass Sun­stein makes ar­gu­ments along those lines as well. Are there some things we’re bet­ter off not know­ing? Is ig­no­rance bet­ter? I think very few of us want to know the ex­act time we’re go­ing to die. There are some cat­e­gories of ques­tions that we just don’t want to think about. We’d pre­fer not to en­gage.

But I had more spe­cific is­sues in mind. Are there some cat­e­gories of things where mem­bers of the EA com­mu­nity just don’t want to en­gage? I was hav­ing a con­ver­sa­tion in the green room with an in­ter­est­ing per­son who is in­volved in an­i­mal rights. She seemed like a lovely per­son. I think there is a sig­nifi­cant amount of in­ter­est in an­i­mal rights and an­i­mal suffer­ing, and I know there are many liber­tar­i­ans and a fair num­ber of so­cial democrats in the EA com­mu­nity, but are there any fun­da­men­tal­ist con­ser­va­tives? Are there peo­ple who are con­cerned about abor­tion? Would the cause of fe­tal rights be con­sid­ered be­yond the pale?

Au­di­ence mem­ber: It’s not be­yond the pale.

Tet­lock: I’m not the most so­cially sen­si­tive per­son, but I’ve worked in a uni­ver­sity en­vi­ron­ment for 40 years. I’m go­ing to guess that 99% of you are pro-choice. How many of you would say that [fe­tal rights as a cause] is be­yond the pale?

Au­di­ence mem­ber: I had a con­ver­sa­tion about it last night.

Tetlock: What topics might lie in the taboo zone? I suppose pro-Trump cognitions might be in the taboo zone — the idea that Trump has saved us from a nuclear war on the Korean Peninsula because he’s such a wheeler-dealer.

Labenz: I’ve cer­tainly heard some sincerely pro-Trump po­si­tions at EA Global in the past. I do think you’re fac­ing an au­di­ence that is very low on taboos. I bet we could find at least one or two.

Tet­lock: This is very dis­so­nant for me. I need to find taboos. [Laughs.]

Labenz: Let’s go back to the origi­nal ques­tion. Let’s say we are a group that is gen­uinely low on taboo top­ics and will­ing to con­sider al­most any­thing — maybe even fully any­thing.

Au­di­ence mem­ber: Hu­man sac­ri­fice?

Labenz: [Laughs.] Well, we need to watch out for the [trolley] thought experiment, right? But if we are open to most [lines of inquiry], is there a downside to that as well? And what sort of advice do you have for a group like this one?

Tet­lock: It is prob­a­bly why I re­ally like this group. It pushes open-mind­ed­ness to a de­gree that I have not seen in many or­ga­ni­za­tions. It’s un­usual.

You can end with the Dilbert car­toon. It tells you how eas­ily fore­cast­ing can be cor­rupted and why fore­cast­ing tour­na­ments are such a hard sell.
[Slide 1: Dilbert cartoon]