Steelmanning the Case Against Unquantifiable Interventions

There is a strong argument that many of the most important, high-value interventions cannot be robustly quantified. For example, corruption-reduction efforts and other policy changes that increase long-term economic growth, thereby saving lives and reducing suffering, are high-importance, plausibly tractable, and relatively neglected areas. Using the ITN framework, or almost any other expected-value prioritization approach, this means they should be prioritized highly. I want to briefly outline and steelman the counterargument: that these have very low expected value compared to the best short-term and quantifiable interventions, and pick apart some uncertainties and issues. (This argument is not particularly novel, but I haven't seen it made clearly before.)

This is not about long-term interventions, or existential risk reduction, where a number of different concerns apply and the thinking is even harder, so I won't address them. For the sake of this discussion, this translates into something like assuming a relatively high discount rate, so that the long term doesn't matter.

Epistemic Status / Notes: When first encountering EA, one of my early reactions was to say that these "soft" / "squishy" policy interventions were underappreciated. (In my defense / clouding my judgement, I was in a public policy PhD program at the time.) On reflection, I have mostly changed my mind, or at least become far less certain, which is what led to this write-up.

I am NOT highly confident in all of these arguments, but I have put in time thinking about them. Lastly, hard-to-quantify and deeply uncertain interventions around existential risk reduction and the long-term future are different in key ways; thinking about them often makes my head hurt, I don't have clear conclusions, and I have personal biases, so I'm not going to discuss them here.


First, to understand the distribution of expected outcomes, I'll discuss why the best interventions that we know of are orders of magnitude more effective than the average intervention, and why it's hard to find more.

Second, I'll review Peter Hurford's arguments about why we should be wary of "speculative" causes, focusing on the uncertainty and non-quantifiable outcomes rather than his discussion of long-termism and existential risk.

Lastly, I’ll sug­gest that com­par­ing ex­pected value of in­vest­ments is fun­da­men­tally in­tractable, and sug­gest why this would lead to fo­cus­ing on short term quan­tified in­vest­ments ver­sus fo­cus­ing on deeply un­cer­tain or non-quan­tifi­able is­sues.

The best quantifiable interventions

By now, it is a fairly accepted empirical observation that, when considering interventions, the distribution of impact of extant charities is fat-tailed. The items way out in the tail are things like distributing bed-nets, deworming, and giving directly. The question is why these are so effective, and why there are so few of them.

Why are the best interventions so great?

Why are these so effective? It takes an unusual combination of factors to make an intervention highly effective. To start, the intervention must address an important issue, and it must either be novel or target a neglected area.

To explain why this is true, first note that marginal investment has decreasing returns in most domains. That is, the first million dollars spent on preventing starvation is likely to be far more effective than the thousandth million. Second, for most important problems, billions have been spent over decades. Even if only a fraction is spent relatively effectively, it will be very unusual for there to be low-hanging fruit unless a novel approach to the problem is found. So if a cause is not neglected (relative to the scale of the problem), and the intervention is not novel, it is unlikely that the intervention will be highly effective.

This is as true for current effective charities as it is in general: once almost all people impacted by malaria have bed nets, and almost everyone susceptible to schistosomiasis is dewormed, they will stop being priorities for effective charity. (This is, of course, a good thing.)

Even for novel interventions in neglected areas, however, we find very few highly effective interventions. Why?

Why aren't we finding more of these effective interventions?

Even given the insight that we are looking for novel approaches and neglected problems, most charities are comparatively ineffective. While it is easy to forget, it took significant amounts of research across a very broad set of interventions to identify the few that work really well. GiveWell considered 300 charities in 2009, and 400 up to that point, in order to identify what ended up as 10 charities to recommend. Of these, only 1 is still recommended as effective. (And none lost their recommendation because of insufficient room for funding, or because the problem was solved.)

The low "hit" rate for very effective interventions is even more noteworthy given the filters that were in place. Of the tens of thousands of programs that are tried, a small fraction are found worthwhile enough to be continued and made into ongoing charities. Of those, most weren't nominated for review by GiveWell, likely in part because there is little reason to even suspect they are highly effective. Even among those few, many had evidence that pointed to a high likelihood that they were lower-value than the best charities.

This is to be expected. Rossi's Iron Law of Evaluation states that the expected value of any net impact assessment of any large-scale social program is zero. Even given fat tails, most interventions are flops. If effective interventions are rare, and are exploited and driven to be ineffective once the problem is solved, there should be very few such interventions.

The next question is why we find any at all.

Explaining Observed Success

If good interventions are expected to be rare, how do we explain the fact that we do, in fact, find that some interventions are orders of magnitude more effective than others?

I claim that the question has a simple answer: Persi Diaconis and Frederick Mosteller's law of truly large numbers. With a large enough number of samples, any unlikely thing is likely to be observed. And since we keep trying interventions, actively search for those that are effective, and (slowly) abandon those that don't work, it should be unsurprising that we find some that are relatively far out in the distribution.

A bit more technically, if we're sampling with a bias towards the tail, the tail of the overall distribution doesn't need to be fat for the observed tail to be fat. What this (tentative) model implies, however, is that even when looking, we're still unlikely to find interventions that are orders of magnitude more effective than those we currently see.
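This search dynamic can be sketched with a toy simulation (my construction, with assumed numbers, not anything from the literature): even when the underlying distribution of effectiveness is thin-tailed, screening a very large pool of candidate interventions and keeping the best makes the observed winners look many times better than average.

```python
# Toy model: intervention effectiveness drawn from a thin-tailed
# (exponential) distribution, with an exhaustive search that keeps the best.
import random
import statistics

random.seed(0)

# Underlying effectiveness: exponential with mean 1 -- no fat tail here.
population = [random.expovariate(1.0) for _ in range(100_000)]

mean_effectiveness = statistics.mean(population)  # close to 1.0
best_found = max(population)                      # roughly ln(100_000), ~11.5

print(f"mean effectiveness: {mean_effectiveness:.2f}")
print(f"best found by searching: {best_found:.2f}")
print(f"ratio: {best_found / mean_effectiveness:.1f}x")
```

The best candidate found is an order of magnitude above the mean, purely because the search examined so many candidates, which is consistent with both observing standout interventions and not expecting to find ones vastly better still.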

Objections to speculative interventions

Peter Hurford notes that speculative interventions are a priori unlikely to be effective, for several reasons related to the above argument which are worth reviewing. (There are other arguments which seem less relevant, and so they will not be discussed.) I will review them somewhat critically, but all of these points are relevant and support his case.

We Can’t Guess What Works

In addition to the above issue of effective interventions being rare, he points out that people are bad at guessing which interventions are effective, and which are ineffective. People's inability to guess which of the programs that are actively tried are helpful is evidence about their ability to predict the success of future programs.

However, I will claim that it is far weaker evidence than it seems. That's because there is a pre-selection of programs that are evaluated, and programs that most people would expect to be low-value are never pursued. You might object that Rossi's Iron Law applies to interventions that are tried: the expected value of even the interventions that seem plausible is zero.

Still, Rossi's Iron Law is likely misleading, because the sample of attempted interventions is biased as well. The mean is zero, he notes, only after we exclude the most obvious and effective programs. As Rossi's Zinc Law states, "Only those programs that are likely to fail are evaluated." I will note that this non-evaluation was probably true in 1978 when the paper was written, but the movement towards cost-effectiveness analysis means that even effective interventions are likely to be tested to see how effective they are. Given that, it is unclear whether the stylized fact / empirical observation of the Iron Law is still correct.

But going a step further, a complete sample would include not only likely effective programs, but obviously effective things: whether teaching a subject in school, say biology, is effective at increasing children's knowledge of that material, or whether distributing food in regions with starving people reduces deaths from starvation. In both cases, people would, presumably, correctly guess the direction of the effect.

I'll call this the Diamond Law of Evaluation: programs that are effective in obvious enough ways are not evaluated on that basis. Any evaluation of GiveDirectly is about spillover effects, or effects over time, not about whether giving someone money makes the amount of money they have increase. That means that some proportion of proposed interventions should be expected to work well.

Isn’t there a Track Record of Failures?

If we can predict what works, why is there a track record of failure? As noted above, 90% of the charities initially chosen by GiveWell are no longer recommended. Hurford argues that this means they failed, and that this failure rate should be expected to apply to future programs.

But on review, the track record doesn't imply these interventions failed, exactly; they were not found to be ineffective or harmful. Instead, most such charities were downgraded in effectiveness, but it seems that none were found to have negative or even near-zero impacts. The track record also shows that focusing on a priori important and neglected areas is a good way to find effective interventions.

So we can conclude that there are a bunch of factors we care about, and they have different influences.

Predicting Effectiveness

Given all of the factors we care about, we need to weigh several of them to arrive at an estimate of the expected value. Before doing that, we should reconsider what we're estimating.

Cost Effectiveness isn't Effectiveness

The previous discussion has mostly focused on the impacts of interventions, not their cost. Thankfully, it turns out that cost is much easier to predict than impact: it's not exact, but we're shocked if our estimate of cost is off by an order of magnitude, while we're only mildly surprised if our estimate of impact is off by a similar factor. This is critical, because it means impact estimates are far more important.

If we're looking at a potential intervention affecting 100,000 people that costs $10 million, the difference between it saving 0.1% of the people and it saving 10% is tremendous; the first case is $100,000 per life saved, a worthwhile but not amazing intervention, versus $1,000 per life saved, putting it among the most effective interventions. The difference between it costing $9 million and $11 million, on the other hand, is tiny.
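Spelling out the arithmetic (a sketch using the hypothetical above: a $10 million program reaching 100,000 people, with the two impact scenarios; these are illustrative numbers, not real program data):

```python
# Cost per life saved under the two impact scenarios.
population = 100_000      # people affected (hypothetical)
cost = 10_000_000         # total program cost in dollars (hypothetical)

for fraction_saved in (0.001, 0.10):      # saving 0.1% vs. 10% of those affected
    lives_saved = population * fraction_saved
    cost_per_life = cost / lives_saved
    print(f"{fraction_saved:.1%} saved -> ${cost_per_life:,.0f} per life")
```

A 20% error in the cost estimate moves these figures by only 20%, while the impact uncertainty moves them by two orders of magnitude, which is why the impact estimate dominates.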

This is particularly true when we're uncertain whether an intervention works at all. Impact evaluations might show an improvement of 0.1%, or an improvement of 10%, a two-order-of-magnitude difference. They might also show that the impact is −0.1%, which is a much bigger deal, since it means we're paying money to make things worse. Cost, however, is rarely this uncertain, and it's not usually negative (it can happen, but that's not our current focus).

Past Performance Doesn't Guarantee Future Results

Given the discussion above, we should naively expect that the effectiveness of future interventions is distributed similarly to the effectiveness of past interventions. This isn't quite correct, because the future interventions we're looking at are the ideas that we think are the most effective, rather than the full set of interventions that get tried.

We do have a comparison class for this, since GiveWell has been giving out Incubation Grants that (in part) fund exactly this sort of intervention that is expected to be effective. Unfortunately, there are only a few data points, so our inference will be weak. Not only that, but the differences in intervention types mean that we can't even compare these; I would even argue that our prior estimates are probably more informative than the data.

But what about the outcomes?

Not only are samples of comparable interventions too small to draw conclusions from, but our estimates of the actual impacts will have a comparable problem. There are only a few dozen countries where you might want to run a country-level anti-corruption effort. Even if we imagine that such an effort increases GDP growth by 1% per year, GDP growth is highly variable year to year, and even if we ran the program everywhere, the sample size isn't enough to control for the key variables and figure out whether it is working. Our estimate of the change isn't ever going to tell us the impact of our work. (I talked about this here, and came to an unsatisfactory conclusion.)
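A rough power calculation illustrates the problem (my sketch; the effect size, growth volatility, and country counts are all assumed for illustration): comparing a couple dozen treated countries against controls, a 1-percentage-point growth effect sits well below what a simple difference-in-means test can reliably detect.

```python
# Power of a two-sample z-test for a growth effect, under assumed numbers.
from math import sqrt
from statistics import NormalDist

effect = 0.01    # assumed true effect: +1 percentage point of annual growth
sd = 0.03        # assumed cross-country std. dev. of growth rates (~3pp)
n = 20           # countries per group (treated and control)

se = sd * sqrt(2 / n)                    # std. error of the difference in means
z_crit = NormalDist().inv_cdf(0.975)     # critical value, two-sided 5% test
power = 1 - NormalDist().cdf(z_crit - effect / se)

print(f"standard error of the estimate: {se:.4f}")
print(f"power to detect the effect: {power:.0%}")  # far below the usual 80%
```

Under these assumptions the noise in the estimate is about as large as the effect itself, so most runs of such a "study" would find nothing even if the intervention worked exactly as hoped.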

That means that we can't conclude afterwards whether the intervention worked. Instead, we need theories of change, surveys of corruption, and second-order estimates of the impact based on those. In short, we won't find out if our work helped. Instead, our feedback mechanism is an estimate that is usually impossible to test empirically, which we compare to our prior estimate of what we thought would happen. As Andrew Gelman said in a related vein, "the data have no direct connection to anything, so if these data are too noisy, the whole thing is essentially useless (except for informing us that, in retrospect, this approach to studying the problem was not helpful)."

If this seems like a hopeless muddle, it gets worse. If we can't see what the effect is, we cannot improve our interventions on that basis. That means that it's hard to selectively sample from the best types of interventions, since we don't know what the best types are.

Gloomy Clouds, with a Ray of Hope

Basically, I would conclude that any attempt to pick new interventions, or to fund intervention types with impacts that are not quantifiable, is fundamentally problematic. Unless we can appeal to the Diamond Law of Evaluation, so that the impact of our work is logically dictated by the intervention, funding this type of work can only be justified on the basis of our ability to predict the outcomes. Unless we have some super-ability to forecast, this is doomed.

In fact, however, we do have some amazing forecasting methods. Unfortunately, they are restricted in scope to events with quantifiable outcomes of the type that could be used as feedback. If we find that markets can predict the replicability of studies, we might be able to say more. Even then, however, it seems unlikely that we'd get precise enough feedback to reliably tell the difference between an intervention that will be very, very impactful and one that is only somewhat impactful.

That doesn't mean that we shouldn't continue to look for effective ways to impact systems that are hard to predict, but it does mean that it's hard to justify claims that most such programs will be nearly as impactful as just giving poor people money.