Steelmanning the Case Against Unquantifiable Interventions

There is a strong argument that many of the most important, high-value interventions cannot be robustly quantified. For example, corruption reduction efforts and other policy changes to increase long-term economic growth, thereby saving lives and reducing suffering, are high-importance, plausibly tractable, and relatively neglected areas. Under the ITN (importance, tractability, neglectedness) framework, or almost any other expected-value prioritization approach, this means they should be prioritized highly. I want to briefly outline and steelman the counterargument, that these have very low expected value compared to the best short-term and quantifiable interventions, and pick apart some uncertainties and issues. (This argument is not particularly novel, but I haven’t seen it made clearly before.)

This is not about long-term interventions or existential risk reduction, where a number of different concerns apply and thinking about them is even harder, so I’m not doing so. For the sake of this discussion, this translates into something like assuming a relatively high discount rate, so that the long term doesn’t matter.

Epistemic Status / Notes: When first encountering EA, one of my early reactions was to say that these “soft” / “squishy” policy interventions were underappreciated. (In my defense / clouding my judgement, I was in a public policy PhD program at the time.) On reflection, I have mostly changed my mind, or at least become far less certain, which is what led to this write-up.

I am NOT highly confident in all of these arguments, but I have put in time thinking about them. Lastly, hard-to-quantify and deeply uncertain interventions about existential risk reduction and the long-term future are different in key ways. Thinking about them often makes my head hurt, I don’t have clear conclusions, and I have personal biases, so I’m not going to discuss them here.

Outline

First, to understand the distribution of expected outcomes, I’ll discuss why the best interventions that we know of are orders of magnitude more effective than the average intervention, and why it’s hard to find more.

Second, I’ll review Peter Hurford’s arguments about why we should be wary of “speculative” causes, focusing on the uncertainty and non-quantifiable outcomes rather than his discussion of long-termism and existential risk.

Lastly, I’ll suggest that comparing the expected value of investments is fundamentally intractable, and explain why this would lead to focusing on short-term quantified investments rather than on deeply uncertain or non-quantifiable issues.

The best quantifiable interventions

By now, it is a fairly accepted empirical observation that when considering interventions, the distribution of impact of extant charities is fat-tailed. The items way out in the tail are things like distributing bed-nets, deworming, and giving directly. The question is why these are so good, and why there are so few of them.

Why are the best interventions so great?

Why are these so effective? It takes an unusual combination of factors to make an intervention highly effective. To start, the intervention must address an important issue, and the intervention must either be novel, or the area must be neglected.

To explain why this is true, first note that marginal investment has decreasing returns in most domains. That is, the first million dollars spent on preventing starvation is likely to be far more effective than the thousandth million. Second, for most important problems, billions have already been spent over decades. Even if only a fraction was spent relatively effectively, it will be very unusual for there to be low-hanging fruit, unless a novel approach to address the problem is found. So if a cause is not neglected (relative to the scale of the problem), and the intervention is not novel, it is unlikely that the intervention will be highly effective.
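
To make the decreasing-returns point concrete, here is a minimal sketch. It assumes a purely hypothetical logarithmic benefit curve (the function names and the curve itself are illustrative assumptions, not estimates for any real cause area) and compares the benefit added by the first million dollars with the benefit added by the thousandth.

```python
import math


def total_benefit(millions_spent):
    """Hypothetical concave benefit curve: log(1 + spending in $ millions)."""
    return math.log1p(millions_spent)


def marginal_benefit(k):
    """Benefit added by the k-th million dollars spent (k starts at 1)."""
    return total_benefit(k) - total_benefit(k - 1)


if __name__ == "__main__":
    first = marginal_benefit(1)
    thousandth = marginal_benefit(1000)
    print("first million:     ", first)
    print("thousandth million:", thousandth)
    print("ratio:             ", first / thousandth)
```

Under this assumed curve, the first million does several hundred times as much good as the thousandth. The exact ratio depends entirely on the curve chosen, so only the qualitative pattern should be taken from this.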

This is as true for current effective charities as it is in general: once almost all people affected by malaria have bed nets, and almost everyone susceptible to schistosomiasis is dewormed, these will stop being priorities for effective charity. (This is, of course, a good thing.)

Even for novel interventions in neglected areas, however, we find very few highly effective interventions. Why?

Why aren’t we finding more of these effective interventions?

Even given the insight that we are looking for novel approaches and neglected problems, most charities are comparatively ineffective. While it is easy to forget, it took significant amounts of research across a very broad set of interventions to identify the few that work really well. GiveWell considered 300 charities in 2009, and 400 up to that point, in order to identify what ended up as 10 charities to recommend. Of these, only 1 is still recommended as effective. (And none of the others lost their recommendation because of insufficient room for funding or because the problem was solved.)

The low “hit” rate for very effective interventions is even more noteworthy given the filters that were in place. Of the tens of thousands of programs that are tried, a small fraction are found worthwhile enough to be continued and made into ongoing charities. Of those, most weren’t nominated for review by GiveWell, likely in part because there is little reason to even suspect they are highly effective. Even among those few, many had evidence that pointed to a high likelihood that they were lower-value than the best charities.

This is to be expected. Rossi’s Iron Law of Evaluation states that the expected value of any net impact assessment of any large-scale social program is zero. Even given fat tails, most interventions are flops. If effective interventions are rare, and each one is exploited until the problem is solved and it stops being effective, there should be very few such interventions available at any given time.

The next question is why we find any at all.

Explaining Observed Success

If good interventions are expected to be rare, how do we explain the fact that we do, in fact, find that some interventions are orders of magnitude more effective than others?

I claim that the question has a simple answer: Persi Diaconis and Frederick Mosteller’s law of truly large numbers. With a large enough number of samples, any unlikely thing is likely to be observed. And since we keep trying interventions, and we actively search for those that are effective, and (slowly) abandon those that don’t work, it should be unsurprising that we find some that are relatively far out in the distribution.

A bit more technically, if we’re trying to sample with a bias towards the tail, the tail of the overall distribution doesn’t need to be fat for the observed tail to be fat. What this (tentative) model implies, however, is that even when looking, we’re still unlikely to find interventions that are orders of magnitude more effective than those we currently see.
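
One way to see both halves of this claim is a small simulation. As an illustrative assumption, suppose effectiveness (in arbitrary units) is drawn from a thin-tailed standard normal distribution, and that we only ever notice the best of the interventions tried. The function names and parameters below are hypothetical, and the sketch says nothing about what the real distribution looks like.

```python
import random
import statistics


def best_of_n(n_tried, rng):
    """Largest of n_tried draws from a standard normal distribution."""
    return max(rng.gauss(0.0, 1.0) for _ in range(n_tried))


def mean_best(n_tried, n_trials=200, seed=0):
    """Average, over n_trials repetitions, of the best of n_tried draws."""
    rng = random.Random(seed)
    return statistics.fmean(best_of_n(n_tried, rng) for _ in range(n_trials))


if __name__ == "__main__":
    # Each row: number of interventions tried, average best result found.
    for n in (10, 100, 1000, 10000):
        print(n, round(mean_best(n), 2))
```

Under these assumptions the best observed result sits far above the population average of zero once enough interventions are tried, which is the law of truly large numbers at work. It also grows only slowly as the number tried increases tenfold, which is why further searching should not be expected to turn up something orders of magnitude better.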

Objections to speculative interventions

Peter Hurford notes that speculative interventions are a priori unlikely to be effective, for several reasons related to the above argument which are worth reviewing. (There are other arguments which seem less relevant, and so they will not be discussed.) I will review them somewhat critically, but the arguments are all relevant and support his conclusion.

We Can’t Guess What Works

In addition to the above issue with effective interventions being rare, he points out that people are bad at guessing which interventions are effective, and which are ineffective. People’s inability to guess which programs from among those that are actively tried are helpful is evidence about their ability to predict the success of future programs.

However, I will claim that it is far weaker evidence than it seems. That’s because there is a pre-selection of programs that are evaluated, and programs that most people would expect to be low value are never pursued. You might object that Rossi’s Iron Law applies to interventions that are tried: the expected value of even the interventions that seem plausible is zero.

Still, Rossi’s Iron Law is likely misleading because the sample of attempted interventions is biased as well. The mean is zero, he notes, only after we exclude the most obvious and effective programs. As Rossi’s Zinc Law states, “Only those programs that are likely to fail are evaluated.” I will note that this non-evaluation was probably true in 1978 when the paper was written, but the movement towards cost-effectiveness analysis means that even effective interventions are likely to be tested to see how effective they are. Given that, it is unclear whether the stylized fact / empirical observation of the Iron Law is still correct.

But going a step further, a complete sample would include not only likely effective programs, but obviously effective things, like whether schools teaching a subject, say, biology, is effective at increasing children’s knowledge of that material, or whether distributing food in regions with starving people reduces deaths from starvation. In both cases, people would, presumably, correctly guess the direction of the effect.

I’ll call this the Diamond Law of Evaluation: Programs that are effective in obvious enough ways are not evaluated on that basis. Any evaluation of GiveDirectly is about spillover effects, or effects over time, not about whether giving someone money makes the amount of money they have increase. That means that some proportion of proposed interventions should be expected to work well.

Isn’t there a Track Record of Failures?

If we can predict what works, why is there a track record of failure? As noted above, 90% of the charities initially chosen by GiveWell are no longer recommended. Hurford argues that this means they failed, and that this failure rate should be expected to apply to future programs.

But on review, the track record doesn’t imply these interventions failed, exactly. They were not found to be ineffective or harmful. Instead, most such charities were downgraded in effectiveness, but it seems that none were found to have negative or even near-zero impacts. The track record also shows that focusing on a priori important and neglected areas is a good way to find effective interventions.

So we can conclude that there are a bunch of factors we care about, and they have different influences.

Predicting Effectiveness

To estimate expected value, we need to combine several of the factors discussed above. Before doing that, we should reconsider what we’re estimating.

Cost Effectiveness isn’t Effectiveness

The previous discussion has mostly focused on impacts of interventions, not cost. Thankfully, it turns out that cost is much easier to predict than impact. It’s not exact, but we’re shocked if our estimate of cost is off by an order of magnitude, and we’re only mildly surprised if our estimate of impact is off by a similar factor. This is critical, because it means that uncertainty about impact, not cost, dominates any cost-effectiveness estimate.

If we’re looking at a potential intervention affecting 100,000 people that costs $10 million, the difference between it saving 0.1% of the people and it saving 10% is tremendous: the first case is $100,000 per life saved, a worthwhile but not amazing intervention, while the second is $1,000 per life saved, putting it among the most effective interventions. The difference between it costing $9 million and $11 million, on the other hand, is tiny.
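
The arithmetic can be checked with a small sketch. It assumes 100,000 people affected, which is the population at which a $10 million program yields the $100,000 and $1,000 per-life figures quoted above. The function name and variables are illustrative.

```python
def cost_per_life_saved(total_cost, people_affected, fraction_saved):
    """Dollars spent per life saved for a program of the given size."""
    lives_saved = people_affected * fraction_saved
    return total_cost / lives_saved


PEOPLE_AFFECTED = 100_000  # assumed population, see the note above

if __name__ == "__main__":
    # Varying impact by two orders of magnitude at a fixed $10 million cost.
    for fraction in (0.001, 0.10):
        cost = cost_per_life_saved(10_000_000, PEOPLE_AFFECTED, fraction)
        print(f"saving {fraction:.1%}: ${cost:,.0f} per life")

    # Varying cost between $9 million and $11 million at a fixed 0.1% impact.
    for total in (9_000_000, 11_000_000):
        cost = cost_per_life_saved(total, PEOPLE_AFFECTED, 0.001)
        print(f"costing ${total:,}: ${cost:,.0f} per life")
```

The impact range moves the answer by a factor of 100, while the cost range moves it by roughly 20%.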

This is particularly true when we’re uncertain if an intervention works at all. Impact evaluations might show an improvement of 0.1%, or an improvement of 10%, a difference of two orders of magnitude. They might also show that the impact is −0.1%, which is a much bigger deal, since it means we’re paying money to make things worse. Cost, however, is rarely this uncertain, and it’s not usually negative (it can happen, but that’s not our current focus).

Past Performance Doesn’t Guarantee Future Results

Given the discussion above, we should naively expect that the effectiveness of future interventions is distributed similarly to the effectiveness of past interventions. This isn’t quite correct, because the future interventions we’re looking at are ideas that we think are the most effective, rather than the full set of interventions that get tried.

We do have a comparison class for this, since GiveWell has been giving out Incubation Grants that (in part) fund exactly this sort of intervention that is expected to be effective. Unfortunately, there are only a few data points, so our inference will be weak. Not only that, but the differences in intervention types mean that we can’t even compare these. I would even argue that our prior estimates are probably more informative than the data.

But what about the outcomes?

Not only are samples of comparable interventions too small to draw conclusions from, but our estimates of the actual impacts will have a comparable problem. There are only a few dozen countries where you might want to run a country-level anti-corruption effort. Suppose such an effort increases GDP growth by 1% per year. Annual GDP growth is highly variable, and even if we ran the effort everywhere, the sample size would not be enough to control for the key variables and figure out whether it is working. Our estimate of the change isn’t ever going to tell us the impact of our work. (I talked about this here, and came to an unsatisfactory conclusion.)
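
A rough sketch of the sample-size problem, under loudly stated assumptions: treat the 1% as a one percentage point increase in annual growth, assume (hypothetically) that annual GDP growth varies across countries with a standard deviation of 3 percentage points, and compare 20 countries with the program against 20 without it in a single year, with no covariates. None of these numbers come from data. The calculation is the standard normal approximation for the power of a two-sided test comparing two group means.

```python
import math
from statistics import NormalDist


def power_two_group(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two group means."""
    std_normal = NormalDist()
    # Standard error of the difference between the two group means.
    se = sd * math.sqrt(2.0 / n_per_group)
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    z_effect = effect / se
    # Probability the test statistic falls beyond either critical value.
    upper = 1 - std_normal.cdf(z_crit - z_effect)
    lower = std_normal.cdf(-z_crit - z_effect)
    return upper + lower


if __name__ == "__main__":
    # Assumed inputs: 1 point effect, 3 point SD, 20 countries per group.
    print(power_two_group(effect=1.0, sd=3.0, n_per_group=20))
```

With these made-up inputs, the chance of detecting a real one-point effect comes out a bit under 20%, and a real analysis with confounders and correlated shocks across countries would likely do worse. The point is only that a few dozen noisy observations cannot distinguish a large true effect from none.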

That means that we can’t conclude afterwards whether the intervention worked. Instead, we need theories of change, and surveys of corruption, and second-order estimates of the impact based on that. In short, we won’t find out if our work helped. Instead, our feedback mechanism is based on an estimate that is usually impossible to test empirically, and we compare this estimate to our prior estimate of what we thought would happen. As Andrew Gelman said in a related vein, “the data have no direct connection to anything, so if these data are too noisy, the whole thing is essentially useless (except for informing us that, in retrospect, this approach to studying the problem was not helpful).”

If this seems like a hopeless muddle, it gets worse. If we can’t see what the effect is, we cannot improve our interventions on that basis. That means that it’s hard to selectively sample from the best types of interventions, since we don’t know what the best types are.

Gloomy Clouds, with a Ray of Hope

Basically, I would conclude that any attempt to pick new interventions, or fund intervention types with impacts that are not quantifiable, is fundamentally problematic. Unless we can appeal to the Diamond Law of Evaluation, that the impact of our work is logically dictated by the interventions, funding this type of work can only be justified on the basis of our ability to predict the outcomes. Unless we have some super-ability to forecast, this is doomed.

In fact, however, we do have some amazing forecasting methods. Unfortunately, they are restricted in scope to events with quantifiable outcomes of the type that could be used as feedback. If we find that markets can predict replicability of studies, we might be able to say more. Even then, however, it seems unlikely that we’d get precise enough feedback to reliably tell the difference between an intervention that will be very, very impactful and those that are only somewhat impactful.

That doesn’t mean that we shouldn’t continue to look for effective ways to impact systems that are hard to predict, but it does mean that it’s hard to justify any claims that most such programs will be nearly as impactful as just giving poor people money.