Michael Kremer: How much evidence is enough?

Link post


In this 2016 talk, the Harvard development economist Michael Kremer takes a Bayesian approach to the use of evidence. He argues that we should pursue interventions even if we are not very sure of their effects, but that we also should go on collecting more evidence for longer than we might naively think.

The transcript below is lightly edited for readability.

The Talk

Let me just briefly mention another role that I have, which is Scientific Director for a part of the US Agency for International Development called Development Innovation Ventures. We provide funding, and I know some people out here are involved in initiatives either to pilot a new idea or to rigorously test it, and then, for those efforts that have rigorous evidence of impact and cost-effectiveness or that pass a market test, to try to transition to scale. So I encourage people who are interested and have ideas that they think might be appropriate to visit our website and consider applying. We don’t have positions available right now, but if you’re interested in working in this type of environment, monitor our website as well in case we have positions open.

Just to reinforce one other point that came up, on the importance of developing country governments, to pick up on both Alan’s and Rachel’s comments: the work that the US government and the British government are doing in this area is incredibly important. But the Indian government also made a huge contribution in this area, and that was in part based on the evidence. They brought the treatment of neglected tropical diseases to somewhere between 100 and 200 million more people on a sustained basis, overwhelmingly with Indian government funds. People in this room, Jane Powell, Evidence Action, and many others played a big role in that.

Okay, so putting on my academic hat, there’s a question for effective altruists of at what point do you stop collecting evidence and at what point do you start acting? I would argue that those are two very different questions and that’s part of what I’ll be arguing in this. But also, how much evidence do we need for each of those decisions? So if you think that a goal for effective altruists is maximizing the impact of the resources that they have, you can start by thinking about a really simple case, the case where either something works or it doesn’t work. Then is it worth spending resources on it? Well, you just take the probability of the impact times the value of the impact, and you make the decision based on that.

More generally though, there might be a range of possible impacts. Let’s take this example of deworming. There are multiple studies out there estimating different impacts. You could assign some probability in a particular location to each of those impacts, and then you just sum over each of the possible impacts times the probability of that impact, and then you see if that justifies the use of the resources.
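
As a minimal sketch of that expected-value calculation, here is a short Python example; the impact values, probabilities, and cost are purely hypothetical numbers for illustration, not estimates from any study mentioned in the talk.

```python
# Hypothetical distribution over possible impacts of an intervention,
# e.g. different plausible effect sizes suggested by different studies.
possible_impacts = [0.0, 5.0, 20.0]   # value of each outcome (in welfare or dollar terms)
probabilities    = [0.3, 0.5, 0.2]    # subjective probability assigned to each outcome

# Expected value: sum over outcomes of probability times value.
expected_value = sum(p * v for p, v in zip(probabilities, possible_impacts))

cost = 4.0                            # hypothetical cost of the intervention
print(expected_value)                 # 6.5
print("worth funding" if expected_value > cost else "not worth funding")
```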

Now how do you value that? Well, for utilitarians the value would be in welfare or utility terms, and I’ll explain why that’s important in a second. This formula here looks very much like the formula for an investment decision if you’re a private investor, but there’s one big difference which I’ll get to.

So, a key concept in economics, and for effective altruists in particular, is the diminishing marginal utility of money. If you have a little bit of money you’ll spend it on things that are incredibly important, like having enough food to eat. As you get more and more money, you move down the hierarchy of priorities, to things like getting a bigger-screen TV, for example (this may be a slight oversimplification). And that improves happiness and welfare, but probably not as much as a family that doesn’t have enough food getting enough food.

So, what are the implications of that? Well, one is that if you can transfer resources to the poorest, that’s generally going to improve welfare. But you wouldn’t transfer a huge amount to one poor person to get them to the point where they’re living a life with an extremely large flat-screen TV. You would say let’s move on to helping the next poorest person once we’ve provided a bit of help to one.

Now, for an individual, say I’m considering betting a million dollars, giving up a million dollars in assets for the possibility of winning two million dollars. Well, the formula that I put up earlier, if it’s expressed in dollar terms, might say that I should take that risk. But really it shouldn’t be in dollar terms, it should be in welfare terms. The first million dollars means a lot more to me than the second million dollars, so I won’t actually gamble a million dollars versus two million dollars.
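
To see why the gamble looks different in welfare terms than in dollar terms, here is a toy calculation; the logarithmic utility function, the starting wealth, and the 50/50 odds are all illustrative assumptions, not figures from the talk.

```python
import math

def utility(wealth_millions):
    # A concave utility function (diminishing marginal utility); the log form is illustrative.
    return math.log(wealth_millions)

wealth = 1.5  # hypothetical current assets, in millions of dollars

# The gamble: lose $1M or win $2M, each with probability 0.5.
expected_dollars_take    = 0.5 * (wealth - 1) + 0.5 * (wealth + 2)
expected_utility_take    = 0.5 * utility(wealth - 1) + 0.5 * utility(wealth + 2)
expected_utility_decline = utility(wealth)

print(expected_dollars_take)                              # 2.0: in dollar terms, take the bet
print(expected_utility_take < expected_utility_decline)   # True: in welfare terms, decline it
```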

On the other hand, if you’re thinking about a situation with effective altruists and transfers, then it’s quite different, because you’re not going to be moving down that curve of diminishing marginal utility very far. A 50% chance of a $10-per-person benefit for 100 people is roughly equivalent to a 100% chance of a $10 benefit for 50 people, or a 100% chance of a $5 benefit per person for 100 people. These are all roughly equal because poor people’s overall lifetime income is not going to be that affected. They’ll be better off, but diminishing marginal utility won’t really kick in at that level.
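
Here is the same kind of calculation for small per-person transfers, using a hypothetical baseline income and the same illustrative log utility; at these stakes the curvature barely matters, so the three options come out roughly equal.

```python
import math

def utility(income):
    # Same illustrative concave utility as above.
    return math.log(income)

baseline = 1000.0   # hypothetical annual income of a poor recipient, in dollars

def gain(transfer):
    # Welfare gain to one person from receiving `transfer` dollars.
    return utility(baseline + transfer) - utility(baseline)

option_a = 0.5 * 100 * gain(10)   # 50% chance of $10 each for 100 people
option_b = 1.0 * 50 * gain(10)    # 100% chance of $10 each for 50 people
option_c = 1.0 * 100 * gain(5)    # 100% chance of $5 each for 100 people

print(round(option_a, 3), round(option_b, 3), round(option_c, 3))
# All roughly 0.50: diminishing marginal utility barely bites at these stakes.
```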

So that’s an implication of basic economic theory for effective altruism, and I’ll come back to that in a second on this question of evidence. One other implication of diminishing marginal utility is that if you’re thinking about how to spend your money and how to give away your money, there’s a big cost to waiting. Why? Because right now poverty is diminishing at a rapid rate. There’s huge economic growth in many low-income countries. That means that spending your money now while people are still poor has an advantage over waiting. And obviously you can earn some money by keeping your money in the bank, but the interest rate probably won’t make up for that.
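
A rough sketch of that waiting trade-off, under entirely hypothetical growth and interest rates: the dollar grows in the bank, but the recipient’s income grows too, so the marginal welfare value of the later, larger gift can still be lower.

```python
interest_rate = 0.03     # hypothetical return on keeping the money in the bank
income_growth = 0.05     # hypothetical income growth rate of poor recipients
years         = 25
baseline      = 1000.0   # hypothetical recipient income today, in dollars

def marginal_utility(income):
    # With log utility, the marginal welfare value of a dollar is 1 / income.
    return 1.0 / income

value_of_giving_now   = 1.0 * marginal_utility(baseline)
value_of_giving_later = (1 + interest_rate) ** years * \
                        marginal_utility(baseline * (1 + income_growth) ** years)

print(value_of_giving_now > value_of_giving_later)   # True under these assumptions: give sooner
```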

So, a general argument is that if you’re a standard utilitarian, which I understand many of the philosophers here are, then you should be pretty close to risk neutral in thinking about alternative investments in transferring to the poor. And by the way, there are all sorts of other areas that effective altruists are interested in, such as existential risks, etc. I’m not going to comment on those; I’m going to focus on extreme poverty, as that’s the area that I know. And within that I’m going to focus on a few examples, which are also examples I know, so I don’t want to argue that there aren’t other very important causes.

Let me just note here that if you’re not a utilitarian, maybe you have some other approach and you really don’t like risk, or you really don’t like ambiguity, which some people distinguish from risk. Well, then you might want to support a charity like GiveDirectly, which I think is just transferring money to poor people. It’s very hard to see how that could go wrong, and you can be quite confident in it. Or if you’re very pessimistic about the other alternatives, say you’re thinking about a general policy change and you think a lot of development spending is spent badly, you might say it’s better to go with something like GiveDirectly.

So let me just go back to this formula summing up the probabilities of different outcomes times the values of those outcomes. What are the implications of that? Well one implication is that effective altruists I think recognize that high risk but very high payoff investments may be very reasonable, may be worth investing in. So if there are existential risks maybe it’s worth investing in addressing those. I’m not qualified to say whether they are or not but conceptually it makes sense.

Another implication is that there might be high-risk investments that are not of the existential sort. Say you’ve got a candidate HIV vaccine which is very unlikely to work, but the payoff from having an HIV vaccine would be huge; it may be worth investing in that. And I think those are things that the community generally recognizes. However, and I don’t want to claim this is true of everybody, I think in the evidence-oriented community there may not be a full appreciation for another implication of this, which is that if you’re looking at moderate probabilities rather than extremely low ones, you should be using the same formula.

So if you can generate $1 for the poor with 99% probability, which I think you can through GiveDirectly or others, or if you can generate $1.25 for the poor with probability 0.8, well you might want to take the alternative that involves some risk. I’m not going to make the judgment here about what those probabilities are, but if you’re really serious about this then each person may have to evaluate what they think those probabilities are and what the values are.
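
The arithmetic in that comparison is simple; here it is with the talk’s numbers, just to make the expected values explicit.

```python
safe  = 0.99 * 1.00   # $1 for the poor with 99% probability  -> expected $0.99
risky = 0.80 * 1.25   # $1.25 for the poor with 80% probability -> expected $1.00

print(safe, risky)    # 0.99 1.0: the riskier option has the slightly higher expected value
```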

You need to use all the relevant information that you can in estimating both the values and the probabilities. So let me just give an example, the example of deworming. If you think about mass drug administration for worms in a population, what type of evidence might you consider?

Well, there’s RCT evidence on mass drug administration. I’ve done some academic work on this, including most recently a new meta-analysis of multiple studies, which is on my website (just to plug that a little bit). But you wouldn’t only consider that. You’d also consider that there’s RCT evidence on deworming people who are known to be infected. And surely, with any reasonable model of the world, the evidence on the impact of deworming people who are known to be infected is relevant in thinking about the impact of treating a population in which, say, half the people are infected. And I think even the critics of mass drug administration would acknowledge that there’s an impact on the people who are infected, that they should be treated, and that it’s highly cost-effective.

But you’d also think about evidence from non-randomized studies that have to address potential confounding factors. If it’s a bad non-randomized study, you wouldn’t want to think about it. But if it does a good job of addressing confounding factors then it should enter into the analysis as well. I’m not saying necessarily with the same weight as a randomized trial, but it should be part of your thought process.

And you should also think about the underlying science. I’ll give an example. The underlying science would definitely suggest that in populations where there are more people with worms the benefits are likely to be greater than in populations where there are fewer people with worms. Why is that relevant? Well, think about policy, think about the people at the World Health Organization. They recommend mass drug administration and what they have come up with is a 20% prevalence threshold for recommending mass deworming. If it’s below that they think it’s not worth it. If it’s above that they think it is worth it.

Now, when they’re making that decision, there’s just not sufficient evidence to know for sure that 20% is exactly the right cut-off. That was a judgment call, and they have to take into account the evidence available now so they can make recommendations. A lot of developing countries will go with their recommendation and follow it, so they’ve got to make some choice. Even if they want more evidence, they’ve got to recommend something now, and the approach which makes sense is to ask, based on the existing evidence, what should we do.

Now another question (and I would argue a very different question) is: should you invest in collecting additional evidence? Well, there I think you should apply the same approach, and that’s what standard decision theory would suggest. You invest in additional evidence if the cost of collecting that evidence is less than the expected value of the evidence. What’s the expected value of the evidence? There might be a pure science value, and I don’t want to dismiss that, but from a utilitarian perspective it would be the probability that the additional evidence will change the future decision, times the expected value of changing the decision if it does so.
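
Here is a minimal sketch of that value-of-information rule; the probability of changing the decision, the stakes, and the study cost are hypothetical placeholders, not numbers from the talk.

```python
# Probability that the additional evidence would actually change the decision
# (e.g. flip us from scaling a programme up to not scaling it, or vice versa).
p_change_decision = 0.15

# Expected value of making the better decision, conditional on the evidence changing it
# (e.g. welfare gained by not scaling a programme that turns out not to work).
value_if_changed = 2_000_000.0

study_cost = 150_000.0

expected_value_of_evidence = p_change_decision * value_if_changed
collect_more_evidence = expected_value_of_evidence > study_cost

print(expected_value_of_evidence, collect_more_evidence)   # 300000.0 True
```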

So, to be a little bit more concrete, (and there are some assumptions behind this, which I won’t go into detail about here), imagine that there’s a very low probability something’s going to work, let’s say a perpetual motion machine. We have strong reason to believe that it won’t work. Then it doesn’t make sense to invest a lot of resources in building the perpetual motion machine on a mass scale and having a factory to build that. But also it’s so low a probability that even though it would be great to have a perpetual motion machine, it probably doesn’t make sense to invest millions of dollars in research for it.

Now let’s say it’s a somewhat higher probability, still low but somewhat higher. Let’s say there’s a new approach to teaching math or language for which there’s no evidence yet, but on theoretical grounds we think it might be pretty effective. Well, it might not yet be time to implement that at scale, because perhaps there are negative effects of doing so. But it probably does make sense to do some sort of staged investment in evidence. The first stage probably wouldn’t be a full RCT; it would probably be to pilot it in a couple of schools, see what the reaction is, and see whether it even looks feasible or appropriate to go on to the next stage. Then, eventually, you might decide you’re going to have a full-scale RCT.

Let’s say it’s a higher probability. I’m going to use flossing as an example here because I just saw a newspaper article about it. The article pointed out (and I know nothing about flossing, so let me be clear about that) that the RCT evidence for it is not strong, but dentists continue to recommend it. Well, my guess, and I haven’t investigated this, is that the dentists probably know what they’re talking about and have some reason to believe it. So probably the best thing to do for your health is to continue flossing for now. But we should probably also collect additional evidence.

But let me talk about something on the other side of that line. Let’s say there is evidence for it, enough for a new drug to get past the FDA. Then probably if you have the disease you should consider taking the drug, or take the drug if it’s recommended for other medical reasons.

But does that mean we should stop collecting evidence? Almost certainly not. We might not need to do more randomized trials, or maybe we should do additional randomized trials even though it’s already passed the FDA, but certainly we should be collecting some data on the impact of this drug. Could it have side effects that the FDA missed in its initial trials? Certainly. So it’s worth investing in additional evidence in many of these cases.

So even when, based on the current evidence, it makes sense to go ahead, if we’re not very, very certain it may well be worth getting additional evidence, particularly if that can be done cheaply. If there’s a super-high probability of something, and it’s costly enough to do tests, then it’s no longer worth collecting more evidence. But what are the cut-offs? I’ve used these vague terms of extremely low, low, high, extremely high. The cut-offs will depend on the cost of the evidence collection. In particular, to be technical, they’ll depend on the option values associated with the stages of experimentation. In some cases you might be able to experiment very cheaply, and then you have to think about the chance that it’s going to pass this stage; and if it passes this stage, what’s the chance of going on, finishing it, and passing the next stage, and so on.
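
One way to see the option value of staged experimentation is to fold the later stages back: each stage is only paid for if the earlier stages pass, and you abandon the project if continuing ever looks unattractive. The stage costs, pass probabilities, and payoff below are hypothetical.

```python
# Hypothetical stages: (cost of this stage, probability it "passes" and we continue).
stages = [
    (10_000,    0.5),   # small pilot in a couple of schools
    (100_000,   0.4),   # larger pilot
    (1_000_000, 0.5),   # full-scale RCT
]
payoff_if_all_pass = 20_000_000   # hypothetical welfare value of scaling a proven programme

# Work backwards: the value of entering a stage is the expected value of what follows,
# minus that stage's cost. We only ever pay for a stage we actually reach, and we would
# abandon rather than proceed if continuing ever had negative expected value.
value = payoff_if_all_pass
for cost, p_pass in reversed(stages):
    value = max(0.0, p_pass * value - cost)

print(value)   # 1740000.0: expected value of starting the first, cheap pilot (made-up numbers)
```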

Let me contrast this whole approach with one which I think is common in some communities. And while I think it’s less common in the effective altruism community, I don’t think it’s entirely absent. This approach is one that sort of artificially discretizes these continuous probabilities into two areas – an area where something is unproven and we say that we can’t invest in it because it’s unproven, and we keep doing more research while it’s still in the unproven category. And then there’s a black and white transition to another category where it’s proven and you stop doing research.

That’s not what would come out of a decision theory approach. You would have different cut-offs for when you implement something. You’d act now based on the best available evidence, but you would continue to collect additional evidence for as long as enough residual uncertainty remains that there’s a chance you’re going to change your mind. And that, I think, is an appropriate approach.

On the technical side, what are some implications for how we analyze evidence? I think where possible we should use a Bayesian approach, and that could be hard to do but I think it’s worth the effort. If you don’t have a Bayesian approach to statistics, if you have a frequentist approach to statistics, then you need to consider the power of tests.

So often you hear about 5% significance levels. People want to know that if they find an effect, there’s only a 5% chance that it’s not really there. But we should be thinking more broadly than that. We should be thinking about the power of a test, and in particular the power against a specific hypothesis, one that involves a cost-benefit analysis: the power against the hypothesis that this is not cost-effective. One way to put this is that we need to consider the cost of a false negative as well as the cost of a false positive.
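
As an illustration of checking power against the effect size that matters for cost-effectiveness, here is a quick sketch using a normal approximation; the sample size, significance level, and cost-effectiveness threshold are all hypothetical.

```python
import math
from scipy.stats import norm

# Hypothetical two-arm trial, outcome standardised to sd = 1.
n_per_arm = 100
sd = 1.0
alpha = 0.05   # one-sided significance level

# Suppose a cost-benefit analysis says the programme is worth funding if the
# true effect is at least 0.2 standard deviations (a hypothetical threshold).
min_cost_effective_effect = 0.2

se = sd * math.sqrt(2 / n_per_arm)
power = 1 - norm.cdf(norm.ppf(1 - alpha) - min_cost_effective_effect / se)

print(round(power, 2))   # ~0.41: a "null" result here is weak evidence against cost-effectiveness
```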

Now all that said, I’m a big proponent of randomized trials, I think I played an important role in bringing these into development economics. So you might ask why I’m now saying that we should consider all types of evidence. Does this just throw all rigor out the window? Well I think it’s important to ask what is the reason to have perhaps a special role for randomized trials and for evidence.

Well, one important reason is to combat biases that we would otherwise be subject to. There are psychological biases: people, for a variety of reasons, might be inclined to support the activity with the picture of the cute cat or the cute animal, as opposed to the animal that turns out to be super intelligent but not so cute. There are institutional biases: if there are organizations out there whose survival as organizations and whose continued employment of their staff depend on the promotion of a particular cause, they’re going to promote that cause. It’s important to recognize that those biases may exist. Certainly I think that’s true for myself.

I think what can be particularly pernicious about these biases in the development space is that there’s a competitive market for donor funds and people are competing. And how do they compete? By trying to make appeals to people, and these are situations where it’s very hard for people to evaluate those appeals. It’s not a consumer product that people are using every day. It’s not even a politician whose track record you can look at—with a mayor, say, is the garbage being picked up or is the garbage not being picked up.

In the market for aid organizations that raise money in the developed world for the developing world, they are dealing with relatively uninformed consumers. And they face competitive pressure to make these emotional appeals that exploit psychological biases. So it does make sense to have some procedural safeguards, and I think that’s one of the reasons, along with the fact that you just get better evidence from them, to put some special weight on randomized trials.

But it’s important to recognize that that’s one reason to have these, and that the fundamental decision should be made based on the full set of evidence, where the better evidence should get weighted more highly. But in some cases we just don’t have the better evidence. It’s very hard to do a randomized trial. We’ll never have a randomized trial on Brexit, for example, but you should go with your best available information.

It’s also important to recognize that there’s a cost of delay. As I mentioned earlier, in the development context a lot of problems are getting better over time. So if you’re concerned about neglected tropical diseases, well, 25 years from now we expect there are going to be a lot fewer worms for a whole host of reasons: more people are going to be wearing shoes, etc. Finally, it’s important to recognize that if you wind up going with the safe alternative, the thing with the 99% probability of a return of 1 versus the thing with a 70% probability of a return of 10, you’re potentially giving up a lot of welfare.

Let me just talk about a particular case. Diarrheal disease is a major killer of children in the developing world. The NGO Evidence Action is providing water treatment to millions of people through chlorine dispensers. It’s extremely cheap. A bottle of bleach you buy at the drugstore has enough chlorine in it to treat 70,000 households. So it’s very, very cheap, and the estimated cost per life saved or DALY saved is quite low. And what’s more, a substantial fraction of that cost is covered by carbon credits. So that implies a manyfold return on investment with the conventional valuation of a DALY. This number I’m just putting out for illustrative purposes (I need to do more work to try to come up with it), but perhaps something on the order of 10 to 1.

What’s the evidence in this case? Let me start with the non-randomized evidence. Chlorine kills the bacteria that cause the key types of diarrheal disease. There’s historical evidence: as US cities introduced water treatment you saw reductions in mortality, very closely timed with the introduction of water treatment. There’s RCT evidence as well on the impact of cleaner water: many RCT studies on the effect of chlorination find that mothers report child diarrhea went down considerably.

But I don’t want to claim there’s no uncertainty; there is uncertainty. There are many reasons for it, but let me focus on one: there could be reporting bias. Maybe the mothers in the treatment group are reporting a fall in diarrhea, but there wasn’t really a fall, or not as big a fall as was suggested. One thing you could theoretically do would be to run blinded trials, but it turns out there are just very few of them, and the results are pretty inconclusive. They’re often conducted in places with very low diarrhea rates, 1 or 2%, where it’s very hard to get statistical power, and there’s confounding with other interventions.

And so then you have to think, okay maybe the complete gold standard blinded trial isn’t there, but we do have a bunch of RCTs. You have to think to yourself well how much do we think the reporting is biased and what direction does it go. The reporting bias could go either way—it could easily lead to an underestimate of the treatment effect or it could lead to an overestimate of the treatment effect. There are theories and evidence on reporting bias and it could go in both directions.

So what should you do? Well, you should think about adjusting for this reporting bias by the mothers. There are other concerns too, about the mapping from diarrhea to mortality. And I’m biased on this: I was involved in research on this topic. I’ve been involved in multiple things, and some of them I get convinced by and some of them I don’t, so I would argue that somewhat limits the bias. But I’ve written on this, so I may well have a bias here.

But there’s also organizational bias you need to think through. I think much more important than the question of whether chlorination reduces diarrhea is the question of whether the organization is doing a good job delivering it, what the real cost is, etc. I’ve been trying to look into these things a little bit myself, to think about how we should direct our own charitable contributions. I want to collect more evidence on this. My current subjective estimate might be about 0.7 times that 10 to 1 ratio, and most of that discount is not because of any concern other than my own bias. These are very subjective numbers, but I would knock off 20 or 25% for my own bias, so that might take me from a 10 to 1 to roughly a 7 to 1 benefit-cost ratio.

My wife and I are thinking about how to donate our contributions. Evidence Action is something that we currently intend to donate to. We’re still in the stage of collecting and analyzing evidence, not by doing more RCTs but by looking at the numbers from Evidence Action, etc., but that’s our current inclination. I will stress again that there are many other things out there, and I don’t claim to have looked at all the alternatives and done a perfect analysis of them.

Okay, what’s the conclusion here? I think it makes sense for effective altruists to give according to the expected value, the probability of the impact times the value of the impact. There are costs of delaying until you get perfect evidence. There are costs of being so ambiguity averse that you wait for perfect proof. It’s therefore important to take all the evidence you have, put it together, and come up with some estimate of the probability and some estimate of the value. That’s not going to be perfect, and we should recognize that.

And then when should we collect additional evidence? Well, that’s the same criterion: do the expected benefits of the additional evidence outweigh the costs? I would argue that typically that’s going to mean you keep collecting evidence even long after you’ve decided that, based on current evidence, you’re going to take an action. So thank you very much.
