Hi MHR,
I really appreciate substantive posts like this, thanks!
This response is just speaking for myself, doing rough math on the weekend that I haven’t run by anyone else. Someone (e.g., from @GiveWell) should correct me if I’m wrong, but I think you’re vastly understating the difficulty and cost of running an informative replication given the situation on deworming. (My math below seems intuitively too pessimistic, so I welcome corrections!)
Slide 58 here shows that the minimum detectable effect (MDE) size with 80% power can be approximated as 2.8*the standard error (which is itself effectively inversely proportional to the square root of the sample size).
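For anyone who wants to see where a multiplier like 2.8 can come from, here is a minimal sketch. It assumes the usual normal approximation with a two-sided test at the 5% significance level; I haven't confirmed those are the exact conventions behind the slide, and the function name `mde_multiplier` is just an illustrative choice.

```python
from statistics import NormalDist


def mde_multiplier(alpha: float = 0.05, power: float = 0.80) -> float:
    """Multiple of the standard error that is detectable with the given power.

    Assumes a two-sided z-test (normal approximation); this is an
    assumption about the slide's setup, not something it confirms.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = std_normal.inv_cdf(power)          # quantile for the target power
    return z_alpha + z_power


multiplier = mde_multiplier()
print(f"MDE is about {multiplier:.2f} x standard error")
```

Under those assumptions the multiplier comes out at about 2.80, which matches the rule of thumb above. Asking for 90% power instead (`power=0.90`) raises it, so 2.8 is on the lenient side.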
I didn’t check the original sources, but this GiveWell doc on their deworming replicability adjustment implies that the standard error for log(income/consumption) in the most recent replications is ~.066 (on a “main effect” of .109). The original RCT involved 75 schools, and according to figure A1 here the followup KLPS 4 involved surveying 4,135 participants in the original trial. GiveWell’s most recent cost-effectiveness analysis for Deworm the World makes 2 key adjustments to the main effect from the RCT:
A replicability adjustment of .13 (row 11)
A geography-specific adjustment for worm burden which averages about .12 (row 40) (this is because worm burdens are now much lower than they were at the time of MK)
Together, these adjustments imply that GiveWell projects the per-capita benefit to the people dewormed to be just .13*.12=1.56% of the .109 impact on log income in the late followups to the original Miguel and Kremer RCT. So if we wanted to detect the effect GiveWell expects to see in mass deworming, we’d need an MDE of ~.0017 on log income (.0156*.109), which with 80% power and the formula above (MDE=2.8*standard error) implies we’d need the standard error to be .0017/2.8=~.00061 log points. So a study well powered to detect the effect GiveWell expects would need a standard error roughly 108 times smaller than the standard error (.066) GiveWell calculates on the actual followup RCTs.
But because standard errors are inversely proportional to the square root of sample size, if you used the same study design, getting a 108x smaller standard error would require a 108*108=11,664 times larger sample. I think that might imply a sample of roughly all the elementary schools in India (11,664*75=~875K schools), which would presumably include many schools that do not actually have significant worm burdens.
If the original MK study and one followup cost $1M (which I think is the right order of magnitude but may be too high or too low), then scaling that cost by the 11,664x larger sample implies that a followup powered to find the effect GiveWell expects would cost many billions of dollars (roughly $12 billion at that rate). And of course it would take well over a decade to get the long term followup results here. (That said, it wouldn’t surprise me if I’m getting the math wrong here, so someone please flag if so!)
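To make the chain of arithmetic above easier to check, here is a rough sketch that reruns it end to end. The inputs are the figures quoted in this comment; the variable names are mine, and the assumption that cost scales linearly with sample size is the same crude one I'm making in the text, not something from GiveWell.

```python
# Inputs quoted in the comment above.
main_effect = 0.109        # effect on log income in the late followups
followup_se = 0.066        # standard error on that effect
replicability_adj = 0.13   # GiveWell replicability adjustment
worm_burden_adj = 0.12     # average geography-specific worm burden adjustment
mde_multiplier = 2.8       # MDE = 2.8 * standard error at 80% power
original_schools = 75      # schools in the original RCT
assumed_cost_usd = 1_000_000  # order-of-magnitude guess, not a sourced figure

# Effect GiveWell expects, i.e. the MDE a new study would need.
expected_effect = main_effect * replicability_adj * worm_burden_adj

# Standard error needed to detect that effect with 80% power.
required_se = expected_effect / mde_multiplier

# Standard errors shrink with the square root of sample size,
# so the sample has to grow with the square of the SE ratio.
se_ratio = followup_se / required_se
sample_multiple = se_ratio ** 2

schools_needed = sample_multiple * original_schools
implied_cost_usd = sample_multiple * assumed_cost_usd  # assumes linear scaling

print(f"expected effect: {expected_effect:.4f} log points")
print(f"required SE:     {required_se:.5f} log points")
print(f"SE ratio:        {se_ratio:.0f}x")
print(f"sample multiple: {sample_multiple:,.0f}x")
print(f"schools needed:  {schools_needed:,.0f}")
print(f"implied cost:    ${implied_cost_usd / 1e9:.1f}B")
```

Because this carries unrounded intermediate values, it gives an SE ratio closer to 109x than the 108x I got by rounding along the way, so the sample multiple and school count come out slightly higher. The order of magnitude is unchanged.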
I’m sure there are better study designs than the one I’m implicitly modeling here that could generate more power, or places where worm burdens are still high enough to make this somewhat more economical, but I’m skeptical they can overcome the fundamental difficulty of detecting small effects in cluster RCTs.
I think a totally reasonable reaction to this is to be more skeptical of small cheap interventions, because they’re so hard to study and it’s so easy to end up driven by your priors.
Hi Nicole,
I think this is a cool choice and a good post—thanks for both! I agree with your bottom line that kidney donation can be a good choice for EAs and just wanted to flag a few additional resources and considerations:
These other EA forum posts about the costs and benefits of donation are worth checking out. In my mind the most important update relative to when I donated is that the best long-run studies now suggest a roughly 1 percentage point increase in later-life risk of kidney failure because of donating. I think that translates less than 1:1 to mortality for a variety of reasons (ability to get a transplant, maybe xenotransplantation or other things will be easy in 20-50 years), but that factor probably still swamps the near-term (roughly 1/3,000) risk of death in surgery in the EV calculation.
I think I took ~3 weeks off work to recover from donation (it was also around the holidays for me), and for folks who work in altruistic jobs that time off may dominate the cost calculation. All in, though, 52 hours seems to me like a very low estimate of the expected time cost.
I think people sometimes assume that the original donor gets full counterfactual “credit” for all the steps in a chain. My read of this evidence is that even though average chain length is ~4, the marginal social value of an altruistic donor starting a chain is “only” ~.8-1.7 transplants (depending on blood type) because the relevant counterfactual can be other chains being longer.
I think things like this post are themselves a pretty important channel for impact. I think the impact of my personal donation was dominated by the small influence I had on getting Dylan Matthews to donate, which then had a big knock-on impact because his writing led a number of other people to donate.
Overall, I think these kinds of persuasion considerations can play a weirdly big role in how you evaluate kidney donation, and I don’t have a clear bottom line on which way they cut.