I really appreciate substantive posts like this, thanks!
This response is just speaking for myself, doing rough math on the weekend that I haven’t run by anyone else. Someone (e.g., from @GiveWell) should correct me if I’m wrong, but I think you’re vastly understating the difficulty and cost of running an informative replication given the situation on deworming. (My math below seems intuitively too pessimistic, so I welcome corrections!)
If you look at slide 58 here, the minimum detectable effect (MDE) with 80% power can be approximated as 2.8 times the standard error (which is itself effectively inversely proportional to the square root of the sample size).
I didn’t check the original sources, but this GiveWell doc on their deworming replicability adjustment implies that the standard error for log(income/consumption) in the most recent replications is ~.066 (on a “main effect” of .109). The original RCT involved 75 schools, and according to figure A1 here the followup KLPS 4 involved surveying 4,135 participants in the original trial. GiveWell’s most recent cost-effectiveness analysis for Deworm the World makes 2 key adjustments to the main effect from the RCT:
A replicability adjustment of .13 (row 11)
A geography-specific adjustment for worm burden which averages about .12 (row 40) (this is because worm burdens are now much lower than they were at the time of MK)
Together, these adjustments imply that GiveWell projects the per-capita benefit to the people dewormed to be just .13*.12=1.56% of the .109 impact on log income in the late followups to the original Miguel and Kremer RCT. So if we wanted to detect the effect GiveWell expects to see in mass deworming, we’d have an MDE of ~.0017 on log income, which with 80% power and the formula above (MDE=2.8*standard error) implies we’d need the standard error to be .0017/2.8=~.00061 log points. So a well-powered study to get the effect GiveWell expects would need a standard error roughly 108 times smaller than the standard error (.066) GiveWell calculates on the actual followup RCTs.
But because standard errors are inversely proportional to the square root of sample size, if you used the same study design, getting a 108x smaller standard error would require a 108*108=11,664 times larger sample. I think that might imply a sample size of ~all the elementary schools in India (11,664*75=874K), which would presumably include many schools that do not in fact actually have significant worm burdens.
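To make that arithmetic explicit, here's a rough sketch of the back-of-envelope calculation above (assuming the standard error scales purely with 1/sqrt(sample size) and ignoring clustering and design effects, so the specific numbers are illustrative only):

```python
# Back-of-envelope power calculation, assuming SE ~ 1/sqrt(n) and no design effects.
mk_effect = 0.109          # log-income effect in the long-run MK follow-ups
mk_se = 0.066              # standard error on that effect (per GiveWell's doc)
replicability_adj = 0.13   # GiveWell's replicability adjustment
worm_burden_adj = 0.12     # average geography-specific worm burden adjustment
original_schools = 75      # schools in the original MK trial

expected_effect = mk_effect * replicability_adj * worm_burden_adj  # ~0.0017
required_se = expected_effect / 2.8                                # MDE = 2.8 * SE -> ~0.00061
se_ratio = mk_se / required_se                                     # ~108x smaller SE needed
sample_multiplier = se_ratio ** 2                                  # ~11,664x larger sample
implied_schools = original_schools * sample_multiplier             # ~875,000 schools

print(f"expected effect:   {expected_effect:.4f}")
print(f"required SE:       {required_se:.5f}")
print(f"SE shrink factor:  {se_ratio:.0f}x")
print(f"sample multiplier: {sample_multiplier:,.0f}x")
print(f"implied # schools: {implied_schools:,.0f}")
```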
If the original MK study and one followup cost $1M (which I think is the right order of magnitude but may be too high or too low), this implies that a followup powered to find the effect GiveWell expects would cost many billions of dollars. And of course it would take well over a decade to get the long term followup results here. (That said, it wouldn’t surprise me if I’m getting the math wrong here—someone please flag if so!)
I’m sure there are better study designs than the one I’m implicitly modeling here that could generate more power, or places where worm burdens are still high enough to make this somewhat more economical, but I’m skeptical they can overcome the fundamental difficulty of detecting small effects in cluster RCTs.
I think a totally reasonable reaction to this is to be more skeptical of small cheap interventions, because they’re so hard to study and it’s so easy to end up driven by your priors.
Thanks so much for taking the time to read the post and for really engaging with it. I very much appreciate your comment and think it makes some really good points. But based on my understanding of what you wrote, I'm not sure I agree with your conclusion. In particular, thinking in terms of the minimum detectable effect can be a helpful shorthand, but I think it's misleading more than it's helping in this case. We don't really care about reaching statistical significance at p < 0.05 in a replication, especially given that the primary effects in Hamory et al. (2021) weren't significant at that level. Rather, we care about the magnitude of the update we'd make in response to new trial data.
To give a sense of why that's so different, I want to start with an oversimplified example. Consider two well-calibrated normal priors, one with a mean effect of 10 and standard deviation of 0.5, and one with a mean effect of 0.2 and the same standard deviation. By the simplified MDE criterion, detecting the effect at p < 0.05 80% of the time would require a trial with a standard error of 3.5 in the first case and a standard error of 0.07 in the second case. But if new trial data came in with a given standard error and a given difference between its estimate and our prior mean, we would update our estimate of the mean by the same amount in both cases. (The situation for deworming is more complex because the prior distribution is probably truncated at around zero. But I think the basic concept still holds, in that the sample size required to keep the same value of new information wouldn't grow as fast as the sample size required to keep the same statistical power.)
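As a minimal sketch of that point (with hypothetical numbers, treating everything as normal and ignoring the truncation issue above): in a normal-normal model, the size of the posterior update depends only on the prior SD, the trial's standard error, and the gap between the trial estimate and the prior mean, not on the absolute size of the prior mean:

```python
# Conjugate normal-normal update with known variance: the shift in the posterior
# mean depends on the prior SD, the trial SE, and the gap between the trial
# estimate and the prior mean -- not on the prior mean itself.
def posterior_shift(prior_sd, trial_se, gap):
    """How far the posterior mean moves away from the prior mean."""
    weight = prior_sd**2 / (prior_sd**2 + trial_se**2)
    return weight * gap

prior_sd = 0.5
trial_se = 0.3   # hypothetical trial precision
gap = 0.2        # trial estimate minus prior mean

# The shift is identical whether the prior mean is 10 or 0.2.
print(posterior_shift(prior_sd, trial_se, gap))
```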
Therefore, I don’t think the required sample size is likely to be nearly as big as you estimated in order to get a valuable update to GiveWell’s current cost-effectiveness estimate. However, your point is clearly correct in that the sample size will need to increase to handle the worm burden effect. That was something I hadn’t thought about in the original post, so I really appreciate you bringing it up in your comment. According to GiveWell, the highest-worm-burden regions in which Deworm the World operates (Kenya and Ogun State, Nigeria) have a worm burden adjustment of 20.5%. A replication trial would likely need to be substantially larger to account for that lower burden, but I don’t think that increase would be prohibitively large.
Regarding the replicability adjustment, I'm not sure it implies that a larger sample size would be needed to make a substantial update based on new trial data (separate from the larger sample needed to handle the worm burden effect). The replicability adjustment was arrived at by starting with a prior based on short-term effect data and performing a Bayesian update based on the Miguel and Kremer follow-up results. If a new follow-up study had the same statistical power as M&K, the two could be pooled to make the update and should be given equal weight.
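As an illustration (a sketch with made-up numbers, not GiveWell's actual model), simple inverse-variance pooling gives two equally precise estimates equal weight and shrinks the pooled standard error by a factor of sqrt(2):

```python
# Inverse-variance (precision-weighted) pooling of independent estimates.
def pool(estimates, ses):
    weights = [1 / se**2 for se in ses]
    pooled_mean = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = (1 / sum(weights)) ** 0.5
    return pooled_mean, pooled_se

# Hypothetical: the MK follow-up estimate pooled with an equally precise
# replication that came in lower (0.05 is made up for illustration).
print(pool([0.109, 0.05], [0.066, 0.066]))  # mean = simple average, SE ~0.047
```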
Thinking about it qualitatively, if a replication trial showed a similar or greater effect size than Hamory et al. (2021) after accounting for the difference in worm burden, I think that would imply a strong update away from GiveWell's current replicability adjustment of 0.13. In fact, it might even suggest that deworming worked via a mechanism other than the ones considered in the analysis underlying GiveWell's adjustment. On the flip side, I don't think GiveWell would be recommending deworming if the Miguel and Kremer follow-ups had found a point estimate of zero for the relevant effect sizes (the entire cost-effectiveness model starts with the Hamory et al. numbers and adjusts them). So if a replication study came in with a negative point estimate for the effect size, GiveWell should probably update noticeably towards zero.
Zooming out, I think that information on deworming’s effectiveness in the presence of current worm burdens and health conditions would be very valuable. GiveWell has done an admirable job of trying to extrapolate from the Miguel and Kremer trial and its follow-ups to a bunch of extremely different environments, but they’re changing the point estimate by a factor of ~66 in doing so. To me, that implies that there’s really tremendous uncertainty here, and that even imperfect evidence in the current environment would be very useful. Since deworming is so cheap, I’m particularly worried about the case where it’s noticeably more effective than GiveWell is currently estimating, in which case EA donors would be leaving a big opportunity to do good on the table.
Thank you again for taking the time to read the post!
Thanks MHR. I agree that one shouldn't need to insist on statistical significance, but if GiveWell thinks the actual expected effect is ~12% of the MK result, then updating on a trial powered similarly to MK is almost like updating on a coin flip, because of how underpowered you are to detect the expected effect.
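To put a rough number on how underpowered that would be (a sketch assuming a simple two-sided test at MK-follow-up precision and using only the ~12% scaling; illustrative only):

```python
# Rough power check: expected effect of ~12% of the MK result, tested with
# MK-follow-up-level precision (SE ~0.066), two-sided at p < 0.05.
from scipy.stats import norm

expected_effect = 0.12 * 0.109   # ~0.013
se = 0.066
z = expected_effect / se
power = norm.cdf(z - 1.96) + norm.cdf(-z - 1.96)
print(f"power: {power:.1%}")     # ~5-6%, barely above the 5% significance level
```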
I agree it would be useful to do this in a more formal Bayesian framework that accurately characterizes the GW priors. It wouldn't surprise me if one of the conclusions was that I'm misinterpreting GiveWell's current views, or that it's hard to articulate a formal prior that gets you from the MK results to GiveWell's current views.
I think your estimate of how costly it would be to run a replication study is too pessimistic. In addition to the issues that MHR identified, it strikes me as unrealistic that the cost of rerunning the data collection would be more than 10,000 times as high as the cost of the original research project. I think this is highly unlikely because data collection usually accounts for at most 10% of the cost of research. Moreover, the cost of data collection does not scale linearly with the number of participants, but with the number of researchers who are paid to coordinate data collection. The most difficult parts of organizing data collection, such as developing the strategy and establishing contact with high-ranking relevant officials, only have to be done once. There are also economies of scale: once you can collect data from 1 school, it is very little effort to replicate the process with 100 or 1,000 schools, and that work can then be done by local volunteers with minimal training, for minimal pay or free of charge. It certainly won't require 10,000 times as many professors, postdocs, and graduate students as the original study, and it is almost exclusively the salaries of those people that make research expensive. On the contrary, collecting more data for an already-designed study with an existing data analysis pipeline requires minimal work from the scientists themselves, which makes it much less expensive. Therefore, I think the cost of data collection was probably only ~10% of the cost of the research project and would scale only logarithmically with the sample size. Based on that line of reasoning, I believe the replication study could be conducted for one or a few million dollars.