Thanks so much for taking the time to read the post and for really engaging with it. I very much appreciate your comment and I think there are some really good points in it. But based on my understanding of what you wrote, I’m not sure I currently agree with your conclusion. In particular, I think that thinking in terms of minimum detectable effect (MDE) can be a helpful shorthand, but it may be more misleading than helpful in this case. We don’t really care about getting statistical significance at p < 0.05 in a replication, especially given that the primary effects seen in Hamory et al. (2021) weren’t significant at that level. Rather, we care about the magnitude of the update we’d make in response to new trial data.
To give a sense of why that’s so different, I want to start off with an oversimplified example. Consider two well-calibrated normal priors, one with a mean effect of 10 and standard deviation of 0.5, and one with a mean effect of 0.2 and the same standard deviation. By the simplified MDE criterion, a trial would need a standard error of about 3.5 to detect an effect equal to the prior mean at p < 0.05 80% of the time in the first case, and a standard error of about 0.07 in the second case. But for a trial with a given standard error and a given difference between its point estimate and our prior mean, we would update our estimate of the mean by the same amount in both cases, because the size of the update depends on the prior’s standard deviation and the trial’s standard error, not on the prior mean. (The situation for deworming is more complex because the prior distribution is probably truncated at around zero. But I think the basic concept still holds, in that the sample size required to keep the same value of new information wouldn’t grow as fast as the sample size required to keep the same statistical power.)
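To make the toy example above concrete, here is a minimal Python sketch. It assumes a plain normal-normal conjugate update with no truncation at zero, and the trial standard error (1.0) and the gap between the trial estimate and the prior mean (0.3) are made-up numbers chosen purely for illustration. The function names are my own.

```python
from statistics import NormalDist


def mde_standard_error(effect, alpha=0.05, power=0.80):
    """Standard error at which a two-sided test at `alpha` detects `effect` with the given power."""
    std_normal = NormalDist()
    # Sum of the critical value and the power quantile, roughly 1.96 + 0.84 = 2.8.
    multiplier = std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)
    return effect / multiplier


def posterior_mean(prior_mean, prior_sd, trial_estimate, trial_se):
    """Normal-normal conjugate update: a precision-weighted average of prior and trial."""
    weight_on_trial = prior_sd**2 / (prior_sd**2 + trial_se**2)
    return prior_mean + weight_on_trial * (trial_estimate - prior_mean)


PRIOR_SD = 0.5   # from the example above
TRIAL_SE = 1.0   # hypothetical trial standard error
SURPRISE = 0.3   # hypothetical gap between trial estimate and prior mean

for prior_mean in (10.0, 0.2):
    required_se = mde_standard_error(prior_mean)
    shift = posterior_mean(prior_mean, PRIOR_SD, prior_mean + SURPRISE, TRIAL_SE) - prior_mean
    print(f"prior mean {prior_mean}: MDE-implied SE {required_se:.3f}, shift in mean {shift:.3f}")
```

With these numbers the weight on the trial is 0.25 / 1.25 = 0.2, so the mean moves by 0.06 under either prior, even though the MDE-implied standard errors differ by a factor of 50.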
Therefore, I don’t think the required sample size is likely to be nearly as big as you estimated in order to get a valuable update to GiveWell’s current cost-effectiveness estimate. However, your point is clearly correct in that the sample size will need to increase to handle the worm burden effect. That was something I hadn’t thought about in the original post, so I really appreciate you bringing it up in your comment. According to GiveWell, the highest-worm-burden regions in which Deworm the World operates (Kenya and Ogun State, Nigeria) have a worm burden adjustment of 20.5%. A replication trial would likely need to be substantially larger to account for that lower burden, but I don’t think that increase would be prohibitively large.
Regarding the replicability adjustment, I’m not sure it implies that a larger sample size would be needed to make a substantial update based on new trial data (separate from the larger sample needed to handle the worm burden effect). The replicability adjustment was arrived at by starting with a prior based on short-term effect data and performing a Bayesian update based on the Miguel and Kremer follow-up results. If a replication study has the same statistical power as M&K, then the two can be pooled to make the update, and they should be given equal weight.
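As a rough sketch of what pooling with equal weight means, here is standard fixed-effect (inverse-variance) pooling in Python. I’m reading “same statistical power” as “same standard error on the same outcome scale”, which is a simplifying assumption, and the estimates and standard errors below are placeholders rather than numbers from M&K or GiveWell. This is not GiveWell’s actual procedure, just an illustration of the weighting.

```python
import math


def pool_estimates(estimates, standard_errors):
    """Fixed-effect pooling: weight each estimate by the inverse of its variance."""
    raw_weights = [1 / se**2 for se in standard_errors]
    total_weight = sum(raw_weights)
    pooled_estimate = sum(w * e for w, e in zip(raw_weights, estimates)) / total_weight
    pooled_se = math.sqrt(1 / total_weight)
    normalized_weights = [w / total_weight for w in raw_weights]
    return pooled_estimate, pooled_se, normalized_weights


# Placeholder values: an original study and a replication with equal standard errors.
pooled, pooled_se, weights = pool_estimates(estimates=[1.0, 0.2], standard_errors=[0.5, 0.5])
print(f"pooled estimate {pooled:.2f}, pooled SE {pooled_se:.3f}, weights {weights}")
```

When the standard errors match, each study gets a weight of one half and the pooled standard error shrinks by a factor of the square root of 2. The pooled result would then be what feeds into the update against the short-term-effects prior.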
Thinking about it qualitatively, if a replication trial showed a similar or greater effect size than Hamory et al. (2021) after accounting for the difference in worm burden, I would think that would imply a strong update away from GiveWell’s current replicability adjustment of 0.13. In fact, it might even suggest that deworming works via a different mechanism from the ones considered in the analysis underlying GiveWell’s adjustment. On the flip side, I don’t think that GiveWell would be recommending deworming if the Miguel and Kremer follow-ups had found a point estimate of zero for the relevant effect sizes (the entire cost-effectiveness model starts with the Hamory et al. numbers and adjusts them). So if a replication study came in with a negative point estimate for the effect size, GiveWell should probably update noticeably towards zero.
Zooming out, I think that information on deworming’s effectiveness in the presence of current worm burdens and health conditions would be very valuable. GiveWell has done an admirable job of trying to extrapolate from the Miguel and Kremer trial and its follow-ups to a bunch of extremely different environments, but they’re changing the point estimate by a factor of ~66 in doing so. To me, that implies that there’s really tremendous uncertainty here, and that even imperfect evidence in the current environment would be very useful. Since deworming is so cheap, I’m particularly worried about the case where it’s noticeably more effective than GiveWell is currently estimating, in which case EA donors would be leaving a big opportunity to do good on the table.
Thank you again for taking the time to read the post!
Thanks MHR. I agree that one shouldn’t need to insist on statistical significance, but if GiveWell thinks that the actual expected effect is ~12% of the MK result, then I think that updating on a trial with power similar to MK’s is close to updating on a coin flip, because such a trial is so underpowered to detect the expected effect.
I agree it would be useful to do this in a more formal Bayesian framework which accurately characterizes the GW priors. It wouldn’t surprise me if one of the conclusions was that I’m misinterpreting GiveWell’s current views, or that it’s hard to articulate a formal prior that gets you from the MK results to GiveWell’s current views.