Nice to see a rare substantive post in the sea of meta ;)
This is a serious transparency issue
Weighting a strong prior over the evidence is a serious decision that GiveWell absolutely should have justified in public documents. I actually think that is a really important point to highlight, above and beyond the specifics of this particular CEA. Grantmakers like GiveWell exert a really large influence over EA giving and activities, and without strong transparency norms around how they make their decisions, we risk human error amplifying into very large misallocations of resources.
The only reason I could imagine for not sharing this justification is that it would increase researcher burden from an already-extensive CEA. In the tradeoff between researcher burden and transparency, I think there’s a pretty high bar for prioritizing researcher burden, and this does not seem to meet that bar. I would strongly support GiveWell publishing this justification and making it a norm for all future CEAs to justify the exclusion of evidence from a model.
Use prior distributions to shrink estimates, not non-specific discount factors
CEAs commonly use Bayesian hierarchical modelling to estimate true treatment effects from a small number of studies. To get these estimates requires plugging in a prior. If GiveWell thinks that 0.885 log units is unreasonable and 0.113 is more reasonable, then they should formalize that upfront in a prior distribution over the true treatment effect.
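To make that concrete, here is a minimal sketch of what writing that judgement down as a prior could look like, using a simple conjugate normal-normal model. The 0.885 log-unit figure is the unadjusted estimate under discussion; the standard error and the prior’s mean and SD below are purely illustrative assumptions on my part, not numbers from GiveWell’s CEA or the underlying study.

```python
# Minimal sketch (not GiveWell's actual procedure): conjugate normal-normal
# shrinkage of a treatment-effect estimate toward a sceptical prior.
# The 0.885 log-unit estimate is from the discussion above; the standard error
# and the prior parameters are illustrative assumptions only.

observed_effect = 0.885   # unadjusted long-run effect, log units
observed_se = 0.30        # assumed standard error (illustrative)

prior_mean = 0.0          # sceptical prior centred on no effect
prior_sd = 0.10           # assumed prior SD: large effects judged a priori unlikely

# Posterior mean is a precision-weighted average of the prior mean and the data.
prior_precision = 1 / prior_sd ** 2
data_precision = 1 / observed_se ** 2
posterior_mean = (prior_precision * prior_mean + data_precision * observed_effect) \
                 / (prior_precision + data_precision)
posterior_sd = (prior_precision + data_precision) ** -0.5

print(f"posterior mean: {posterior_mean:.3f} log units")  # ~0.089 with these inputs
print(f"posterior sd:   {posterior_sd:.3f}")
```

The point isn’t these particular numbers; it’s that the prior doing the discounting is written down in advance and can be debated on its own terms, before anyone sees where the posterior lands.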
Yes, you can still tinker with a prior ex post, so it doesn’t solve the issue completely. Yes, shrinking the estimate is technically equivalent to applying a prior distribution that places low weight on 0.885 and more weight on lower values. That doesn’t mean these are practically equivalent procedures. It’s much easier for motivated reasoning to slip into ex-post adjustments. Moreover, ex-post adjustments are impossible to interpret except in how they change the final estimate, whereas people can interpret priors before the analysis is done and debate whether they are reasonable. So I don’t think non-specific discounting of estimates is a good practice.
However, I think your claim that discounting should only be based on specific factors is too strong. Any reasonable prior over the true treatment effect discounts some treatment estimates simply because of their “unreasonable” magnitude. CEAs aren’t mechanical and shouldn’t be treated as such in the name of following the evidence.
Both your calculations and GiveWell’s should count consumption benefits over the lifetime, not 40 years
40 years may cover the working lives of the subjects, but benefits accrued while working will almost certainly increase consumption beyond working age. First, savings accumulated while working will be consumed after retirement. Moreover, higher income while working likely leads recipients’ children to earn more and thus to support their parents better in old age. So consumption benefits counted only over the working life likely substantially understate the real consumption benefits of any treatment.
I would understand ignoring or substantially discounting consumption benefits to the future children of deworming recipients because of decay—but consumption benefits should still be counted over people’s whole lifetime, say 55 years. I could not find a spreadsheet with your calculations to plug a longer time horizon in, but I would be very curious to see how your results change with a longer window of effects.
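I can’t rerun your numbers, but here’s a rough, self-contained sketch of why the horizon matters under standard exponential discounting. The unit annual benefit, the 4% discount rate, and the 10-year delay before benefits begin are all illustrative assumptions, not figures from your post or from GiveWell’s CEA.

```python
# Present value of a constant annual consumption benefit under exponential
# discounting, compared over a 40-year and a 55-year window of effects.
# All numbers here are illustrative assumptions, not GiveWell's parameters.

def present_value(annual_benefit, years, discount_rate=0.04, delay=0):
    """Sum of discounted benefits from year `delay` to year `delay + years - 1`."""
    return sum(annual_benefit / (1 + discount_rate) ** t
               for t in range(delay, delay + years))

benefit = 1.0  # one arbitrary unit of extra consumption per year
pv_40 = present_value(benefit, years=40, delay=10)  # benefits assumed to start 10 years post-treatment
pv_55 = present_value(benefit, years=55, delay=10)

print(f"PV over 40 years: {pv_40:.2f}")
print(f"PV over 55 years: {pv_55:.2f}")
print(f"ratio (55y / 40y): {pv_55 / pv_40:.2f}")
```

At a moderate discount rate the extra 15 years adds a modest amount to the present value; at lower discount rates the longer window matters considerably more.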
On non-specific discount factors: one approach which I was interested in when doing a lot of this work was to use estimates we have of how much effect sizes shrink when more and/or larger studies are conducted.
For example, in this paper Eva Vivalt, using a sample of impact evaluations, regresses effect size on variables like number of studies and sample size. As one would expect, the larger the sample size, the smaller the estimated effect size. I always wondered if you could just use the regression coefficients she presents to estimate how much an effect size would be expected to shrink if one conducted a larger study.
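As a very rough sketch of how that extrapolation might work: the functional form and the units of the coefficient would have to be taken from the paper itself, so everything below (including treating the coefficient as being per 1,000 participants) is a placeholder assumption rather than Vivalt’s actual specification.

```python
# Hedged sketch: extrapolate how an effect estimate might shrink in a larger
# study, using a regression coefficient of effect size on sample size from a
# meta-analysis. The coefficient, its units (assumed here to be per 1,000
# participants), and the linear functional form are placeholder assumptions.

def extrapolated_effect(observed_effect, current_n_thousands,
                        larger_n_thousands, coef_per_thousand):
    """Linear extrapolation: effect + coefficient * (change in sample size)."""
    return observed_effect + coef_per_thousand * (larger_n_thousands - current_n_thousands)

# Placeholder example: a 0.3 SD effect observed at n = 300, asked what we'd
# expect at n = 3,000, with an assumed coefficient of -0.011 per 1,000 participants.
shrunk = extrapolated_effect(0.3, current_n_thousands=0.3,
                             larger_n_thousands=3.0, coef_per_thousand=-0.011)
print(f"extrapolated effect size: {shrunk:.3f} SD")  # 0.270 with these placeholder inputs
```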
I don’t think this strikes at exactly what you or HLI are trying to get at. But I do think it’s valuable to ponder how we might get at “principled” discount rates, given all we know about the validity problems these studies exhibit.
> For example, in this paper Eva Vivalt, using a sample of impact evaluations, regresses effect size on variables like number of studies and sample size. As one would expect, the larger the sample size, the smaller the estimated effect size. I always wondered if you could just use the regression coefficients she presents to estimate how much an effect size would be expected to shrink if one conducted a larger study.
This is an interesting idea, but a note of caution here is that effect sizes could shrink in larger studies for two reasons: (1) the “good” reasons of less publication bias, more power, and so on; and (2) the “bad” (bias) reason that larger studies may be more likely to be implemented loosely (by a government rather than a motivated NGO, for example). The latter issue isn’t statistical; it’s that a genuinely different treatment is being applied.
Whether or not this matters depends on exactly what question you’re asking, but there is some risk in blurring these two sources of shrinkage when using study size as the predictor.
Yep, totally agree that this would be tricky! There’d be a lot of details to think through. I would note that Vivalt does run regressions where, e.g., the kind of organization implementing the program (government vs NGO) is included as a covariate, and the coefficient on sample size doesn’t change much (−0.011 vs −0.013 in the single linear regression; see Table 7, p. 31).
I am not sure I’m a fan of this, because the true effect of an intervention will vary across places in ways that will affect the results of future studies but shouldn’t affect our assessment of deworming in this context. For example, future studies might extend deworming to countries where worm burdens are lower and thus find smaller effects, but it would be a mistake to conclude that deworming in Kenya is less effective on the basis of those studies.
You might say we can control for the country being studied, but that only works if there are many studies in a single country, which is rarely true.