I love this analysis, and I think it highlights how important model choice can be. Both constant and decaying treatment effects seem plausible to me. Instead of choosing only one of the two models (constant effect or decaying effect) and estimating as though this were the truth, a middle option is Bayesian model averaging:
https://journals.sagepub.com/doi/full/10.1177/2515245919898657
Along with the prior over the parameters within each model, you also have a prior over the models themselves (e.g. 50/50 constant vs decaying effect). The data will support some models more than others, so you get a posterior distribution over the models (say 33/67 constant vs decaying effect). The posterior probability of each model is then the weight you give that model when you estimate the treatment effect you’re interested in. It’s a formal way of incorporating model uncertainty into the estimate, and it would allow others to adjust the analysis based on their priors on the correct model (presumably GiveWell would start with a larger prior on constant effects, and you would start with a larger prior on decaying effects).
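As a rough sketch of the averaging step: assuming we already had a marginal likelihood for each model, the weighting works as below. All of the numbers (marginal likelihoods, per-model effect estimates) are purely illustrative, not taken from any real analysis.

```python
# Toy sketch of Bayesian model averaging over two models of a
# treatment effect. All numbers are hypothetical illustrations.

# Prior over the models themselves (e.g. 50/50).
prior = {"constant": 0.5, "decay": 0.5}

# Hypothetical marginal likelihoods p(data | model); in a real
# analysis these come from integrating each model's likelihood
# over its parameter prior.
marginal_lik = {"constant": 0.010, "decay": 0.020}

# Posterior model probabilities via Bayes' rule.
unnorm = {m: prior[m] * marginal_lik[m] for m in prior}
total = sum(unnorm.values())
posterior = {m: w / total for m, w in unnorm.items()}  # roughly 1/3 vs 2/3 here

# Each model's own posterior-mean treatment effect (hypothetical).
effect = {"constant": 1.00, "decay": 0.60}

# The model-averaged estimate weights each model's answer by its
# posterior probability.
bma_effect = sum(posterior[m] * effect[m] for m in prior)
print(bma_effect)
```

Changing the prior over models (the GiveWell-vs-you disagreement) just changes the weights in the last line, which is what makes the analysis easy for others to adjust.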
One (small) reason one might start with a larger prior on the constant effects model is to favor simplicity. In Bayesian model averaging, when researchers don’t assign equal priors to all models, I think the second most common choice is to penalize the model with more parameters in favor of the simpler one, which would mean a larger prior on the constant effects model in this case. I share your prior that decaying effects of economic programs in adults are the norm; I think it’s less clear for early childhood interventions that have positive effects in adulthood. This paper reviews some qualifying interventions (including deworming) - I’d be interested to see whether others have multiple long-run waves and, if so, whether those also show evidence of decaying effects.
https://www.nber.org/system/files/working_papers/w25356/w25356.pdf
I’m unclear on whether this works, since a constant effects model is just a decay model with the decay parameter set to zero. So you’re really setting hyperparameters on the distribution of the decay parameter, which is ordinary Bayesian modelling rather than model averaging.
Thanks for this point - I didn’t think clearly about how the models are nested. I think that means the BMA I describe is equivalent to a single model with a decay parameter (as you say), except that instead of a continuous prior on the decay parameter, the prior is a mixture with a point mass at zero - a spike-and-slab prior. I believe this is one Bayesian method for penalizing model complexity, similar in spirit to the lasso or ridge regression.
https://wesselb.github.io/assets/write-ups/Bruinsma,%20Spike%20and%20Slab%20Priors.pdf
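To make the point-mass idea concrete, here is a toy sketch of drawing from such a spike-and-slab prior on the decay parameter. The 0.5 spike weight and the Exponential(1) slab are illustrative assumptions of mine, not choices from the analysis above.

```python
import random

# Spike-and-slab prior on the decay parameter: with probability
# `spike` the decay is exactly zero (the constant-effects model);
# otherwise it is drawn from a continuous "slab" distribution.
def sample_decay(rng, spike=0.5):
    if rng.random() < spike:
        return 0.0                    # spike: point mass at zero
    return rng.expovariate(1.0)       # slab: Exponential(1), illustrative

rng = random.Random(0)                # fixed seed for reproducibility
draws = [sample_decay(rng) for _ in range(10_000)]
share_zero = sum(d == 0.0 for d in draws) / len(draws)
print(share_zero)  # close to the 0.5 spike weight
```

The spike weight plays the role of the prior model probability from before: the posterior mass left on exactly zero is the posterior weight on the constant-effects model.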
So I now realize that what I proposed could just be seen as putting an explicit penalty on the extra parameter needed in the decay model, where the penalty is the size of the point mass. The motivation for that would be to avoid overfitting, which isn’t how I thought of it originally.
Thank you for sharing this and those links. It would be useful to build a quantitative and qualitative summary of how and when early childhood interventions lead to long-term gains. An intervention can have a positive effect later in life and still show decay (or growth, or a constant effect, or a mix of these). In our case, we are particularly interested in effects on subjective wellbeing rather than income alone.
One (small) reason one might start with a larger prior on the constant effects model is to favor simplicity

I am a bit rusty on Bayesian model comparison, but - translating from my frequentist knowledge - I think the question isn’t so much whether the model is simpler, but how much error adding a parameter reduces. The decay model presumably fits the data better.
Any model with more degrees of freedom will always fit the data (that you have!) better. A decay model nests a constant effects model, because the decay parameter can be zero.
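A toy least-squares fit on made-up numbers illustrates the nesting point: because zero decay is inside the decay model’s parameter space, its in-sample fit can never be worse than the constant model’s. The data points and grid below are purely hypothetical.

```python
import math

# Hypothetical effect estimates at several follow-up waves.
years = [1, 3, 5, 9]
observed = [1.0, 0.8, 0.75, 0.5]

# Decay model: effect(t) = b * exp(-lam * t); lam = 0 recovers the
# constant-effects model.
def sse(b, lam):
    return sum((obs - b * math.exp(-lam * t)) ** 2
               for t, obs in zip(years, observed))

def best_b(lam):
    # For a fixed decay rate, the least-squares b has a closed form.
    x = [math.exp(-lam * t) for t in years]
    return sum(o * xi for o, xi in zip(observed, x)) / sum(xi * xi for xi in x)

sse_constant = sse(best_b(0.0), 0.0)   # lam fixed at zero
sse_decay = min(sse(best_b(lam / 100), lam / 100) for lam in range(0, 101))

# The decay model's search includes lam = 0, so in-sample it can
# only do at least as well as the constant model.
print(sse_decay <= sse_constant)  # True
```

This is exactly why in-sample fit alone can’t decide between the models, and why some complexity penalty (a marginal likelihood, a point-mass prior, or an information criterion) is needed.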