Great comment. I don’t think anyone, myself included, would say the means are not the same and therefore everything is terrible. In the podcast, you can see my reluctance to do that when Rob is trying to get me to give one number that will easily summarize how much results in one context will extrapolate to another, and I just don’t want to play ball (which is not at all a criticism of him!). The number I tend to focus on these days (tau squared) is not one that is easily interpretable in that way—instead, it’s a measure of the unexplained variation in results—but how much is unexplained clearly depends on what model you are using (and because it is a variance, it really depends on units, making it hard to interpret across interventions except for those dealing with the same kind of outcome). On this view, if you can come up with a great model to explain away more of the heterogeneity, great! I am all for models that have better predictive power.
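To make that object concrete: tau squared is the between-study variance in a standard random-effects meta-analysis (the notation here is mine, not something from the podcast):

$$ y_i = \theta + u_i + \varepsilon_i, \qquad u_i \sim N(0, \tau^2), \qquad \varepsilon_i \sim N(0, \sigma_i^2), $$

where $y_i$ is study $i$'s estimated effect and $\sigma_i^2$ its known sampling variance. In a meta-regression, $\theta$ is replaced by $x_i'\beta$, so the $\tau^2$ left over depends on which covariates $x_i$ the model includes; and because $\tau^2$ is in the squared units of $y_i$, it is only comparable across interventions measured on the same kind of outcome.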
On the other hand:
1) I do worry that often people are not building more complicated models, but rather thinking about a specific study (or, if lucky, a group of studies), most likely biased towards those that found particularly large effects, since people seem to update more on positive results.
2) I am not convinced that focusing on mechanisms will completely solve the problem. I agree that interventions that are more theory-based should (in theory) have more similar results—or at least results that can be better predicted, which is more to the point. On the other hand, implementation details matter. I agree with Glennerster and Bates that there is an undue focus on setting—everyone wants an impact evaluation done in their particular location. Indeed, I think setting gets too much attention because (perhaps surprisingly) when I look in the AidGrade data, there is little to no effect of geography on the impact found, by which I mean that a result from (say) Kenya does not even generalize to Kenya very well (and I believe James Rising and co-authors have found similar results using a case study of conditional cash transfers); one way to probe this is sketched below. This isn’t always going to be true; for example, the effects of health interventions depend on the baseline prevalence of disease, and baseline prevalences can be geographically clustered. But what I worry about—without convincing evidence yet, so take this with a grain of salt—is that small implementation details might frequently wash out the gains from knowing the mechanisms. Hopefully, we will have more evidence on this in the future (whichever way that evidence goes), and I very much hope that the more positive view turns out to be true.
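For what it’s worth, here is a minimal sketch (mine, not the method behind the AidGrade results, and with made-up numbers) of one way to probe the geography point: estimate tau squared via the standard DerSimonian-Laird formula pooled across all studies of an intervention, and then again within each country. If the within-country values are nearly as large as the pooled one, setting is explaining little of the variation.

```python
import numpy as np

def dl_tau2(effects, ses):
    """DerSimonian-Laird estimate of tau^2, the between-study variance."""
    w = 1.0 / ses**2
    theta = np.sum(w * effects) / np.sum(w)   # inverse-variance (fixed-effect) mean
    q = np.sum(w * (effects - theta)**2)      # Cochran's Q statistic
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    return max(0.0, (q - (len(effects) - 1)) / c)

# Purely illustrative numbers; country labels are hypothetical.
effects = np.array([0.10, 0.35, -0.05, 0.22, 0.18, 0.40])
ses = np.array([0.08, 0.10, 0.07, 0.09, 0.06, 0.12])
country = np.array(["KE", "KE", "KE", "IN", "IN", "IN"])

print("pooled tau^2:", dl_tau2(effects, ses))
for k in np.unique(country):
    mask = country == k
    print(k, "within-country tau^2:", dl_tau2(effects[mask], ses[mask]))
```

If within-country tau squared stays close to the pooled value, then a result from Kenya tells you little even about the next Kenyan study, which is the pattern described above.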
I do agree with you that it’s possible that researchers (and policymakers?) are able to account for some of the other factors when making predictions. I also said that there was some evidence that people were updating more on the positive results; I need to dig into the data a bit more to do subgroup analyses, but one way to reconcile these results (which would be consistent with what I have seen using different data) is that some people may be better at it than others. There are definitely times when people are wildly off, as well. I don’t think I have a good enough sense yet of when predictions are good and when they are not, and getting a better sense of that would be valuable.
Edit: I meant to add, there are a lot of frameworks that people use to try to get a handle on when they can export results or how to generalize. In addition to the work cited in Glennerster and Bates, see Williams for another example. And talking with people in government, there are a lot of other one-off frameworks or approaches people use internally. I am a fan of this kind of work and think it highly necessary, even though I am quite confident it won’t get the appreciation it deserves within academia.