I liked this a lot. For context, I work as an RA on an impact evaluation project. I have light interests / familiarity with meta-analysis + machine learning, but I did not know what surrogate indices were going into the paper. Some comments below, roughly in order of importance:
Unclear contribution. I feel there are three contributions here: (1) an application of the surrogate method to long-term development RCTs, (2) a graduate-level intro to the surrogate method, and (3) a new M-Lasso method, which I mostly ignored. I read the paper mostly for the first two contributions, so I was surprised to find out that the novel contribution was actually M-Lasso.
Missing relevance for “Very Long-Run” Outcomes. Given the mission of Global Priorities Institute, I was thinking throughout how the surrogate method would work when predicting outcomes on a 100-year horizon or 1000-year horizon. Long-run RCTs will get you around the 10-year mark. But presumably, one could apply this technique to some historical econ studies with (I would assume) shaky foundations.
Intuition and layout are good. I followed a lot of this pretty well despite not knowing the fiddly mechanics of many methods. And I had a good idea of what insight I would gain if I dived into the details in each section. It’s also great that the paper led with a graph diagram and progressed from simple kitchen sink regression before going into the black box ML methods.
Estimator properties could use more clarity.
Unsure what “negative bias” is. I don’t know if the “negative bias” in the surrogate index is an empirical result arising from this application, or a theoretical result where the estimator is biased in a negative direction. I’m also unsure if this is attenuation (biasing towards 0) or an honest-to-god negative bias. The paper sometimes mentions attenuation and other times negative bias, but as far as I can tell, there’s one surrogacy technique used.
Is the surrogate index biased and inconsistent? Maybe machine learning sees this differently, but I think of estimators as ideally being unbiased and consistent (i.e. consistent meaning more probability mass around the true value as sample size tends to infinity). I get that the surrogate index has a bias of some kind, but I’m unclear on whether there’s also the asymptotic property of consistency. And at some point, a limit is mentioned but not what it’s a limit with respect to (larger sample size within each trial is my guess, but I’m not sure).
How would null effects perform? I might be wrong about this but I think normalization of standard errors wouldn’t work if treatment effects are 0...
Got confused on the relation between the Prentice criterion and regular unconfoundedness. Maybe this is something I just have to sit down and learn one day, but I initially read the Prentice criterion as a standard econometric assumption of exogeneity. But then the theory section mentions the Prentice criterion (Assumption 3) as distinct from unconfoundedness (Assumption 1). It is good the assumptions are spelt out, since that pointed out a bad assumption I was working with, but perhaps this can be clarified.
Analogy to Instrumental Variables / mediators could use a bit more emphasis. The econometric section (lit review?) buries this analogy towards the end. I’m glad it’s mentioned since it clarifies the first-stage vibes I was getting through the theory section, but I feel it’s (1) possibly a good hook to lead the theory section and (2) something worth discussing a bit more.
Could expand Table 1 with summary counts on outcomes per treatment. 9 RCTs sounds tiny, until I remember that these have giant sample sizes, multiple outcomes, and multiple possible surrogates. A summary table of sample size, outcomes, and surrogates used might give a bit more heft to what’s forming the estimates.
Other stuff I really liked
The “selection bias” in long-term RCTs is cool. I like the paragraph discussing how these results are biased by what gets a long-term RCT. Perhaps it’s worth emphasizing this as a limitation in the intro, or perhaps it’s a good follow-on paper. Another idea is how surrogates would perform with dynamic effects that grow over time. Urban investments, for example, might have no effect until agglomeration kicks in.
The surprising result of surrogates being more precise than actual RCT outcomes. This was a pretty good hook for me, but I could have easily passed over it in the intro. I also think the result here captures the core intuition of bias-variance tradeoff + surrogate assumption in the paper quite strongly.
Hi Geoffrey, thanks for these comments, they are really helpful as we move to submitting this to journals. Some miscellaneous responses:
I’d definitely be interested in seeing a project where the surrogate index approach is applied to even longer-run settings, especially in econ history as you suggest. You could see this article as testing whether the surrogate index approach works in the medium-run, so thinking about how well it works in the longer-run is a very natural extension. I spent some time thinking about how to do this during my PhD and datasets you might do it with, but didn’t end up having capacity. So if you or anyone else is interested in doing this, please get in touch! That said, I don’t think it makes sense to combine these two projects (econ history and RCTs) into one paper, given the norms of economics articles and subdiscipline boundaries.
4a. The negative bias is purely an empirical result, but one that we expect to arise in many applications. We can’t say for sure whether it’s always negative or attenuation bias, but the hypothesis we suggest to explain it is compatible with attenuation bias of the treatment effects to 0 and treatment effects generally being positive. However, when we talk about attenuation in the paper, we’re typically talking about attenuation in the prediction of long-run outcomes, not attenuation in the treatment effects.
4b. The surrogate index is unbiased and consistent if the assumptions behind it are satisfied. This is the case for most econometric estimators. What we do in the paper is show that the key surrogacy assumption is empirically not perfectly satisfied in a variety of contexts. Since this assumption is not satisfied, the estimator is empirically biased and inconsistent in our applications. However, this is not what people typically mean when they say an estimator is theoretically biased and inconsistent. Personally, I think econometrics focuses too heavily on unbiasedness, and I am sympathetic to the ML willingness to trade off bias and variance. I also think it cares too much about asymptotic properties of estimators and too little about how well they perform in these empirical LaLonde-style tests.
4c. The normalisation depends on the standard deviation of the control group, not the standard error, so we should be fine to do that regardless of what the actual treatment effect is. We would be in trouble if there was no variation in the control group outcome, but this seems to occur very rarely (or never).
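To make the point in 4b concrete, here is a minimal simulation sketch. It is not the paper's code or data: the linear data-generating process, the parameter values (`beta`, `gamma`, `direct_effect`), and the function names are all assumptions chosen for illustration. It uses a single surrogate and a plain OLS index, and it models a surrogacy violation in one hypothetical way, as a direct effect of treatment on the long-run outcome that bypasses the surrogate.

```python
import numpy as np

def surrogate_index_estimate(s_obs, y_obs, s_exp, w_exp):
    """Surrogate-index estimate of the long-run treatment effect (one surrogate, OLS)."""
    # Step 1: regress the long-run outcome on the surrogate in the observational sample.
    slope, intercept = np.polyfit(s_obs, y_obs, deg=1)
    # Step 2: predict the long-run outcome for everyone in the experimental sample.
    index = intercept + slope * s_exp
    # Step 3: compare mean predicted outcomes across treatment arms.
    return index[w_exp == 1].mean() - index[w_exp == 0].mean()

def simulate(direct_effect, n=20000, seed=0):
    """Return (surrogate-index estimate, experimental long-run estimate).

    direct_effect is a hypothetical effect of treatment on the long-run outcome
    that does not pass through the surrogate; 0.0 means surrogacy holds.
    """
    rng = np.random.default_rng(seed)
    beta, gamma = 2.0, 0.5  # assumed: surrogate -> outcome, treatment -> surrogate

    # Observational sample: no treatment, surrogate and long-run outcome observed.
    s_obs = rng.normal(size=n)
    y_obs = beta * s_obs + rng.normal(size=n)

    # Experimental sample: randomised treatment w.
    w = rng.integers(0, 2, size=n)
    s_exp = gamma * w + rng.normal(size=n)
    y_exp = beta * s_exp + direct_effect * w + rng.normal(size=n)

    estimate = surrogate_index_estimate(s_obs, y_obs, s_exp, w)
    benchmark = y_exp[w == 1].mean() - y_exp[w == 0].mean()
    return estimate, benchmark

for d in (0.0, 1.0):
    est, bench = simulate(direct_effect=d)
    print(f"direct_effect={d}: surrogate index={est:.2f}, long-run RCT={bench:.2f}")
```

In this setup the true long-run effect is `beta * gamma + direct_effect`. With `direct_effect=0.0` the two estimates should agree up to sampling noise; with a positive direct effect the surrogate index misses that channel and falls below the experimental benchmark however large `n` is, which is the sense in which a violated assumption makes the estimator empirically biased and inconsistent.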
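The normalisation described in 4c can be sketched as follows. This is an illustrative helper, not the paper's implementation: the function name `standardized_effect` and the simulated data are assumptions. It shows that dividing by the control-group standard deviation is well defined when the treatment effect is exactly zero, and fails only when the control group has no variation.

```python
import numpy as np

def standardized_effect(treated, control):
    """Difference in means, scaled by the control-group standard deviation."""
    treated = np.asarray(treated, dtype=float)
    control = np.asarray(control, dtype=float)
    sd_control = control.std(ddof=1)  # sample standard deviation of the control group
    if sd_control == 0:
        # The only problematic case: no variation in the control-group outcome.
        raise ValueError("control-group outcome has no variation")
    return (treated.mean() - control.mean()) / sd_control

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=2.0, size=500)  # simulated control outcomes
treated_null = control.copy()                        # a treatment arm with an exactly null effect

print(standardized_effect(treated_null, control))    # 0.0
```

The denominator depends only on the spread of the control-group outcome, so a null effect simply gives a standardised effect of zero rather than a division problem.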