How Much Can We Generalize from Impact Evaluations? (link)

Tyler Cowen posted a link to this paper (PDF), which examines how well programs hold up when transported to new contexts or scaled up by governments.

Two key quotes:

The program implementer is the main source of heterogeneity in results, with government-implemented programs faring worse than and being poorly predicted by the smaller studies typically implemented by academic/NGO research teams, even controlling for sample size

and:

The average intervention-outcome combination is comprised 37% of positive, significant studies; 58% of insignificant studies; and 5% of negative, significant studies. If a particular result is positive and significant, there is a 61% chance the next result will be insignificant and a 7% chance the next result will be significant and negative, leaving only about a 32% chance the next result will again be positive and significant.

This seems a gloomy diagnosis of the state of interventions: promising small-scale interventions usually run into problems when scaled up or moved to a new context.
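
To make the arithmetic in the second quote concrete, here is a minimal Python sketch using only the percentages quoted above. The dictionary keys and the helper name `fails_to_replicate` are my own labels, not the paper's, and the quoted 32% is approximate ("about a 32% chance").

```python
# Percentages as quoted from the paper; the key names are illustrative labels.
unconditional = {
    "positive_significant": 37,
    "insignificant": 58,
    "negative_significant": 5,
}

# Distribution of the next result, given that one result was positive and significant.
after_positive = {
    "positive_significant": 32,
    "insignificant": 61,
    "negative_significant": 7,
}


def fails_to_replicate(dist):
    """Percent chance that a result is anything other than positive and significant."""
    return sum(pct for outcome, pct in dist.items() if outcome != "positive_significant")


# Each set of quoted figures should account for every outcome.
assert sum(unconditional.values()) == 100
assert sum(after_positive.values()) == 100

print(fails_to_replicate(unconditional))   # 63
print(fails_to_replicate(after_positive))  # 68
```

So, going by the quoted figures, roughly two in three follow-up results would not repeat a positive, significant finding.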