A low-reliability outcome measure attenuates the measured effect size. So if researchers measure the effect of one intervention with a high-quality outcome measure and the effect of another intervention with a lower-quality outcome measure, the use of different measures will inflate the apparent relative impact of the intervention that got the higher-quality measurement. Converting different scales into numbers of SDs puts them all on the same scale, but doesn't adjust for this measurement issue.
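To put the mechanism in rough formula terms (treating the unreliability as uncorrelated measurement noise, which is a simplifying assumption on my part), the standard attenuation relationship for a standardized mean difference is

$$ d_{\text{observed}} \approx d_{\text{true}} \times \sqrt{\rho_{YY}} $$

where $\rho_{YY}$ is the reliability of the outcome measure. An intervention measured with reliability 0.95 loses only about 2–3% of its effect size, while one measured with reliability 0.70 loses about 16%, so the first looks relatively better even if the two true effects are identical.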
For example, if you have a continuous outcome measure and you dichotomize it by taking a median split (so half get a score of zero and half get a score of one), that will shrink your effect size (in number of SDs) to about 80% of what it would have been on the continuous measure. So if you would have gotten an effect size of 0.08 SDs on the continuous measure, you'll find an effect size of about 0.064 SDs on the binary measure.
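Here is a rough sketch of where that 80% comes from (assuming an approximately normal underlying outcome, and using the small-effect approximation that the attenuation factor equals the correlation between the continuous variable and its dichotomized version):

```python
# Sketch: attenuation from a median split of a standard normal outcome.
# corr(X, 1{X > median}) = phi(0) / sqrt(p*q), and for small effects the
# effect size on the binary measure shrinks by roughly this same factor.
from scipy.stats import norm

p = q = 0.5                         # median split: half scored 0, half scored 1
factor = norm.pdf(0.0) / (p * q) ** 0.5
print(round(factor, 3))             # ~0.798, i.e. about 80%
print(round(0.08 * factor, 3))      # ~0.064, matching the example above
```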
I think that using a three-point scale to measure happiness should produce at least as much attenuation as taking a continuous measure and carving it up into three groups. Here are some sample calculations to estimate how much that attenuates the effect size. I believe the best-case scenario is if the responses are trichotomized into three equally sized groups, which would shrink the effect size to about 89% of what it would have been on the continuous measure, e.g. from 0.08 to 0.071. At a glance I don't see descriptive statistics for how many people selected each option on the happy123 measure in this study, so I can't do a calculation that directly corresponds to this study. (I also don't know how you did the measurement for the study of StrongMinds, which would be necessary for comparing them head-to-head.)
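As an illustration of that 89% best-case figure (again assuming an approximately normal underlying outcome, trichotomized into equal thirds scored 1, 2, 3):

```python
# Sketch: attenuation from trichotomizing a standard normal outcome into
# three equal-sized groups scored 1, 2, 3. The attenuation factor is the
# correlation between the underlying variable X and the trichotomized score C.
from scipy.stats import norm

cut = norm.ppf(2 / 3)               # upper cutpoint (~0.43); lower cutpoint is -cut
cov_xc = 3 * norm.pdf(cut) - 1 * norm.pdf(cut)   # E[X*C]; the middle bin contributes 0
var_c = (1**2 + 2**2 + 3**2) / 3 - 2**2          # Var(C), each score has probability 1/3
factor = cov_xc / var_c ** 0.5      # corr(X, C), since Var(X) = 1
print(round(factor, 3))             # ~0.891, i.e. about 89%
print(round(0.08 * factor, 3))      # ~0.071, matching the example above
```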
Hi Dan,
This is an interesting topic, but one we'd need more time to look into properly; we'd like to return to it when we do.
We agree that the 3-point measure is not optimal. However, we think our general conclusion still holds when we examine the effect using other measures of subjective wellbeing in the data (including a 1-10 scale and some 1-6 frequency scales). None of the other measures are significant, and we get a similar result (see Appendix A3.1).
Are you suggesting that this 11% shrinkage (1 − 0.89 = 0.11) would justify increasing the cost-effectiveness of deworming by 11%? If so, even such an adjustment applied to our ‘optimistic’ model (see Appendix A1) would not change our conclusion that deworming is not more cost-effective than StrongMinds (and even if it did, it wouldn’t change the larger problem that the evidence here is still very weak and noisy).
The StrongMinds analysis is based on a meta-analysis of psychotherapy in LMICs combined with some studies relevant to the StrongMinds method. This includes a lot of different types of measures with varying scale lengths.
I think the correct adjustment would involve multiplying the effect size by something like 1.1 or 1.2. But figuring out the best way to handle it would require looking into this issue in more depth and/or consulting someone with more expertise on this sort of statistical question.
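For a rough sense of where a multiplier in that range comes from: correcting for attenuation means dividing the observed effect by the estimated attenuation factor, e.g.

$$ \frac{1}{0.89} \approx 1.12 $$

in the best-case trichotomization scenario above, with more uneven response distributions implying a larger attenuation and hence a larger multiplier.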
This sort of adjustment wouldn’t change your bottom-line conclusions that the point estimate for deworming is smaller than the point estimate for StrongMinds, and that the deworming estimate is not statistically significant, but it would shift some of the distributions and probabilities that you discuss (including the probability that StrongMinds has a larger well-being effect than deworming).