As promised, I am returning here with some more detail. I will break this (very long) comment into sections for the sake of clarity.
My overview of this discussion
It seems clear to me that what is going on here is that there are conflicting interpretations of the evidence on StrongMinds’ effectiveness. In particular, the key question here is what our estimate of the effect size of SM’s programs should be. There are other uncertainties and disagreements, but in my view, this is the essential crux of the conversation. I will give my own (personal) interpretation below, but I cannot stress enough that the vast majority of the relevant evidence is public—compiled very nicely in HLI’s report—and that neither FP’s nor GWWC’s recommendation hinges on “secret” information. As I indicate below, there are some materials that can’t be made public, but they are simply not critical elements of the evaluation, just quotes from private communications and things of that nature.
We are all looking at more or less the same evidence and coming to different conclusions.
I also think there is an important subtext to this conversation, which is the idea that both GWWC and FP should not recommend things for which we can’t achieve bednet-level confidence. We simply don’t agree, and accordingly this is not FP’s approach to charity evaluation. As I indicated in my original comment, we are risk-neutral and evaluate charities on the basis of expected cost-effectiveness. I think GiveWell is about as good as an organization can be at doing what GiveWell does, and for donors who prioritize their giving conditional on high levels of confidence, I will always recommend GiveWell top charities over others, irrespective of expected value calculations. It bears repeating that even with this orientation, we still think GiveWell charities are around twice as cost-effective as StrongMinds. I think Founders Pledge is in a substantially different position, and from the standpoint of doing the most possible good in the world, I am confident that risk-neutrality is the right position for us.
We will provide our recommendations, along with any shareable information we have to support them, to anyone who asks. I am not sure what the right way for GWWC to present them is.
How this conversation will and won’t affect FP’s position
What we won’t do is take immediate steps (like, this week) to modify our recommendation or our cost-effectiveness analysis of StrongMinds. My approach to managing FP’s research is to try to thoughtfully build processes that maximize the good we do over the long term. This is not a procedure fetish; it is a commonsensical way to ensure that we prioritize our time well and give important questions the resources and systematic thought they deserve.
What we will do is incorporate some important takeaways from this conversation during StrongMinds’ next re-evaluation, which will likely happen in the coming months. To my eye, the most important takeaway is that our rating of StrongMinds may not sufficiently account for uncertainty around effect size. Incorporating this uncertainty would deflate SM’s rating and may bring it much closer to our bar of 1x GiveDirectly.
More generally, I do agree with the meta-point that our evaluations should be public. We are slowly but surely moving in this direction over time, though resource constraints make it a slow process.
FP’s materials on StrongMinds
A copy of our CEA. I’m afraid this may not be very elucidating, as essentially all we did here is take HLI’s estimates and put them into a format that works better with our ratings system. One note is that we don’t apply any subjective discounts in this CEA—this is the kind of thing I expect might change in future.
Some exploration I did in R and Stan to try to test various components of the analysis. In particular, this contains several attempts to use SM’s pre-post data (corrected for a hypothesized counterfactual) to update on several different, more general priors; a rough sketch of this kind of update follows this list. Of particular interest are this review, from which I took a prior on psychosocial interventions in LMICs, and this one, which offers a much more outside-view-y prior.
Crucially, I really don’t think this type of explicit Bayesian update is the right way to estimate effects here; I much prefer HLI’s way of estimating effects (it leaves a lot less data on the table).
The main goal of this admittedly informal analysis was to test under what alternate analytic conditions our estimate of SM’s effectiveness would fall below our recommendation bar.
We have an internal evaluation template that I have not shared, since it contains quotes from private communications with StrongMinds. There’s nothing mysterious or particularly informative here; we just don’t share details of private communications that weren’t conducted with the explicit expectation that they’d be shared. This is the type of template that in future we hope to post publicly with privileged communications excised.
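For concreteness, here is a minimal sketch of the kind of explicit Bayesian update I mean, assuming a normal prior on the effect size (standardized mean difference) taken from one of those more general reviews and treating a counterfactual-corrected pre-post estimate as a single noisy observation. All numbers are illustrative placeholders rather than the actual inputs, and the real exploration is more involved than this:

```r
# Minimal sketch of an explicit Bayesian update on an effect size (SMD).
# All numbers below are illustrative placeholders, not the actual inputs.

# Prior on the effect, e.g. from a review of psychosocial interventions in LMICs
prior_mean <- 0.5   # prior mean effect (SMD)
prior_sd   <- 0.3   # prior standard deviation

# "Observation": SM's pre-post change, deflated by a hypothesized counterfactual
obs_effect <- 1.0   # counterfactual-corrected pre-post estimate (SMD)
obs_se     <- 0.4   # standard error of that estimate

# Conjugate normal-normal update (precision-weighted average)
post_precision <- 1 / prior_sd^2 + 1 / obs_se^2
post_mean      <- (prior_mean / prior_sd^2 + obs_effect / obs_se^2) / post_precision
post_sd        <- sqrt(1 / post_precision)

cat(sprintf("posterior effect: %.2f (sd %.2f)\n", post_mean, post_sd))
```

The Stan version simply expresses the same structure as a sampling model; the takeaway is that the posterior gets pulled toward whichever more general prior you start from, which is why the choice of prior matters so much in this kind of exercise.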
How I view the evidence about StrongMinds
Our task as charity evaluators is, to the extent possible, to quantify the important considerations in estimating a charity’s impact. When I reviewed HLI’s work on StrongMinds, I was very satisfied that they had accounted for many different sources of uncertainty. I am still pretty satisfied, though I am now somewhat more uncertain myself.
A running theme in critiques of StrongMinds is that the effects they report are unbelievably large. I agree that they are very large. I don’t agree that the existence of large-seeming effects is itself a knockdown argument against recommending this charity. It is, rather, a piece of evidence that we should consider alongside many other pieces of evidence.
I want to oversimplify a bit by distinguishing between two different views of how SM could end up reporting very large effect sizes:
1. The reported effects are essentially made up. The intervention has no effect at all, and the illusion of an effect is driven by fraud at worst and severe confirmation bias at best.
2. The reported effects are severely inflated by selection bias, social desirability bias, and other similar factors.
I am very satisfied that (1) is not the case here. There are two reasons for this. First, the intervention is well supported by a fair amount of external evidence. This program is not “out of nowhere”; there are good reasons to believe it has some (possibly small) effect. Second, though StrongMinds’ recent data collection practices have been wanting, they have shown a willingness to be evaluated (the existence of the Ozler RCT is a key data point here). In their interactions with FP, StrongMinds were extremely responsive to questions, and forthcoming and transparent in their answers.
Now, I think (2) is very likely to be the case. At FP, we increasingly try to account for this uncertainty in our CEAs. As you’ll note in the link above, we didn’t do that in our last review of StrongMinds, yielding a rating of roughly 5-6x GiveDirectly (per our moral weights, we value a WELLBY at about $160). So the question here is: how much of the observed effect is due to bias? If it’s 80%, we should deflate our rating to roughly 1.2x (i.e., 6 × 0.2) at StrongMinds’ next review. In this scenario it would still clear our bar (though only just).
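To make that threshold concrete, here is a quick back-of-the-envelope check (a sketch only; the 5-6x figures are the ones quoted above, and everything else follows from our 1x GiveDirectly bar):

```r
# Illustrative only: how much of the observed effect would have to be bias
# before a 5-6x GiveDirectly rating falls below the 1x recommendation bar?

rating         <- c(5, 6)           # current rating, in multiples of GiveDirectly
breakeven_bias <- 1 - 1 / rating    # bias share at which the rating hits exactly 1x
round(breakeven_bias, 2)            # 0.80 0.83
```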
In the absence of prior evidence about IPT-g, I think we would likely conclude that the observed effects are overwhelmingly due to bias. But I don’t think this is a Pascal’s Mugging-type scenario. We are not seeing a very large, possibly dubious effect that remains large in expectation even after deflating for dubiousness. We are seeing a large effect that is very broadly in line with the kind of effect we should expect on priors.
What I expect for the future
In my internal forecast attached to our last evaluation, I gave an 80% probability to us finding that SM would have an effectiveness of between 5.5x and 7x GD at its next evaluation. I would now lower this significantly, to something like 40%, and overall I think there’s a 70-80% chance we’ll still be recommending SM after its next re-evaluation.
During the re-evaluation, it would be great if FP could also check StrongMinds’ partnership programme—e.g. whether this is an additional source of revenue for them, and what the operational costs are for the partners who help treat additional patients. At the moment these costs are not incorporated into HLI’s CEA, but partners were responsible for ~50% and ~80% of the clients treated in 2021 and 2022, respectively. For example, if we crudely assume the cost of treatment per client is the same regardless of whether the client is treated by StrongMinds or by a StrongMinds partner, then (as sketched after the bullets below):
Starting with 5x GiveDirectly and using 2021 figures, if >~60% of the observed effect is due to bias, it will be <1x GiveDirectly.
Starting with 5x GiveDirectly and using 2022 figures, if >~0% of the observed effect is due to bias, it will be <1x GiveDirectly.
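To spell out the arithmetic behind those two bullets, here is a rough sketch under the same crude assumptions (constant cost per client, partner costs currently excluded from the CEA, and a 5x GiveDirectly starting point):

```r
# Rough sketch of the cost adjustment in the bullets above. Assumes the cost of
# treating a client is the same for StrongMinds and its partners, and that
# partner costs are not currently counted in the CEA.

adjusted_multiple <- function(start_multiple, partner_share, bias_share) {
  cost_inflation <- 1 / (1 - partner_share)   # all-in costs vs. StrongMinds-only costs
  start_multiple * (1 - bias_share) / cost_inflation
}

adjusted_multiple(5, partner_share = 0.5, bias_share = 0.6)   # 2021 figures: ~1.0x GiveDirectly
adjusted_multiple(5, partner_share = 0.8, bias_share = 0.0)   # 2022 figures: ~1.0x GiveDirectly
```

Both land right at the ~1x threshold, which is where the >~60% and >~0% cut-offs come from.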
(Thanks again for all your work, looking forward to the re-evaluation!)
Thanks, bruce — this is a great point. I’m not sure if we would account for the costs in the exact way I think you have done here, but we will definitely include this consideration in our calculation.
Out of interest, what do your probabilities correspond to in terms of the outcome of the Ozler RCT? (Or is your uncertainty more in terms of what you might find when re-evaluating the entire framework?)
I haven’t thought extensively about what kind of effect size I’d expect, but I think I’m roughly 65-70% confident that the RCT will return evidence of a detectable effect.
But my uncertainty is more about the rating we’ll arrive at upon re-evaluating the whole thing. Since I reviewed SM last year, we’ve started to be a lot more punctilious about incorporating various discounts and forecasts into CEAs. So on the one hand I’d naturally expect us to apply more of those discounts on reviewing this case, but on the other hand my original reason for not discounting HLI’s effect size estimates was my sense that their meta-analytic weightings appropriately accounted for a lot of the concerns that we’d discount for. This generates uncertainty that I expect we can resolve once we dig in.