Currently Research Director at Founders Pledge, but posts and comments represent my own opinions, not FP’s, unless otherwise noted.
I worked previously as a data scientist and as a journalist.
I can also vouch for HLI. Per John Salter’s comment, I may also have been a little sus early on (sorry, Michael), but HLI’s work has been extremely valuable for our own methodology improvements at Founders Pledge. The whole team is great, and I will second John’s comment to the effect that Joel’s expertise is really rare and that HLI seems to be the right home for it.
“I think my main takeaway is my first one here. GWWC shouldn’t be using your recommendations to label things top charities. Would you disagree with that?”
Yes, I think so; I’m not sure why that should be the case. Different evaluators have different standards of evidence, and GWWC is using ours for this particular recommendation. They reviewed our reasoning and (I gather) were satisfied. As someone else said in the comments, the right reference class here is probably deworming: “big if true.”
The message on the report says that some details have changed, but that our overall view is represented. That’s accurate, though there are some details that are more out of date than others. We don’t want to just remove old research, but I’m open to the idea that this warning should be more descriptive.
I’ll have to wait until next week to address more substantive questions, but it seems to me that the recommend/don’t recommend question is the most cruxy one here.
EDIT:
On reflection, it also seems cruxy that our current evaluation isn’t yet public. This seems very fair to me, and I’d be very curious to hear GWWC’s take. We would like to make all evaluation materials public eventually, but this is not as simple as it might seem and especially hard given our orientation toward member giving.
Though this type of interaction is not ideal for me, it seems better for the community. If they can’t be totally public, I’d rather our recs be semi-public and subject to critique than totally private.
(I am research director at FP)
Thanks for all of your work on this analysis, Vasco. We appreciate your thoroughness and your willingness to engage with us beforehand. The work is obviously methodologically sound and, as Johannes indicated, we generally agree that climate is not among the top bets for reducing existential risk.
I think that “mitigating existential risk as cost-effectively as possible” is entailed by the goal of doing as much good as possible in the world, which is why FP exists. To be absolutely clear, FP’s goal is to do the maximum possible amount of good, and to do so in a cause-neutral way.
A common misconception about our research agenda is that it is driven by the interests of our members. This is most assuredly not the case. To some degree, member-driven research was a component of previous iterations of the research team, and our movement away from this is indeed a relatively recent change. There remain some exceptions, but as a general rule we do not devote research resources to any cause area or charity investigation unless we have a good reason to suspect it might be genuinely valuable from a strictly cause-neutral standpoint.
Still, FP does operate under some constraints, one of which is that many of our 1700 members are not cause-neutral. This is by design. We facilitate our members’ charitable giving to all (legal and feasible) grantees in hopes that we can influence some portion of this money toward highly effective ends. This works. Since our members are often not EAs, such giving is strictly counterfactual: in the absence of FP’s recommendations, it simply would not have been given to effective charities.
Climate plays two roles in a portfolio that is constrained in this way. First, it introduces members who are not cause-neutral to our way of thinking about problems and solutions, which builds credibility and opens the door to further education on cause areas that might not immediately resonate with them (e.g. AI risk). This also works. Second, it reallocates non-cause-neutral funds to the most effective opportunities within a cause area in which the vast majority of philanthropic funds are, unfortunately, misspent. As I have tried to work out in my Shortform, this reallocation can be cost-effective under certain conditions even within otherwise unpromising cause areas (of which climate is not one).
Finally, I do want to emphasize that the Climate Fund does not serve a strictly instrumental role. We genuinely think that the climate grants we make and recommend are a comparatively cost-effective way to improve the value of the long-term future, though not the most cost-effective way. I don’t see any particular tension in that: every EA charity evaluator (or grantmaker) recommends (or grants to) options across a wide range of cost-effectiveness. From our perspective, the Climate Fund is better than most things, but not as good as the best things.
I disagree with the valence of the comment, but think it reflects legitimate concerns.
I am not worried that “HLI’s institutional agenda corrupts its ability to conduct fair-minded and even-handed assessment.” I agree that there are some ways that HLI’s pro-SWB-measurement stance can bleed into overly optimistic analytic choices, but we are not simply taking analyses by our research partners on faith and I hope no one else is either. Indeed, the very reason HLI’s mistakes are obvious is that they have been transparent and responsive to criticism.
We disagree with HLI about SM’s rating — we use HLI’s work as a starting point and arrive at an undiscounted rating of 5-6x; subjective discounts place it between 1-2x, which squares with GiveWell’s analysis. But our analysis was facilitated significantly by HLI’s work, which remains useful despite its flaws.
Thank you so much for writing this. This is one of my central areas of interest, and I’ve been puzzled by the comparative lack of resources expended by the EA community on institutional decision-making given the apparently high degree of importance accorded to it by many of us.
This is a great guide. I agree that the central question here is whether or not deliberative democracy leads to better outcomes. If it does, or even if it probably does, it seems that it’s easily one of the highest-value potential cause areas, since the levers that influence many other cause areas are within reach of democratic polities.
With that in mind, it seems clear to me that the primary way in which deliberation is EA-relevant is as a large-scale decision-making mechanism. So it seems like relatively small-scale uses are not very important to us, and information about these small-scale successes may not be useful, since instituting these mechanisms at a large scale is likely to present problems that differ in kind, not just degree. I’d love to hear your thoughts on that.
I have a few other thoughts about this review, and I’d like to hear your responses if you have the time.
• Basically all of the cross-country comparisons in this review suffer from reverse causation. Countries that have lots of deliberation and good outcomes don’t necessarily have the former causing the latter; the former could rather be just another instance of the latter. As enthused as I am about deliberative democracy, this scenario seems just as likely as the causal one. Is there any reason to view these correlations as suggestive of a causal effect?
• It seems like this review contains a relative paucity of research supporting the null hypothesis that deliberation does not improve decision making (or, for that matter, the alternative hypothesis that it actually worsens decision making). Were you unable to find studies taking this position? If not, how worried are you about the file-drawer effect here?
• Based on your reading of all this evidence, I’d love to hear your subjective first impressions: what do you personally feel is the “best bet” for enacting deliberative democracy on a large scale somewhere besides China? How far do you think this could feasibly go, and how long would you expect such a change to take? Very wide confidence bands on these estimates are fine, of course.
I don’t have anything to say except that I loved this, and I’m really happy somebody is starting to present a warmer and fuzzier side of EA.
Has there been any formal probabilistic risk assessment on AI X-risk? e.g. fault tree analysis or event tree analysis — anything of that sort?
How harmful is a fragmented resume? People seem to believe this isn’t much of a problem for early-career professionals, but I’m 30, and my longest tenure was for two and a half years (recently shorter). I like to leave for new and interesting opportunities when I find them, but I’m starting to wonder whether I should avoid good opportunities for the sake of appearing more reliable as a potential employee.
I appreciate your taking the time to write out this idea and the careful thought that went into your post. I liked that it was kind of in the form of a pitch, in keeping with your journalistic theme. I agree that EAs should be thinking more seriously about journalism (in the broadest possible sense) and I think that this is as good a place as any to start. I want to (a) nitpick a few things in your post with an eye to facilitating this broader conversation and (b) point out what I see as an important potential failure mode for an effort like this.
You characterize The Altruist at first as:
a news agency that provides journalistic coverage of EA topics and organisations
This sounds more or less like a trade publication along the lines of Advertising Age or Publishers Weekly, or perhaps a subject-specific publication oriented more toward the general public, like Popular Science or Nautilus. Generally speaking, I think something like the former is a good idea, though trade publications are generally targeted at those working within an industry. I will describe later on why I am not sure the latter is feasible.
But you go on to say:
Other rough comparisons include The Atlantic, The Economist, the New Yorker, Current Affairs, Works in Progress, and Unherd
These publications are very different from each other. The Economist (where, full disclosure, I worked for a short time) is a general interest newspaper with a print circulation of ~1 million. The New Yorker is a highbrow weekly magazine known for its longform journalistic content. The Atlantic is an eclectic monthly that leans heavily on its regular output of short-form, nonreported digital content. Current Affairs is a bimonthly political magazine with an explicitly left-wing cultural and political agenda. Works in Progress is small, completely online, wholly dedicated to progress studies, and generally nonreported.
Unherd is evidently constructed in opposition to various trends and themes in mainstream political and cultural discourse, and its goal is to disrupt the homogeneity of that discourse. I really enjoy it, but I worry that it sometimes typifies the failure mode I’m worried about. Broadly, that failure mode is this: by defining itself in opposition to the dominant way of thinking, an outlet can sort potential readers out of being interested.
Consider: if a media outlet mainly publishes content that conflicts with the modal narrative, then the modal reader encountering it will find mostly content that challenges their views. I think it is a pernicious but nonetheless reliable feature of the media landscape that most readers who stumble onto such a publication will typically stumble off immediately to another, more comfortable one. I worry that a lot of EA is challenging enough that this could happen with something like The Altruist.
This may actually be fine; that’s why I harp on the precision of the comparison classes. I think Works in Progress, for instance, is likely to serve the progress studies community very well in the years to come, and an EA version of that would serve the initial goal you describe (improving resources for outreach) well. But I don’t think that it would do a particularly good job of mitigating reputational risk or increasing community growth, because it would be a niche publication that might find it difficult to earn the trust of readers who find EA ideas challenging (in my experience, this is most people).
So I think as far as new publications go, we may have to pick between the various goals you have helpfully laid out here. But my aspirations for EA in journalism are a bit higher. Here’s my question: what is an EA topic? It is not really obvious to me that there is such a thing. To most people, it is not intuitive, even when you explain, that there is something that ties together (for instance) worrying about AI risk, donating to anti-malaria charities, supporting human challenge trials, and eating vegan.
This is because EA is a way of approaching questions about how to do good in the world, not a collection of answers to those questions.
So my aspiration for journalism in general is not only that it more enthusiastically tackle those issues which this small and idiosyncratic community of people has determined are important. I also think it would be good if journalism in general moved in a more EA-aligned or EA-aware direction on all questions. I think that, counterfactually, the past two decades of journalism in the developed world would look very different if the criterion for newsworthiness were more utilitarian, and if editorial judgments more robustly modeled truth-seeking behavior. Consequently my (weak, working) hypothesis is that the world would be a better place. I also think such a world would be an easier place to grow the community, to combat bad-faith criticism, and to absorb and respond to good-faith critique.
One way to try to make this happen today would be to run a general-interest publication with an editorial position that is openly EA, much as The Economist’s editorial slant is classically liberal. Such a publication would have to cover everything, not just deworming and the lives of people in the far future. But it would, of course, cover those things too.
To bring things back down to the actual topic of conversation: the considerations you have raised here are the right ones. My core concern is that a publication like this will try to do too many things at once, and the reason I’ve written so much above is to try to articulate some additional considerations that I hope will be useful in narrowing down its purpose.
FWIW, I did a quick meta-analysis in Stan of the adjusted 5-year dropout rates in your first table (for those surveys where the sample size is known). The punchline is an estimated true mean cross-study dropout rate of ~23%, with a 90% CI of roughly [5%, 41%]. For good measure, I also fit the data to a beta distribution and came up with a similar result.
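The actual fit was done in Stan on the surveys in the post’s table; for readers who want to see the mechanics, here is a minimal pure-Python sketch of a comparable random-effects (DerSimonian–Laird) pooling of dropout proportions. The survey counts below are hypothetical placeholders, not the real data, so the numbers it prints will differ from the ~23% [5%, 41%] result above.

```python
import math

# Hypothetical 5-year dropout data: (dropouts, respondents) per survey.
# Replace with the actual counts from the post's table to reproduce the fit.
studies = [(12, 60), (25, 90), (8, 50), (30, 110), (15, 45)]

# Logit-transform each proportion with its approximate sampling variance.
effects, variances = [], []
for k, n in studies:
    p = (k + 0.5) / (n + 1.0)  # continuity-corrected proportion
    effects.append(math.log(p / (1 - p)))
    variances.append(1 / (k + 0.5) + 1 / (n - k + 0.5))

# DerSimonian-Laird estimate of between-study variance (tau^2).
w = [1 / v for v in variances]
fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (q - (len(studies) - 1)) / c)

# Random-effects pooled mean and 90% interval, back-transformed to a rate.
w_re = [1 / (v + tau2) for v in variances]
pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
se = math.sqrt(1 / sum(w_re))
inv_logit = lambda x: 1 / (1 + math.exp(-x))
lo, hi = pooled - 1.645 * se, pooled + 1.645 * se
print(f"pooled dropout: {inv_logit(pooled):.1%}, "
      f"90% CI [{inv_logit(lo):.1%}, {inv_logit(hi):.1%}]")
```

A full Bayesian version in Stan would additionally propagate uncertainty in tau itself, which is why its intervals tend to be wider than this approximation’s.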
I struggle with how to interpret these numbers. It’s not clear to me that the community dropout rate is a good proxy for value drift (however it’s defined), as in some sense it is a central hope of the community that its values will become detached from the movement—I think we want more and more people to feel “EA-like”, regardless of whether they’re involved with the community. It’s easy for me to imagine that people who drift out of the movement (and stop answering the survey) maintain broad alignment with EA’s core values. In this sense, the “core EA community” around the Forum, CEA, 80k, etc. is less of a static blob and more of a mechanism for producing people who ask certain questions about the world.
Conversely, value drift within members who are persistently engaged in the community seems to be of real import, and presumably the kind of thing that can only be tracked longitudinally, by matching EA Survey respondents across years.
Thanks for this! It seems like much of the work that went into your CEA could be repurposed for explorations of other potentially growth- or governance-enhancing interventions. Since finding such an intervention would be quite high-value, and since the parameters in your CEA are quite uncertain, it seems like the value of information with respect to clarifying these parameters (and therefore the final ROI distribution) is probably very high.
Do you have a sense of what kind of research or data would help you narrow the uncertainty in the parameter inputs of your cost-effectiveness model?
Thanks for writing this! I take the broader point and I think you provide good reasons to think that international trade deserves more attention as an effective intervention.
I may be missing something, but I’m really not sure what to make of that $200k number. It seems low intuitively, but a little examination makes it seem even stranger. In 2018, about $3.5 billion was spent on lobbying. In the 115th congress, 2017-2019, 443 bills were passed, as in, actually became law. So it seems reasonable to say that about 200 bills became law in 2018. That’s almost twenty million dollars per bill. And that’s in a weird idealized scenario where spending on lobbying gets the bill passed and where all lobbying money is being spent on lobbying-for (not lobbying-against) and where the money is evenly divided across bills.
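The back-of-envelope above can be made explicit; the inputs are the rough figures stated in the paragraph (approximate 2018 lobbying spend, ~200 laws per year inferred from the 115th Congress), not precise data.

```python
# Rough upper-bound idealization from the paragraph above: all lobbying money
# spent in favor of bills, split evenly, and assumed to cause passage.
total_lobbying = 3.5e9   # ~$3.5B spent on US lobbying in 2018
bills_per_year = 200     # ~443 laws over the two-year 115th Congress

cost_per_bill = total_lobbying / bills_per_year
print(f"${cost_per_bill / 1e6:.1f}M per bill")  # $17.5M per bill
```

Even under these generous idealizations, the implied average is nearly two orders of magnitude above the $200k figure, which is the point: $200k may describe the cheap tail of the distribution, not the expectation.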
We have no idea what the distribution of effectiveness looks like, and I totally buy the idea that some bills can be passed with only $200k in lobbying funds, but that would be true at the tails of the distribution, not in expectation.
We (Founders Pledge) do have a significant presence in SF, and are actively trying to grow much faster in the U.S. in 2024.
A couple weakly held takes here, based on my experience:
Although it’s true that issues around effective giving are much more salient in the Bay Area, it’s also the case that effective giving is nearly as much of an uphill battle with SF philanthropists as with others. People do still have pet causes, and there are many particularities about the U.S. philanthropic ecosystem that sometimes push against individuals’ willingness to take the main points of effective giving on board.
Relatedly, growing in SF seems in part to be hard essentially because of competition. There’s a lot of money and philanthropic intent, and a fair number of existing organizations (and philanthropic advisors, etc) that are focused on capturing that money and guiding that philanthropy. So we do face the challenge of getting in front of people, getting enough of their time, etc.
Since FP has historically offered mostly free services to members, growing our network in SF is something we actually need to fundraise for. On the margin I believe it’s worthwhile, given the large number of potentially aligned UHNWs, but it’s the kind of investment (in this case, in Founders Pledge by its funders) that would likely take a couple years to bear fruit in terms of increased amounts of giving to effective charities. I expect this is also a consideration for other existing groups that are thinking about raising money for a Bay Area expansion.
Fair enough. I think one important thing to highlight here is that though the details of our analysis have changed since 2019, the broad strokes haven’t — that is to say, the evidence is largely the same and the transformation used (DALY vs WELLBY), for instance, is not super consequential for the rating.
The situation is one, as you say, of GIGO (though we think the input is not garbage) and the main material question is about the estimated effect size. We rely on HLI’s estimate, the methodology for which is public.
I think your (2) is not totally fair to StrongMinds, given the Ozler RCT. No matter how it turns out, it will have a big impact on our next reevaluation of StrongMinds.
Edit: To be clearer, we shared updated reasoning with GWWC but the 2019 report they link, though deprecated, still includes most of the key considerations for critics, as evidenced by your observations here, which remain relevant. That is, if you were skeptical of the primary evidence on SM, our new evaluation would not cause you to update to the other side of the cost-effectiveness bar (though it might mitigate less consequential concerns about e.g. disability weights).
Thanks for the writeup!
If the recent Bill Gates documentary on Netflix is to be believed, then Gates first became seriously aware of the problem of diarrhea in the developing world thanks to a 1998 column by Nicholas Kristof. It’s hard to assess the counterfactual here (would Gates have encountered the issue in a different context? Would he have taken the steps he ultimately did after reading the Kristof piece?) but it seems plausible that Kristof’s article constitutes a cost-effective intervention in its own right (if a not particularly targeted one).
I bring this up because I’m intrigued by the viral coverage of your clean energy research. It’s not possible to quantify the impact of an article like this in any realistic way, but perhaps we can agree that a plausible distribution of beliefs about its value is close to strictly positive.
Future Perfect being what it is, it’s obviously the case that Vox constitutes an unusually receptive channel for EA-adjacent research. But I’m curious whether you consider the wide propagation of your research in the news media a “risky and very effective” project, and whether your research products have been intentionally structured toward this end. If you have takeaways from your big success so far, it could be very helpful to post them here: widely taken-up tweaks that make research propagate more effectively through the media are marginal improvements with potentially very high value.
This is a great post and I, like @rohinmshah, feel that simply the introduction of this general class of discussion is of value to the community.
With respect to expert surveys, I am somewhat surprised that there isn’t someone in the EA community already pursuing this avenue in earnest. I think that it’s firmly within the wheelhouse of the community’s larger knowledge-building project to conduct something like the IGM experts panel across a variety of fields. I think, first, that this sort of thing is direly needed in the world at large and could have considerable direct positive effects, but secondly that it could have a number of virtues for the EA community:
• Improve efficiency of additional research: Knowing what the expert consensus is on a given topic will save some nontrivial percentage of time when starting a literature review, and help researchers contextualize papers that they find over the course of the review. Expert consensus is a good starting place for a lit review, and surveys will save time and reduce uncertainty in that phase.
• Let EAs know where we stand relative to the expert consensus: when we explore topics like growth as a cause area, we need to be able to (1) have a quick reference to the expert consensus at vital pivots in a conversation (e.g. do structural adjustments work?) and (2) identify with certainty where EA views might depart from the consensus.
• Provide a basis for argument to policymakers and philanthropists: Appeals to authority are powerful persuasive mechanisms outside the EA community. Being able to fall back on expert consensus in any range of issues can be a powerful obstacle or motivator, depending on the issue. Here’s an example: governments around the world continue to locally relitigate conversations about the degree to which electronic voting is safe, desirable, secure or feasible. Security researchers have a pretty solid consensus on these questions—that consensus should be available to these governments and those of us who seek to influence them.
• Demonstrate to those outside the community that EAs are directly linked to the mainstream research community: This is a legitimacy issue: regardless of whether the EA community ends up being broader or narrower, we are often insisting to some degree on a new way of doing things: we need to be able to demonstrate to newcomers and outsiders that we are not simply starting from scratch.
• Establish continued relationships with experts across a variety of fields: Repeated deployment of these expert surveys affords opportunities for contact with experts who can be integrated into projects, sought for advice, or deployed (in the best case scenario) as voices on behalf of sensible policies or interventions.
• Identify funding opportunities for further research or for novel epistemic avenues like the adversarial collaborations mentioned in the initial post: Expert surveys will reveal areas where there is no consensus. Although consensus can be and sometimes is wrong, areas where there is considerable disagreement seem like obvious avenues for further exploration. Where issues have a direct bearing on human wellbeing, uncovering a relative lack of conclusive research seems like a cause area in and of itself.
Finally, the question-finding and -constructing process is itself an important activity that requires expert input. Identifying the key questions to ask experts is itself very important research, and can result in constructive engagements with experts and others.
As promised, I am returning here with some more detail. I will break this (very long) comment into sections for the sake of clarity.
My overview of this discussion
It seems clear to me that what is going on here is that there are conflicting interpretations of the evidence on StrongMinds’ effectiveness. In particular, the key question here is what our estimate of the effect size of SM’s programs should be. There are other uncertainties and disagreements, but in my view, this is the essential crux of the conversation. I will give my own (personal) interpretation below, but I cannot stress enough that the vast majority of the relevant evidence is public—compiled very nicely in HLI’s report—and that neither FP’s nor GWWC’s recommendation hinges on “secret” information. As I indicate below, there are some materials that can’t be made public, but they are simply not critical elements of the evaluation, just quotes from private communications and things of that nature.
We are all looking at more or less the same evidence and coming to different conclusions.
I also think there is an important subtext to this conversation, which is the idea that both GWWC and FP should not recommend things for which we can’t achieve bednet-level confidence. We simply don’t agree, and accordingly this is not FP’s approach to charity evaluation. As I indicated in my original comment, we are risk-neutral and evaluate charities on the basis of expected cost-effectiveness. I think GiveWell is about as good as an organization can be at doing what GiveWell does, and for donors who prioritize their giving conditional on high levels of confidence, I will always recommend GiveWell top charities over others, irrespective of expected value calculations. It bears repeating that even with this orientation, we still think GiveWell charities are around twice as cost-effective as StrongMinds. I think Founders Pledge is in a substantially different position, and from the standpoint of doing the most possible good in the world, I am confident that risk-neutrality is the right position for us.
We will provide our recommendations, along with any shareable information we have to support them, to anyone who asks. I am not sure what the right way for GWWC to present them is.
How this conversation will and won’t affect FP’s position
What we won’t do is take immediate steps (like, this week) to modify our recommendation or our cost-effectiveness analysis of StrongMinds. My approach to managing FP’s research is to try to thoughtfully build processes that maximize the good we do over the long term. This is not a procedure fetish; it is a commonsensical way to ensure that we prioritize our time well and give important questions the resources and systematic thought they deserve.
What we will do is incorporate some important takeaways from this conversation during StrongMinds’ next re-evaluation, which will likely happen in the coming months. To my eye, the most important takeaway is that our rating of StrongMinds may not sufficiently account for uncertainty around effect size. Incorporating this uncertainty would deflate SM’s rating and may bring it much closer to our bar of 1x GiveDirectly.
More generally, I do agree with the meta-point that our evaluations should be public. We are slowly but surely moving in this direction over time, though resource constraints make it a slow process.
FP’s materials on StrongMinds
A copy of our CEA. I’m afraid this may not be very elucidating, as essentially all we did here is take HLI’s estimates and put them into a format that works better with our ratings system. One note is that we don’t apply any subjective discounts in this CEA—this is the kind of thing I expect might change in future.
Some exploration I did in R and Stan to try to test various components of the analysis. In particular, this contains several attempts to use SM’s pre-post data (corrected for a hypothesized counterfactual) to update on several different more general priors. Of particular interest are this review from which I took a prior on psychosocial interventions in LMICs and this one which offers a much more outside view-y prior.
Crucially, I really don’t think this type of explicit Bayesian update is the right way to estimate effects here; I much prefer HLI’s way of estimating effects (it leaves a lot less data on the table).
The main goal of this admittedly informal analysis was to test under what alternate analytic conditions our estimate of SM’s effectiveness would fall below our recommendation bar.
We have an internal evaluation template that I have not shared, since it contains quotes from private communications with StrongMinds. There’s nothing mysterious or particularly informative here; we just don’t share details of private communications that weren’t conducted with the explicit expectation that they’d be shared. This is the type of template that in future we hope to post publicly with privileged communications excised.
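For concreteness, the kind of explicit Bayesian update I experimented with in the R/Stan exploration can be sketched as a conjugate normal–normal update. All numbers here are hypothetical illustrations, not the actual priors from the cited reviews or SM’s data:

```python
# Conjugate normal-normal update: combine a prior on effect size (in SDs)
# with a noisy, bias-prone pre-post estimate. Numbers are illustrative only.
prior_mu, prior_sd = 0.5, 0.3  # e.g. prior from psychosocial interventions in LMICs
obs_mu, obs_sd = 1.7, 0.6      # e.g. counterfactual-corrected pre-post estimate

prior_prec, obs_prec = 1 / prior_sd**2, 1 / obs_sd**2

# Posterior mean is the precision-weighted average of prior and observation.
post_mu = (prior_prec * prior_mu + obs_prec * obs_mu) / (prior_prec + obs_prec)
post_sd = (prior_prec + obs_prec) ** -0.5

print(f"posterior effect: {post_mu:.2f} SD (sd {post_sd:.2f})")
```

The shrinkage behavior is the point: a very large observed effect gets pulled substantially toward the prior, which is the mechanism by which a skeptical prior deflates SM’s apparent effect size. As noted above, though, I don’t think this is the right estimation strategy here; it discards much of the data that HLI’s approach uses.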
How I view the evidence about StrongMinds
Our task as charity evaluators is, to the extent possible, to quantify the important considerations in estimating a charity’s impact. When I reviewed HLI’s work on StrongMinds, I was very satisfied that they had accounted for many different sources of uncertainty. I am still pretty satisfied, though I am now somewhat more uncertain myself.
A running theme in critiques of StrongMinds is that the effects they report are unbelievably large. I agree that they are very large. I don’t agree that the existence of large-seeming effects is itself a knockdown argument against recommending this charity. It is, rather, a piece of evidence that we should consider alongside many other pieces of evidence.
I want to oversimplify a bit by distinguishing between two different views of how SM could end up reporting very large effect sizes.
1. The reported effects are essentially made-up. The intervention has no effect at all, and the illusion of an effect is driven by fraud at worst and severe confirmation bias at best.
2. The reported effects are severely inflated by selection bias, social desirability bias, and other similar factors.
I am very satisfied that (1) is not the case here. There are two reasons for this. First, the intervention is well-supported by a fair amount of external evidence. This program is not “out of nowhere”; there are good reasons to believe it has some (possibly small) effect. Second, though StrongMinds’ recent data collection practices have been wanting, they have shown a willingness to be evaluated (the existence of the Ozler RCT is a key data point here). With FP, StrongMinds were extremely responsive to questions and forthcoming and transparent with their answers.
Now, I think (2) is very likely to be the case. At FP, we increasingly try to account for this uncertainty in our CEAs. As you’ll note in the link above, we didn’t do that in our last review of StrongMinds, which yielded a rating of roughly 5-6x GiveDirectly (per our moral weights, we value a WELLBY at about $160). So the question here is: how much of the observed effect is due to bias? If it’s 80%, we should deflate our rating to 1.2x at StrongMinds’ next review. In this scenario it would still clear our bar (though only just).
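To make the arithmetic concrete, here’s a minimal sketch of the bias-deflation calculation (the numbers are the hypothetical ones from this comment, not figures from FP’s actual CEA):

```python
def deflated_multiple(raw_multiple: float, bias_share: float) -> float:
    """Cost-effectiveness multiple (vs. GiveDirectly) after assuming
    `bias_share` of the observed effect is attributable to bias."""
    return raw_multiple * (1.0 - bias_share)

raw = 6.0  # ~6x GiveDirectly, per our last review
bar = 1.0  # FP's bar: GiveDirectly itself

for bias in (0.0, 0.5, 0.8):
    m = deflated_multiple(raw, bias)
    print(f"bias={bias:.0%} -> {m:.1f}x GD, clears bar: {m > bar}")
# an 80% bias share leaves ~1.2x GD, which still (just) clears the bar
```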
In the absence of prior evidence about IPT-g, I think we might well conclude that the observed effects are overwhelmingly due to bias. But I don’t think this is a Pascal’s Mugging-type scenario. We are not seeing a very large, possibly dubious effect that remains large in expectation even after deflating for dubiousness. We are seeing a large effect that is very broadly in line with the kind of effect we should expect on priors.
What I expect for the future
In my internal forecast attached to our last evaluation, I gave an 80% probability to us finding that SM would have an effectiveness of between 5.5x and 7x GD at its next evaluation. I would lower this significantly, to something like 40%, and overall I would say that I think there’s a 70-80% chance we’ll still be recommending SM after its next re-evaluation.
I guess I would very slightly adjust my sense of HLI, but I wouldn’t really think of this as an “error.” I don’t significantly adjust my view of GiveWell when they delist a charity based on new information.
I think if the RCT downgrades StrongMinds’ work by a big factor, that won’t really introduce new information about HLI’s methodology/expertise. If you think there are methodological weaknesses that would cause them to overstate StrongMinds’ impact, those weaknesses should be visible now, irrespective of the RCT results.
I have already installed this and started using it at Founders Pledge. Thanks for making this! I’ve been wanting something like this for a long time.
Some feature requests:
Aggregation choices (e.g. geo mean of odds would be nice)
Brier scores for users
Calibration curves for users
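In case it’s useful, here’s a rough sketch of what I mean by the first two requests (the function names are mine, not the tool’s):

```python
import math

def geo_mean_of_odds(probs):
    """Aggregate forecasts by taking the geometric mean of their odds."""
    odds = [p / (1 - p) for p in probs]
    agg = math.prod(odds) ** (1 / len(odds))
    return agg / (1 + agg)

def brier_score(forecasts, outcomes):
    """Mean squared error between forecasts and binary (0/1) outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# e.g. pooling three forecasters on one question:
pooled = geo_mean_of_odds([0.6, 0.7, 0.9])
```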
Hi Simon, thanks for writing this! I’m research director at FP, and have a few bulleted comments in response, but overall just want to indicate that this post is very valuable. I’m also commenting on my phone and don’t have access to my computer at the moment, but can participate in this conversation more energetically (and provide more detail) when I’m back at work next week.
I basically agree with what I take to be your topline finding here, which is that more data is needed before we can arrive at GiveWell-tier levels of confidence about StrongMinds. I agree that a lack of recent follow-ups is problematic from an evaluator’s standpoint and look forward to updated data.
FP doesn’t generally strive for GW-tier levels of confidence; we’re risk-neutral, and our general procedure is to estimate expected cost-effectiveness inclusive of deflators for various kinds of subjective considerations, like social desirability bias.
The 2019 report you link (and the associated CEA) is deprecated; FP hasn’t been resourced to update public-facing materials, a situation that is now changing. But the proviso at the top of the page is accurate: we stand by our recommendation.
This is because we re-evaluated StrongMinds last year based on HLI’s research. The new evaluation did things a little differently from the old one. Instead of attempting to linearize the relationship between PHQ-9 score reductions and disability weights, we converted the estimated treatment effect into WELLBY-SDs by program and area of operation, an elaboration made possible by HLI’s careful work and using their estimated effect sizes. I reviewed their methodology and was ultimately very satisfied, though I expect you may want to dig into this in more detail. This approach also ultimately allows a direct comparison with cash transfers.
I pressure-tested this a few ways: by comparing with the linearized-DALY approach from our first evaluation, by treating the pre-post data as evidence in a Bayesian update on conservative priors, and by modeling the observed shifts under the assumption of an exponential distribution of PHQ-9 scores (which appears in some literature on the topic). All these methods made the intervention look pretty cost-effective.
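For readers unfamiliar with the second of these checks, a “Bayesian update on conservative priors” has roughly this shape (a normal-normal conjugate sketch with placeholder numbers, not the actual figures from our evaluation):

```python
def normal_update(prior_mean, prior_sd, obs_mean, obs_sd):
    """Posterior mean and sd for a normal prior combined with a
    normally distributed effect-size estimate (conjugate update)."""
    prior_prec = 1 / prior_sd**2  # precision = 1 / variance
    obs_prec = 1 / obs_sd**2
    post_var = 1 / (prior_prec + obs_prec)
    post_mean = post_var * (prior_prec * prior_mean + obs_prec * obs_mean)
    return post_mean, post_var**0.5

# A sceptical prior (small effect) meets a large but noisy pre-post estimate;
# the posterior lands between the two, pulled toward the more precise one.
post_mean, post_sd = normal_update(prior_mean=0.2, prior_sd=0.3,
                                   obs_mean=1.0, obs_sd=0.5)
```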
Crucially, FP’s bar for recommendation to members is GiveDirectly, and we estimate StrongMinds at roughly 6x GD. So even though, by our (risk-neutral) lights, StrongMinds is not competitive with GiveWell top charities, it clears our bar comfortably.
A key piece of evidence is the 2002 RCT on which the program is based. I do think a lot—perhaps too much—hinges on this RCT. Ultimately, however, I think it is the most relevant way to develop a prior on the effectiveness of IPT-g in the appropriate context, which is why a Bayesian update on the pre-post data seems so optimistic: the observed effects are in line with the effect size from the RCT.
The effect sizes observed are very large, but it’s important to place them in the context of StrongMinds’ work with severely traumatized populations. Incoming PHQ-9 scores are very, very high, so 1) it’s reasonable to expect some reversion to the mean in control groups, which we do see, and 2) I’m not sure that our general priors about the low effectiveness of therapeutic interventions are well-calibrated here.
Overall, I agree that social desirability bias is a very cruxy issue here, and it could bear on our rating of StrongMinds. But I think even VERY conservative assumptions about the role of this bias in the effect estimate would still leave SM clearing our bar in expectation.
Just a note on FP and our public-facing research: we prioritize our research resources primarily for making recommendations to members, but at the same time we’re trying to do our best to provide a public good to the EA community and the broader public. I think we’re still not sure how to do this, and definitely not sure how to resource it. We are working on this.