Currently Research Director at Founders Pledge, but posts and comments represent my own opinions, not FP’s, unless otherwise noted.
I worked previously as a data scientist and as a journalist.
Thanks, bruce — this is a great point. I’m not sure if we would account for the costs in the exact way I think you have done here, but we will definitely include this consideration in our calculation.
I haven’t thought extensively about what kind of effect size I’d expect, but I think I’m roughly 65-70% confident that the RCT will return evidence of a detectable effect.
But my uncertainty is more about what rating we’d arrive at upon re-evaluating the whole thing. Since I reviewed SM last year, we’ve started to be a lot more punctilious about incorporating various discounts and forecasts into CEAs. So on the one hand I’d naturally expect us to apply more of those discounts on reviewing this case, but on the other hand my original reason for not discounting HLI’s effect size estimates was my sense that their meta-analytic weightings appropriately accounted for a lot of the concerns that we’d discount for. This generates uncertainty that I expect we can resolve once we dig in.
As promised, I am returning here with some more detail. I will break this (very long) comment into sections for the sake of clarity.
My overview of this discussion
It seems clear to me that what is going on here is that there are conflicting interpretations of the evidence on StrongMinds’ effectiveness. In particular, the key question here is what our estimate of the effect size of SM’s programs should be. There are other uncertainties and disagreements, but in my view, this is the essential crux of the conversation. I will give my own (personal) interpretation below, but I cannot stress enough that the vast majority of the relevant evidence is public—compiled very nicely in HLI’s report—and that neither FP’s nor GWWC’s recommendation hinges on “secret” information. As I indicate below, there are some materials that can’t be made public, but they are simply not critical elements of the evaluation, just quotes from private communications and things of that nature.
We are all looking at more or less the same evidence and coming to different conclusions.
I also think there is an important subtext to this conversation, which is the idea that both GWWC and FP should not recommend things for which we can’t achieve bednet-level confidence. We simply don’t agree, and accordingly this is not FP’s approach to charity evaluation. As I indicated in my original comment, we are risk-neutral and evaluate charities on the basis of expected cost-effectiveness. I think GiveWell is about as good as an organization can be at doing what GiveWell does, and for donors who prioritize their giving conditional on high levels of confidence, I will always recommend GiveWell top charities over others, irrespective of expected value calculations. It bears repeating that even with this orientation, we still think GiveWell charities are around twice as cost-effective as StrongMinds. I think Founders Pledge is in a substantially different position, and from the standpoint of doing the most possible good in the world, I am confident that risk-neutrality is the right position for us.
We will provide our recommendations, along with any shareable information we have to support them, to anyone who asks. I am not sure what the right way for GWWC to present them is.
How this conversation will and won’t affect FP’s position
What we won’t do is take immediate steps (like, this week) to modify our recommendation or our cost-effectiveness analysis of StrongMinds. My approach to managing FP’s research is to try to thoughtfully build processes that maximize the good we do over the long term. This is not a procedure fetish; it is a commonsensical way to ensure that we prioritize our time well and give important questions the resources and systematic thought they deserve.
What we will do is incorporate some important takeaways from this conversation during StrongMinds’ next re-evaluation, which will likely happen in the coming months. To my eye, the most important takeaway is that our rating of StrongMinds may not sufficiently account for uncertainty around effect size. Incorporating this uncertainty would deflate SM’s rating and may bring it much closer to our bar of 1x GiveDirectly.
More generally, I do agree with the meta-point that our evaluations should be public. We are slowly but surely moving in this direction over time, though resource constraints make it a slow process.
FP’s materials on StrongMinds
A copy of our CEA. I’m afraid this may not be very elucidating, as essentially all we did here was take HLI’s estimates and put them into a format that works better with our ratings system. One note is that we don’t apply any subjective discounts in this CEA—this is the kind of thing I expect might change in future.
Some exploration I did in R and Stan to try to test various components of the analysis. In particular, this contains several attempts to use SM’s pre-post data (corrected for a hypothesized counterfactual) to update on several different more general priors. Of particular interest are this review from which I took a prior on psychosocial interventions in LMICs and this one which offers a much more outside view-y prior.
Crucially, I really don’t think this type of explicit Bayesian update is the right way to estimate effects here; I much prefer HLI’s way of estimating effects (it leaves a lot less data on the table).
The main goal of this admittedly informal analysis was to test under what alternate analytic conditions our estimate of SM’s effectiveness would fall below our recommendation bar.
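To give a flavor of what that exploration looked like, here is a minimal sketch of an explicit normal-normal update of the sort described above. Every number below is an illustrative placeholder rather than a figure from those reviews or from SM’s data, and the actual R/Stan work was considerably more involved:

```r
# Minimal sketch of an explicit Bayesian (normal-normal) update on an effect size.
# All numbers are illustrative placeholders, not values from the actual analysis.

prior_mean <- 0.3   # e.g. a prior on psychosocial interventions in LMICs (in SD units)
prior_sd   <- 0.2

obs_effect <- 1.0   # pre-post effect after subtracting a hypothesized counterfactual trend
obs_se     <- 0.25  # assumed standard error of that estimate

# Precision-weighted combination of prior and observation
post_precision <- 1 / prior_sd^2 + 1 / obs_se^2
post_mean <- (prior_mean / prior_sd^2 + obs_effect / obs_se^2) / post_precision
post_sd   <- sqrt(1 / post_precision)

c(posterior_mean = post_mean, posterior_sd = post_sd)
```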
We have an internal evaluation template that I have not shared, since it contains quotes from private communications with StrongMinds. There’s nothing mysterious or particularly informative here; we just don’t share details of private communications that weren’t conducted with the explicit expectation that they’d be shared. This is the type of template that in future we hope to post publicly with privileged communications excised.
How I view the evidence about StrongMinds
Our task as charity evaluators is, to the extent possible, to quantify the important considerations in estimating a charity’s impact. When I reviewed HLI’s work on StrongMinds, I was very satisfied that they had accounted for many different sources of uncertainty. I am still pretty satisfied, though I am now somewhat more uncertain myself.
A running theme in critiques of StrongMinds is that the effects they report are unbelievably large. I agree that they are very large. I don’t agree that the existence of large-seeming effects is itself a knockdown argument against recommending this charity. It is, rather, a piece of evidence that we should consider alongside many other pieces of evidence.
I want to oversimplify a bit by distinguishing between two different views of how SM could end up reporting very large effect sizes.
1. The reported effects are essentially made-up. The intervention has no effect at all, and the illusion of an effect is driven by fraud at worst and severe confirmation bias at best.
2. The reported effects are severely inflated by selection bias, social desirability bias, and other similar factors.
I am very satisfied that (1) is not the case here. There are two reasons for this. First, the intervention is well-supported by a fair amount of external evidence. This program is not “out of nowhere”; there are good reasons to believe it has some (possibly small) effect. Second, though StrongMinds’ recent data collection practices have been wanting, they have shown a willingness to be evaluated (the existence of the Ozler RCT is a key data point here). In their interactions with FP, StrongMinds were extremely responsive to questions, and forthcoming and transparent in their answers.
Now, I think (2) is very likely to be the case. At FP, we increasingly try to account for this uncertainty in our CEAs. As you’ll note in the link above, we didn’t do that in our last review of StrongMinds, which yielded a rating of roughly 5-6x GiveDirectly (per our moral weights, we value a WELLBY at about $160). So the question here is: how much of the observed effect is due to bias? If it’s 80%, we should deflate our rating to about 1.2x at StrongMinds’ next review. In this scenario it would still clear our bar (though only just).
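For concreteness, the deflation arithmetic is just proportional, with the 80% figure serving purely as a hypothetical:

```r
# Illustrative arithmetic only; the 80% bias share is a hypothetical, not an estimate.
current_rating <- 6      # roughly 5-6x GiveDirectly per our last review
bias_share     <- 0.8    # hypothetical share of the observed effect attributable to bias
current_rating * (1 - bias_share)   # = 1.2, i.e. ~1.2x GiveDirectly, just above our 1x bar
```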
In the absence of prior evidence about IPT-g, I think we would likely conclude that the observed effects are overwhelmingly due to bias. But I don’t think this is a Pascal’s Mugging-type scenario. We are not seeing a very large, possibly dubious effect that remains large in expectation even after deflating for dubiousness. We are seeing a large effect that is very broadly in line with the kind of effect we should expect on priors.
What I expect for the future
In my internal forecast attached to our last evaluation, I gave an 80% probability to us finding that SM would have an effectiveness of between 5.5x and 7x GD at its next evaluation. I would now lower this significantly, to something like 40%, and overall I’d say there’s a 70-80% chance we’ll still be recommending SM after its next re-evaluation.
Hey Simon, I remain slightly confused about this element of the conversation. I take you to mean that, since we base our assessment mostly on HLI’s work, and since we draw different conclusions from HLI’s work than you think are reasonable, we should reassess StrongMinds on that basis. Is that right?
If so, I do look forward to your thoughts on the HLI analysis, but in the meantime I’d be curious to get a sense of your personal levels of confidence here — what does a distribution of your beliefs over cost-effectiveness for StrongMinds look like?
Fair enough. I think one important thing to highlight here is that though the details of our analysis have changed since 2019, the broad strokes haven’t — that is to say, the evidence is largely the same and the transformation used (DALY vs WELLBY), for instance, is not super consequential for the rating.
The situation is one, as you say, of GIGO (though we think the input is not garbage) and the main material question is about the estimated effect size. We rely on HLI’s estimate, the methodology for which is public.
I think your (2) is not totally fair to StrongMinds, given the Ozler RCT. No matter how it turns out, it will have a big impact on our next reevaluation of StrongMinds.
Edit: To be clearer, we shared our updated reasoning with GWWC, but the 2019 report they link, though deprecated, still includes most of the key considerations for critics, as your observations here (which remain relevant) show. That is, if you were skeptical of the primary evidence on SM, our new evaluation would not cause you to update to the other side of the cost-effectiveness bar (though it might mitigate less consequential concerns about e.g. disability weights).
“I think my main takeaway is my first one here. GWWC shouldn’t be using your recommendations to label things top charities. Would you disagree with that?”
Yes, I think so; I’m not sure why this should be the case. Different evaluators have different standards of evidence, and GWWC is using ours for this particular recommendation. They reviewed our reasoning and (I gather) were satisfied. As someone else said in the comments, the right reference class here is probably deworming: “big if true.”
The message on the report says that some details have changed, but that our overall view is represented. That’s accurate, though there are some details that are more out of date than others. We don’t want to just remove old research, but I’m open to the idea that this warning should be more descriptive.
I’ll have to wait until next week to address more substantive questions, but it seems to me that the recommend/don’t recommend question is most cruxy here.
EDIT:
On reflection, it also seems cruxy that our current evaluation isn’t yet public. This seems very fair to me, and I’d be very curious to hear GWWC’s take. We would like to make all evaluation materials public eventually, but this is not as simple as it might seem and especially hard given our orientation toward member giving.
Though this type of interaction is not ideal for me, it seems better for the community. If our recs can’t be totally public, I’d rather they be semi-public and subject to critique than totally private.
Hi Simon, thanks for writing this! I’m research director at FP, and have a few bullets to comment here in response, but overall just want to indicate that this post is very valuable. I’m also commenting on my phone and don’t have access to my computer at the moment, but can participate in this conversation more energetically (and provide more detail) when I’m back at work next week.
I basically agree with what I take to be your topline finding here, which is that more data is needed before we can arrive at GiveWell-tier levels of confidence about StrongMinds. I agree that a lack of recent follow-ups is problematic from an evaluator’s standpoint and look forward to updated data.
FP doesn’t generally strive for GW-tier levels of confidence; we’re risk-neutral and our general procedure is to estimate expected cost-effectiveness inclusive of deflators for various kinds of subjective consideration, like social desirability bias.
The 2019 report you link (and the associated CEA) is deprecated; FP hasn’t been resourced to update public-facing materials, a situation that is now changing. But the proviso at the top of the page is accurate: we stand by our recommendation.
This is because we re-evaluated StrongMinds last year based on HLI’s research. The new evaluation did things a little differently from the old one. Instead of attempting to linearize the relationship between PHQ-9 score reductions and disability weights, we converted the estimated treatment effect into WELLBY-SDs by program and area of operation, an elaboration made possible by HLI’s careful work, using their estimated effect sizes. I reviewed their methodology and was ultimately very satisfied, though I expect you may want to dig into this in more detail. This approach also allows a direct comparison with cash transfers.
I pressure-tested this a few ways: by comparing with the linearized-DALY approach from our first evaluation, by treating the pre-post data as evidence in a Bayesian update on conservative priors, and by modeling the observed shifts under the assumption of an exponential distribution of PHQ-9 scores (which appears in some literature on the topic). All of these methods made the intervention look pretty cost-effective.
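As a rough indication of how the last of those checks could look (this is a stripped-down sketch, not FP’s actual model, and all numbers are placeholders):

```r
# Sketch: assume PHQ-9 scores are roughly exponentially distributed and ask what an
# observed drop in the mean score implies for the share of participants above a
# clinical cutoff. Threshold and means below are placeholders.

threshold <- 10   # common PHQ-9 cutoff for moderate depression
mean_pre  <- 15   # hypothetical mean score at intake
mean_post <- 7    # hypothetical mean score at follow-up

share_above <- function(mean_score) {
  pexp(threshold, rate = 1 / mean_score, lower.tail = FALSE)  # P(score > threshold)
}

c(pre = share_above(mean_pre), post = share_above(mean_post))
```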
Crucially, FP’s bar for recommendation to members is GiveDirectly, and we estimate StrongMinds at roughly 6x GD, so it clears that bar comfortably. Even by our (risk-neutral) lights, though, StrongMinds is not competitive with GiveWell top charities.
A key piece of evidence is the 2002 RCT on which the program is based. I do think a lot—perhaps too much—hinges on this RCT. Ultimately, however, I think it is the most relevant way to develop a prior on the effectiveness of IPT-g in the appropriate context, which is why a Bayesian update on the pre-post data seems so optimistic: the observed effects are in line with the effect size from the RCT.
The effect sizes observed are very large, but it’s important to place them in the context of StrongMinds’ work with severely traumatized populations. Incoming PHQ-9 scores are very, very high, so (1) I think it’s reasonable to expect some reversion to the mean in control groups, which we do see, and (2) I’m not sure that our general priors about the low effectiveness of therapeutic interventions are likely to be well-calibrated here.
Overall, I agree that social desirability bias is a very cruxy issue here, and could bear on our rating of StrongMinds. But I think even VERY conservative assumptions about the role of this bias in the effect estimate would cause SM still to clear our bar in expectation.
Just a note on FP and our public-facing research: we’re in the position of prioritizing our research resources primarily to make recommendations to members, but at the same time we’re trying to do our best to provide a public good for the EA community and the wider public. I think we’re still not sure how to do this, and definitely not sure how to resource for it. We are working on this.
Hey Nick, thanks for this very valuable experience-informed comment. I’m curious what you make of the original 2002 RCT that first tested IPT-G in Uganda. When we (at Founders Pledge) looked at StrongMinds (which we currently recommend, in large part on the back of HLI’s research), I was surprised to see that the results from the original RCT lined up closely with the pre/post scores reported by recent program participants.
Would your take on this result be that participants in the treated group were still basically giving what they saw as socially desirable answers, irrespective of the efficacy of the intervention? It’s true that the control arm in the 2002 RCT did not receive a comparable placebo treatment, so that does seem a reasonable criticism. But if the social desirability bias is so strong as to account for the massive effect size reported in the 2002 paper, I’d expect it to appear in the NBER paper you cite, which also featured a pure control group. But that paper seems to find no effect of psychotherapy alone.
I agreed with your comment (I found it convincing) but downvoted it because, if I were a first-time poster here, I would be much less likely to post again after having my first post characterized as foolish.
As one of many “naive functionalists”, I found the OP very valuable as a challenge to my thinking, and so I want to come down strongly against discouraging such posts in any way.
These seem like broadly reasonable heuristics, but they kick the can on who is an expert, which is where most of the challenge in deference lies.
The canonical (recent) example of this is COVID, when doctors and epidemiologists, who were perceived by the general public as the relevant experts, weighed in on questions of public policy, in many cases giving the impression of consensus in their communities. I think there is a good argument to be made that public policy “experts” were in fact better placed to give recommendations on many of these issues. Regardless, it wasn’t at all clear at the time, at least immediately, where the relevant expertise lay.
You might say that this is a problem only for the relatively uninformed, but it seems like it’s an open issue in lots of domains. Depending on how you characterize the expert population, it seems reasonable to me to presume that the proportion of people who believe AI risk is worth working on could be sub-Lizardman. You might say that this assumes too broad a definition of experts—but this begs the question. Field boundaries are porous, so the question of whether a field is “sound” is itself ill-defined.
(I am research director at FP)
Thanks for all of your work on this analysis, Vasco. We appreciate your thoroughness and your willingness to engage with us beforehand. The work is obviously methodologically sound and, as Johannes indicated, we generally agree that climate is not among the top bets for reducing existential risk.
I think that “mitigating existential risk as cost-effectively as possible” is entailed by the goal of doing as much good as possible in the world, which is why FP exists. To be absolutely clear, FP’s goal is to do the maximum possible amount of good, and to do so in a cause-neutral way.
A common misconception about our research agenda is that it is driven by the interests of our members. This is most assuredly not the case. To some degree, member-driven research was a component of previous iterations of the research team, and our movement away from this is indeed a relatively recent change. There remain some exceptions, but as a general rule we do not devote research resources to any cause area or charity investigation unless we have a good reason to suspect it might be genuinely valuable from a strictly cause-neutral standpoint.
Still, FP does operate under some constraints, one of which is that many of our 1700 members are not cause-neutral. This is by design. We facilitate our members’ charitable giving to all (legal and feasible) grantees in hopes that we can influence some portion of this money toward highly effective ends. This works. Since our members are often not EAs, such giving is strictly counterfactual: in the absence of FP’s recommendations, it simply would not have been given to effective charities.
Climate plays two roles in a portfolio that is constrained in this way. First, it introduces members who are not cause-neutral to our way of thinking about problems and solutions, which builds credibility and opens the door to further education on cause areas that might not immediately resonate with them (e.g. AI risk). This also works. Second, it reallocates non-cause-neutral funds to the most effective opportunities within a cause area in which the vast majority of philanthropic funds are, unfortunately, misspent. As I have tried to work out in my Shortform, this reallocation can be cost-effective under certain conditions even within otherwise unpromising cause areas (of which climate is not one).
Finally, I do want to emphasize that the Climate Fund does not serve a strictly instrumental role. We genuinely think that the climate grants we make and recommend are a comparatively cost-effective way to improve the value of the long-term future, though not the most cost-effective way. I don’t see any particular tension in that: every EA charity evaluator (or grantmaker) recommends (or grants to) options across a wide range of cost-effectiveness. From our perspective, the Climate Fund is better than most things, but not as good as the best things.
Do you have any plans for interoperability with other PPLs or languages for statistical computing? It would be pretty useful to be able to, e.g., write a model in Squiggle and port it easily to R or to PyMC3, particularly if Bayesian updating is not currently supported in Squiggle. I can easily imagine a workflow where we use Squiggle to develop a prior, which we’d then want to update using microdata in, say, Stan (via R).
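For what it’s worth, the kind of workflow I have in mind might look roughly like the sketch below, where the prior parameters stand in for whatever the upstream Squiggle model produced. Everything here is illustrative (made-up microdata, arbitrary priors), and it assumes the rstan package:

```r
# Hypothetical workflow: summarize a prior developed elsewhere (e.g. in Squiggle) as a
# normal(prior_mean, prior_sd), then update it on microdata in Stan via R.
library(rstan)

stan_code <- "
data {
  int<lower=1> N;
  vector[N] y;            // microdata, e.g. individual-level effect observations
  real prior_mean;        // prior summary exported from the upstream model
  real<lower=0> prior_sd;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(prior_mean, prior_sd);
  sigma ~ normal(0, 1);   // half-normal via the lower bound
  y ~ normal(mu, sigma);
}
"

fit <- stan(model_code = stan_code,
            data = list(N = 20, y = rnorm(20, 0.4, 0.3),  # made-up microdata
                        prior_mean = 0.3, prior_sd = 0.2),
            chains = 2, iter = 1000)
print(fit, pars = "mu")
```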
I very strongly downvoted this comment because I think that personal attacks of any sort have a disproportionately negative impact on the quality of discussion overall, and because responding to a commenter’s identity or background instead of the content of their comment is a bad norm.
Founders Pledge is hiring an Applied Researcher to work with our climate lead evaluating funding opportunities, finding new areas to research within climate, evaluating different theories of change, and granting from FP’s Climate Fund.
We’re open to multiple levels of seniority, from junior researchers all the way up to experienced climate grantmakers. Experience in climate and a familiarity with energy systems are a big plus, but not 100% necessary.
Our job listing is here. Please note that the first round consists of a resume screen and a preliminary task. If the task looks doable to you, I strongly encourage you to complete and submit it along with your resume, no matter what doubts you may have about the applicability of your past experience.
Feel free to message me here with any questions.
Something I’ve considered making myself is a Slackbot for group decision-making: forecasting, quadratic voting, etc. This seems like it would be very useful for lots of organizations and quite a low lift. It’s not the kind of thing that seems easily monetizable at first, but it seems reasonable to expect that, if it proves valuable, it could be the kind of thing that people would eventually have to buy “seats” for in larger organizations.
I appreciate your taking the time to write out this idea and the careful thought that went into your post. I liked that it was kind of in the form of a pitch, in keeping with your journalistic theme. I agree that EAs should be thinking more seriously about journalism (in the broadest possible sense) and I think that this is as good a place as any to start. I want to (a) nitpick a few things in your post with an eye to facilitating this broader conversation and (b) point out what I see as an important potential failure mode for an effort like this.
You characterize The Altruist at first as:
a news agency that provides journalistic coverage of EA topics and organisations
This sounds more or less like a trade publication along the lines of Advertising Age or Publishers Weekly, or perhaps a subject-specific publication oriented more toward the general public, like Popular Science or Nautilus. Generally speaking, I think something like the former is a good idea, though trade publications are generally targeted at those working within an industry. I will describe later on why I am not sure the latter is feasible.
But you go on to say:
Other rough comparisons include The Atlantic, The Economist, the New Yorker, Current Affairs, Works in Progress, and Unherd
These publications are very different from each other. The Economist (where, full disclosure, I worked for a short time) is a general interest newspaper with a print circulation of ~1 million. The New Yorker is a highbrow weekly magazine known for its longform journalistic content. The Atlantic is an eclectic monthly that leans heavily on its regular output of short-form, nonreported digital content. Current Affairs is a bimonthly political magazine with an explicitly left-wing cultural and political agenda. Works in Progress is small, completely online, wholly dedicated to progress studies, and generally nonreported.
Unherd is evidently constructed in opposition to various trends and themes in mainstream political and cultural discourse, and its goal is to disrupt the homogeneity of that discourse. I really enjoy it, but I worry that it sometimes typifies the failure mode I have in mind. Broadly, that failure mode is this: by defining itself in opposition to the dominant way of thinking, an outlet can sort potential readers out of being interested.
Consider: if a media outlet mainly publishes content that conflicts with the modal narrative, then the modal reader encountering it will find mostly content that challenges their views. I think it is a pernicious but nonetheless reliable feature of the media landscape that most readers who stumble onto such a publication will typically stumble off immediately to another, more comfortable one. I worry that a lot of EA is challenging enough that this could happen with something like The Altruist.
This may actually be fine; that’s why I harp on the precision of the comparison classes. I think Works in Progress, for instance, is likely to serve the progress studies community very well in the years to come, and an EA version of that would serve well the initial goal you describe of improving resources for outreach. But I don’t think that it would do a particularly good job of mitigating reputational risk or increasing community growth, because it would be a niche publication that might find it difficult to earn the trust of readers who find EA ideas challenging (in my experience, this is most people).
So I think as far as new publications go, we may have to pick between the various goals you have helpfully laid out here. But my aspirations for EA in journalism are a bit higher. Here’s my question: what is an EA topic? It is not really obvious to me that there is such a thing. To most people, it is not intuitive, even when you explain, that there is something that ties together (for instance) worrying about AI risk, donating to anti-malaria charities, supporting human challenge trials, and eating vegan.
This is because EA is a way of approaching questions about how to do good in the world, not a collection of answers to those questions.
So my aspiration for journalism in general is not only that it more enthusiastically tackle those issues which this small and idiosyncratic community of people has determined are important. I also think it would be good if journalism in general moved in a more EA-aligned or EA-aware direction on all questions. I think that, counterfactually, the past two decades of journalism in the developed world would look very different if the criterion for newsworthiness were more utilitarian, and if editorial judgments more robustly modeled truth-seeking behavior. Consequently, my (weak, working) hypothesis is that the world would be a better place. I also think such a world would be an easier place to grow the community, to combat bad-faith criticism, and to absorb and respond to good-faith critique.
One way to try to make this happen today would be to run a general-interest publication with an editorial position that is openly EA, much as The Economist’s editorial slant is classically liberal. Such a publication would have to cover everything, not just deworming and the lives of people in the far future. But it would, of course, cover those things too.
To bring things back down to the actual topic of conversation: the considerations you have raised here are the right ones. My core concern is that a publication like this will try to do too many things at once, and the reason I’ve written so much above is to try to articulate some additional considerations that I hope will be useful in narrowing down its purpose.
While I’m skeptical of the idea that the particular causes you’ve mentioned could truly end up being cost-effective paths to reducing suffering, I’m sympathetic to the idea that improving the effectiveness of activity in putatively non-effective causes is potentially itself effective. What interventions do you have in mind to improve effectiveness within these domains?
Now that you’ve given examples, can you provide an account of how increased funding in these areas would lead to improved well-being, lives preserved, DALYs averted, etc., in expectation? Do you expect that targeted funds could be cost-competitive with GW top charities or the like?
This was also gratifying for us to see, but it’s probably important to note that our approach incorporates weights from both GiveWell and HLI at different points, so the estimates are not completely independent.