Here’s my (working) model. I’m not taking a position on how to classify HLI’s past mistakes or whether applying the model to HLI is warranted, but I think it’s helpful to try to get what seems to be happening out in the open.
Caveat: Some of the paragraphs rely more heavily on my assumptions, extrapolations, and suggestions about the “epistemic probation” concept than on my read of the comments on this and other threads. And of course that concept should be seen mostly as a metaphor.
Some people think HLI made some mistakes that impact their assessment of HLI’s epistemic quality (e.g., some combination of not catching clear-cut model errors that were favorable to its recommended intervention, a series of modeling choices that, while defensible, were as a whole rather favorable to the same intervention, and some overconfident public statements).
Much of the concern here seems to be that HLI may be engaged in motivated reasoning (which could be 100% unconscious!) on the theory that its continued viability as an organization is dependent on producing some actionable results within the first few years of its existence.
These mistakes have updated those people’s assessment of HLI’s epistemic quality, changing their view of HLI from “standard” to “on epistemic probation”—a term I made up and flesh out below.
An organization on epistemic probation should expect greater scrutiny of its statements and analyses, and should not expect the same degree of grace / benefit of the doubt that organizations in standard status will get. These effects would seem to logically follow from the downgrade in priors about epistemic quality referenced in (1).
While on probation, an organization will be judged more strictly for mild-to-moderate epistemic faults. Here, that would include (e.g.) the statement James expressed concern about.
Practically, that means that the organization should err on the side of being conservative in its assertions, should devote extra resources toward red-teaming its reports, etc. While these steps may slow impact, they are necessary to demonstrate the organization’s good epistemics and to restore community confidence in its outputs.
An organization can exit epistemic probation by demonstrating that its current epistemics are solid over a sufficient period of time, and that it has controls in place to prevent a recurrence of whatever led to its placement on probation in the first place. In other words, subsequent actions need to justify a re-updating of priors to place the organization back into the “standard” zone of confidence in epistemic soundness. An apology will usually be necessary but not sufficient.
For HLI, the exit plan probably includes producing a new transparent, solid CEA of StrongMinds that stands up to external scrutiny. (Withdrawing that CEA might also work.)
It probably also includes a showing that sufficient internal or external controls are now in place to minimize the risk of recurrence. This could be a commitment to external peer review of the revised StrongMinds CEA as well as other new major recommendations and the reports on which they are based, a commitment to offer bounties for catching mistakes in major CEAs (with a third-party adjudicator), etc., etc.
Finally, the exit plan probably includes a period of consistently not making statements on the Forum, its website, and other arenas that seem to be a stretch based on the underlying evidence.
Of course, HLI’s funding position makes it more challenging for it to meet some of these steps to exit probation. Conditional on HLI having properly been placed on probation, I don’t know to what extent the existence of financial constraints should alter the quantum of evidence necessary to remove it from probation.
I think the concept of epistemic probation is probably useful. It is important to police this sort of thing. Epistemic probation gives the organization a chance to correct the perceived problem, and gives the community an action to take in response to problems it deems significant that isn’t excluding the organization from the community.
For better and for worse, each of us has to decide for ourselves whether an organization is on epistemic probation in our eyes. This poses a problem, because the organization may not realize that a number of people have placed it on epistemic probation. So while I don’t like the tone or some of the contents of certain comments, I think it’s critical that the community provide feedback that puts organizations on notice of their probationary status in the eyes of many people. If many people silently place an organization on probation, and the organization fails probation (perhaps because it did not know it was in hot water), then those people are going to treat the organization as excluded for its epistemic failures. That’s a bad outcome for all involved.[1]
One other point, which is also more challenging due to decentralization: The end goal of probation is restoration to good standing, and so it needs to be clear to the organization what it needs to do (and avoid doing) in order to exit probation. I tried to model this in points 6(a) to 6(c) above [conditioned on my assumptions about why people have HLI on probation], as well as in the example to my comment to Greg about whether HLI has been “maintain[ing]” its position after errors were pointed out. Of course, different people who have placed HLI on probation would have different opinions on what is necessary for HLI to exit that status.
Some people may have already decided to treat HLI as excluded, but my hunch is that these people are fairly small in number compared to the number who have HLI on probation.
[I don’t plan to make any (major) comments on this thread after today. It’s been time-and-energy intensive, and I plan to move back to other priorities.]
Hello Jason,
I really appreciated this comment: the analysis was thoughtful and the suggestions constructive. Indeed, it was a lightbulb moment. I agree that some people do have us on epistemic probation, in the sense that they think it’s inappropriate to grant us the principle of charity and that they should instead look for mistakes (and conclude incompetence or motivated reasoning if they find them).
I would disagree that HLI should be on epistemic probation, but I am, of course, at risk of bias here, and I’m not sure I can defend our work without coming off as counter-productively defensive! That said, I want to make some comments that may help others understand what’s going on so they can form their own view, then set out our mistakes and what we plan to do next.
Context
I suspect that some people have had HLI on epistemic probation since we started—for perhaps understandable reasons. These are:
We are advancing a new methodology, the happiness/SWB/WELLBY approach. Although there are decades of work in social science on this and it’s now used by the UK government, this was new to most EAs and they could ask, “if it’s so good, why aren’t we already doing it?” Of course, new ideas have to start sometime.
HLI is a second-generation EA org that is setting out to publicly re-assess some conclusions of an existing (understandably!) well-beloved first-generation org, GiveWell. I can’t think of another case like this; usually, EA orgs do non-overlapping work. Some people have welcomed us offering a different perspective, others have really not liked it; we’ve clearly ruffled some feathers.
As a result of 1 and 2, there is something of a status quo effect and scepticism that wouldn’t be the case if we were offering recommendations in a new area for the first time. To illustrate, suppose you know nothing about global health and wellbeing and someone tells you they’ve done lots of research based on happiness measures and they’ve found cash transfers are good, treating depression is about 7x as good as cash, deworming has no clear long-run effect, and life-saving bednets are 1-8x cash depending on difficult moral assumptions. I expect most people would say “yeah, that seems reasonable” rather than “why are you engaged in motivated reasoning?”.
Our mistakes (so far)
The discussion in this thread has been a bit vague about what mistakes HLI has made that have led to suspicion. I want to set out what, from my perspective, those are. I reserve the right to add things to this list! We’ll probably put a version of this on our website.
1. Not modelling spillovers in our cash vs psychotherapy meta-analyses.
This was the first substantive empirical criticism we received. We had noted in the original report that not including spillovers was a limitation in the analysis, but we hadn’t explicitly modelled them. This was for a couple of reasons. We hadn’t seen any other EA org empirically model spillovers, so it seemed a non-standard thing to do, and the data were low-quality anyway, so we hadn’t thought much about including them. We were surprised when some claimed this was a serious (possibly deliberate) omission.
That said, we took the objection very seriously and reallocated several months of staff time in early 2022 from other topics to produce the best spillovers analysis we could on the available data, which we then shared with others. In the end, it only somewhat reduced the result (therapy went from 12x cash to 9x).
2. We were too confident and clumsy in our 2022 Giving Season post.
At that point, we incorporated nearly all the available data into our cash and psychotherapy meta-analyses, accounted for spillovers, plus looked at deworming (for which long-term effects on wellbeing are non-significant) and life-extending vs life-saving interventions (where psychotherapy seemed better under almost all assumptions). So we felt proud of our work and quite confident.
In retrospect, as I’ve alluded to before, we were overconfident, our language and execution were clumsy, and this really annoyed some people. I’m sorry about this and I hope people can forgive us. We have since spent some time internally thinking about how to communicate our confidence in our conclusions.
3. Not communicating better how we’d done our meta-analysis of psychotherapy, including that we hadn’t taken StrongMinds’ own studies at face value.
SimonM’s post has been mentioned a few times in this thread. As I mentioned in point 3 here, SimonM criticised the recommendation of StrongMinds based on concerns about StrongMinds’ own study, not our analysis. He said he didn’t engage with our analysis because he was ‘confused’ about methodology but that, in any case “key thing about HLI methodology is that [it] follows the same structure as the Founders Pledge analysis and so all the problems I mention above regarding data apply just as much to them as FP”. However, our evaluation didn’t have the problems he was referring to because of how we’d done the meta-analysis.
In retrospect, it seems the fact that we’d done a meta-analysis, and not put much weight on StrongMinds’ own study, wasn’t something people knew, and we should have communicated that much more prominently; it was buried in some super long posts. We need to own our inadequate comms there. It was tough to learn that he and some other members of the EA community had been thinking of us with such suspicion. Psychologically, the team took this very hard.
4. We made some errors in the spillovers analysis (as pointed out by James Snowden).
The main error here was that, as my colleague Joel conceded (“I blundered”), he coded some data the wrong way, which reduced the result from 9x to 7.5x cash transfers. This is embarrassing but not, I think, sinister by itself. These things happen, they’re awkward, but not well explained by motivated reasoning: coding errors are checkable and, in any case, the overall conclusion is unchanged with the correction (see my comment here too).
I recognise that some will think this a catalogue of errors best explained by a corrupting agenda; the reader must make up their own mind. Two of the four are analysis errors of the sort that routinely appear when researchers review each other’s work. Two are errors in communication, either about being overconfident, or not communicating enough.
Next steps:
Jason suggests those on epistemic probation should provide a credible exit plan. Leaving aside whether we are, or should be, on epistemic probation, I am happy to set out what we plan to do next. For our re-evaluation of psychotherapy, we had already set this out in our new research agenda (Section 2.1), which we published at the same time as this post. We are still committed to digging into the details of this analysis that have been raised.
About bounties: I like this idea and wish we could implement it, but in light of our funding position, I don’t think we’ll be able to do so in the near-term.
In addition, we’ll consider adding something like an ‘Our mistakes’ page to our website to chronicle our blunders. At the least, we’ll add a version history to our cost-effectiveness analysis so people can see how the numbers have changed over time and why.
I am open to—indeed, I welcome—further constructive suggestions about what work people would like us to do to change their minds and/or reassure them. I do ask that these are realistic: as noted, we are a small, funding-and-capacity-constrained team with a substantial research agenda. We therefore might not be able to take all suggestions on board.
I think your last sentence is critical—coming up with ways to improve epistemic practices and legibility is a lot easier where there are no budget constraints! It’s hard for me to assess cost vs. benefit for suggestions, so the suggestions below should be taken with that in mind.
For any of HLI’s donors who currently have it on epistemic probation: Getting out of epistemic probation generally requires additional marginal resources. Thus, it generally isn’t a good idea to reduce funding based on probationary status. That would make about as much sense as “punishing” a student on academic probation by taking away their access to tutoring services they need to improve.
The suggestions below are based on the theory that the main source of probationary status—at least for individuals who would be willing to lift that status in the future—is the confluence of the overstated 2022 communications and some issues with the SM CEA. They lean a bit toward “cleaner and more calibrated public communication” because I’m not a statistician, but also because I personally value that in assessing the epistemics of an org that makes charity recommendations to the general public. I also lean in that direction because I worry that setting too many substantive expectations for future reports will unduly suppress the public release of outputs.
I am concerned that HLI is at risk of second-impact syndrome and would not, as a practical matter, survive a set of similar mistakes on the re-analysis of SM or on its next few major recommendations. For that reason, I have not refrained from offering suggestions based on my prediction that they could slow down HLI’s plans to some extent, or incur moderately significant resource costs.
All of these come from someone who wants HLI to succeed. I think we need to move future conversations about HLI in a “where do we go from here” direction rather than spending a lot of time and angst re-litigating the significance and import of previously-disclosed mistakes.[1] I’m sure this thread has already consumed a lot of HLI’s limited time; I certainly do not expect a reply.
A: Messaging Calibration
For each research report, you could score and communicate the depth/thoroughness of the research report, the degree of uncertainty, and the quality of the available evidence. For the former, the scale could be something like 0 = Don’t spend more than $1 of play money on this; 10 = We have zero hesitation with someone committing > $100MM on this without further checking. For the materials you put out (website materials, Forum posts, reports), the material should be consistent with your scores. Even better, you could ask a few outside people to read draft materials (without knowing the scores) and tell you what scores the material implies to them.
I think it’s perfectly OK for an org to put out material that has some scores of 4 or 5 due to resource constraints, deprioritization due to limited room for funding or unpromising results, etc. Given its resources, its scope of work, the areas it is researching, and the state of other work in those areas, I don’t think HLI can realistically aim for scores of 9 or 10 across the board in the near future. But the messaging needs to match the scores. In fact, I might aim for messaging that is slightly below the scores. I say that because the 2022 Giving Season materials suggest HLI’s messaging “scale” may be off, and adding a tare weight could serve as an interim fix.
I think HLI is in a challenging spot given GiveWell’s influence and resources. I further think that most orgs in HLI’s position would feel a need to “compete” with GiveWell, and that some of the 2022 messaging suggests that may be the case. I think that pressure would put most orgs at risk of projecting more confidence and certainty than the data allow, and so it’s particularly important that orgs facing that kind of pressure carefully calibrate their messaging.
B: Identification of Major Hinges
For each recommendation, there could be a page on major hinges, assumptions, methodological critical points, and the like. It should be legible to well-educated generalists, and there should be a link to this page on the main recommendation page, in Forum posts, etc. For bonus points, you could code an app that allows the user to see how the results change based on various hinges. For example, for the SM recommendation, I would have liked to see things like the material below. (Note that some examples are based on posted criticisms of the SM CEA, but the details are not meant to be taken literally.)
X% of the projected impact comes from indirect effects on family members (“spillovers”), for which the available research is limited. We estimate that each family member benefits 38% as much as the person receiving services. See pages ___ of our report for more information. Even a moderate change in this estimate could significantly change our estimate of WELLBYs per $1,000 spent.
In estimating the effect of the SM program, we included the results of two studies conducted by StrongMinds of unknown quality. These studies showed significantly better results than most others, and the result of one study is approaching the limits of plausibility. If we had instead decided to give these two studies zero credence in our model, our estimate of SM’s impact would have decreased by Y%.[2] See pages ___ of our report for more information.
We considered 39 studies in estimating SM’s effect size. There was significantly wider-than-expected variance in the effects reported by the studies (“heterogeneity”), which makes analysis more difficult. About C% of the reported effect is based on five specific studies. Moreover, there were signs that higher-quality studies showed lower effects. Although we attempted to correct for these issues, it is possible that we did not fully succeed. We subjectively estimate there is at least a 10% chance that our estimate is at least 20% too high due to these effects. See pages ___ of our report for more information.
The data show a moderately pronounced Gandalf effect. There are two generally accepted ways to address a Gandalf effect. We used a Gondor correction for the reasons described at pages ___ of our report. However, using a Rohan correction would have been a reasonable alternative and would have reduced the estimated impact by 11%.
Presumably you would already know where the hinges and critical values were, so listing them in lay-readable form shouldn’t require too much effort. But doing so protects against people getting the impression that the overall conclusion isn’t appropriately caveated, that you didn’t make it clear enough how much role study A or factor B played, etc. Of course, this section could list positive factors too (e.g., we used the Rohan correction even though it was a close call and the Gondor correction would have boosted impact 11%).
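As a rough sketch of the “app” idea mentioned above, the snippet below (in Python) shows one way a hinge-sensitivity explorer could work. Every parameter name and number here is a hypothetical placeholder chosen for illustration, not a figure from HLI’s actual model.

```python
# Minimal sketch of a hinge-sensitivity explorer. All numbers below are
# hypothetical placeholders, not figures from HLI's actual analysis.

def cost_effectiveness_multiple(
    direct_effect: float,     # WELLBYs per $1,000 for the person treated (hypothetical)
    household_size: float,    # number of other household members (hypothetical)
    spillover_ratio: float,   # benefit to each household member relative to the recipient
    cash_benchmark: float,    # WELLBYs per $1,000 for cash transfers (hypothetical)
) -> float:
    """Return the intervention's impact as a multiple of cash transfers."""
    total_effect = direct_effect * (1 + household_size * spillover_ratio)
    return total_effect / cash_benchmark

# Show how the headline multiple moves as the spillover hinge varies.
for spillover in (0.20, 0.38, 0.50):
    multiple = cost_effectiveness_multiple(
        direct_effect=10.0, household_size=4, spillover_ratio=spillover,
        cash_benchmark=8.0,
    )
    print(f"spillover ratio {spillover:.2f} -> {multiple:.1f}x cash")
```

A reader (or a simple web front end) could vary any of these inputs to see which assumptions the bottom line actually hinges on.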
C: Red-Teaming and Technical Appendix
In my field (law), we’re taught that you do not want the court to learn about unfavorable facts or law only from your opponents’ brief. Displaying up front that you saw an issue rules out two possible unfavorable inferences a reader could draw: that you didn’t see the issue, or that you saw the issue and hoped neither the court nor the other side’s lawyer would notice. Likewise, more explicit recognition of certain statistical information in a separate document may be appropriate, especially in an epistemic-probation situation. I do recognize that this could incur some costs.
I’m not a statistician by any means, but to the extent that you might expect an opposition research team to express significant concern about a finding—such as the pre-registered reports showing much lower effect sizes than the unregistered ones—I think it would be helpful to acknowledge and respond to that concern upfront. I recognize that potentially calls for a degree of mind-reading, and that this approach may not work if the critics dig for more arcane stuff. But even if the critics find something that the red team didn’t, the disclosure of some issues in a technical appendix still legibly communicates a commitment to self-critical analysis.
D: Correction Listing and Policy
For each recommendation, there could be a page for issues, corrections, subsequent developments, and the like. It should be legible to well-educated generalists, and there should be a link to this page on the main recommendation page, in Forum posts, etc. There could also be a policy that explains what sorts of issues will trigger an entry on that page and the timeframe in which information will be added, as well as trigger criteria for conspicuously marking the recommendation/report as under review, withdrawing it pending further review, and so on. The policy should be in effect for as long as there is a recommendation based on the report, or for a minimum of G years (unless the report and any recommendation are formally withdrawn).
The policy would need to include a definition of materiality and clearly specified claims. Claims could be binary (SM cost-effectiveness > GiveDirectly) or quantitative (SM cost-effectiveness = 7.5X GiveDirectly). A change could be defined as material if it changed the probability of a binary claim more than Y% or changed a quantitative claim more than Z%. It could provide that any new issue will be added to the issues page within A days of discovery unless it is determined that the issue is not reasonably likely (at least Q% chance) to be material. It could provide that there will be a determination of materiality (and updated credences or estimates as necessary) within B days. The policy could describe which website materials, etc. would need to be corrected based on the degree of materiality.
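As a loose illustration of how such a materiality test could be written down, here is a minimal sketch in Python; the claim types, thresholds, and figures are hypothetical placeholders, not proposed values for HLI’s actual policy.

```python
# Minimal sketch of a materiality check. Thresholds and figures are
# hypothetical placeholders, not proposed policy values.
from dataclasses import dataclass
from typing import Union

@dataclass
class BinaryClaim:
    description: str      # e.g. "cost-effectiveness exceeds the cash benchmark"
    probability: float    # current credence in the claim

@dataclass
class QuantitativeClaim:
    description: str      # e.g. "cost-effectiveness = Nx the cash benchmark"
    estimate: float       # current point estimate

Claim = Union[BinaryClaim, QuantitativeClaim]

def is_material(old: Claim, new: Claim,
                prob_threshold: float = 0.10,   # hypothetical "Y%" (10 percentage points)
                rel_threshold: float = 0.20) -> bool:  # hypothetical "Z%" (20% relative change)
    """Return True if the change between versions of a claim crosses the
    (hypothetical) materiality thresholds."""
    if isinstance(old, BinaryClaim) and isinstance(new, BinaryClaim):
        return abs(new.probability - old.probability) > prob_threshold
    return abs(new.estimate - old.estimate) / abs(old.estimate) > rel_threshold

# Example: an illustrative correction moves a multiplier from 8.0x to 6.0x cash.
before = QuantitativeClaim("cost-effectiveness vs cash", 8.0)
after = QuantitativeClaim("cost-effectiveness vs cash", 6.0)
print(is_material(before, after))  # True: a 25% change exceeds the 20% illustrative threshold
```

The point is not the specific code but that, once the claims and thresholds are written down this precisely, anyone can check whether a later change triggers the correction procedure.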
If for some reason the time limit for full adjudication cannot be met, then all references to that claim on HLI’s website, the Forum, etc. need to be clearly marked as [UNDER REVIEW] or pulled so that the reader won’t be potentially misled by the material. In addition, all materials need to be marked [UNDER REVIEW] if at any time there is a substantial possibility (at least J%) that the claim will ultimately be withdrawn.
This idea is ultimately intended to be about calibration and clear communication. If an org commits, in advance, to certain clear claims and a materiality definition, then the reader can compare those commitments against the organization’s public-facing statements and read them accordingly. For instance, if the headline number is 8X cash, but the org will only commit to following correction procedures if that dips below 4X cash, that tells the reader something valuable.
This is loosely akin to a manufacturer’s warranty, whose value lies as much in what it signals about the manufacturer’s confidence in the product as in anything else. I recognize that larger orgs will find it easier to make corrections in a timely manner, and the community needs to give HLI more grace (both in terms of timelines and probably materiality thresholds) than it would give a larger organization.
Likewise, a policy stated in advance provides a better way to measure whether the organization is dealing appropriately with issues versus digging in its heels. It can commit the organization to make concrete adjustments to its claims or to affirm a position that any would-be changes do not meet pre-determined criteria. Hopefully, this would avoid—or at least focus—any disputes about whether the organization is inappropriately maintaining its position. Planting the goalposts in advance also cuts off any disputes about whether the org is moving the goalposts in response to criticism.
[two more speculative paragraphs here!] Finally, the policy could provide for an appeal of certain statistical/methodological issues to an independent non-EA expert panel by a challenger who believes HLI’s application of its correction policy was incorrect. Costs would be determined by the panel based on its ruling. HLI would update its materials with any adverse finding, and prominently display any finding by the panel that it had made an unreasonable application of its policy (which is not the same as the panel agreeing with the challenger).
This might be easier to financially justify than a bounty program because it only creates exposure if there is a material error, HLI swings and misses on the opportunity to correct it, and the remaining error is clear enough for a challenger to risk money. I am generally skeptical of “put your own money at risk” elements in EA culture for various reasons, but I don’t think the current means of dispute resolution are working well for either HLI or the community.
This is not meant to discourage discussions of any new issues with the recommendation or underlying analysis that may be found.
I think this is the fairest way to report this—because the studies were outliers, they may have been hingier than their level of credence would suggest.
This was really helpful, thanks! I’ll discuss it with the team.
I could imagine that you get more people interested in providing funding if you pre-commit to doing things like bug bounties conditional on getting a certain amount of funding. Does this seem likely to you?
I really like this concept of epistemic probation—I agree also on the challenges of making it private and exiting such a state. Making exiting criticism-heavy periods easier probably makes it easier to levy in the first place (since you know that it is escapable).