I think your last sentence is critical—coming up with ways to improve epistemic practices and legibility is a lot easier where there are no budget constraints! It’s hard for me to assess cost vs. benefit for suggestions, so the suggestions below should be taken with that in mind.
For any of HLI’s donors who currently have it on epistemic probation: Getting out of epistemic probation generally requires additional marginal resources. Thus, it generally isn’t a good idea to reduce funding based on probationary status. That would make about as much sense as “punishing” a student on academic probation by taking away their access to tutoring services they need to improve.
The suggestions below are based on the theory that the main source of probationary status—at least for individuals who would be willing to lift that status in the future—is the confluence of the overstated 2022 communications and some issues with the SM CEA. They lean a bit toward “cleaner and more calibrated public communication” because I’m not a statistician, but also because I personally value that in assessing the epistemics of an org that makes charity recommendations to the general public. I also lean in that direction because I worry that setting too many substantive expectations for future reports will unduly suppress the public release of outputs.
I am concerned that HLI is at risk of second-impact syndrome and would not, as a practical matter, survive a similar set of mistakes in the re-analysis of SM or in its next few major recommendations. For that reason, I have not held back suggestions merely because I predict they could slow HLI’s plans to some extent or impose moderately significant resource costs.
All of these come from someone who wants HLI to succeed. I think we need to move future conversations about HLI in a “where do we go from here” direction rather than spending a lot of time and angst re-litigating the significance and import of previously-disclosed mistakes.[1] I’m sure this thread has already consumed a lot of HLI’s limited time; I certainly do not expect a reply.
A: Messaging Calibration
For each research report, you could score and communicate the depth/thoroughness of the research report, the degree of uncertainty, and the quality of the available evidence. For the first, the scale could be something like 0 = Don’t spend more than $1 of play money on this; 10 = We have zero hesitation with someone committing > $100MM on this without further checking. For the materials you put out (website materials, Forum posts, reports), the material should be consistent with your scores. Even better, you could ask a few outside people to read draft materials (without knowing the scores) and tell you what scores the material implies to them.
I think it’s perfectly OK for an org to put out material that has some scores of 4 or 5 due to resource constraints, deprioritization due to limited room for funding or unpromising results, etc. Given its resources, its scope of work, the areas it is researching, and the state of other work in those areas, I don’t think HLI can realistically aim for scores of 9 or 10 across the board in the near future. But the messaging needs to match the scores. In fact, I might aim for messaging that is slightly below the scores. I say that because the 2022 Giving Season materials suggest HLI’s messaging “scale” may be off, and adding a tare weight could serve as an interim fix.
I think HLI is in a challenging spot given GiveWell’s influence and resources. I further think that most orgs in HLI’s position would feel a need to “compete” with GiveWell, and that some of the 2022 messaging suggests that may be the case. I think that pressure would put most orgs at risk of projecting more confidence and certainty than the data allow, and so it’s particularly important that orgs facing that kind of pressure carefully calibrate their messaging.
B: Identification of Major Hinges
For each recommendation, there could be a page on major hinges, assumptions, methodological critical points, and the like. It should be legible to well-educated generalists, and there should be a link to this page on the main recommendation page, in Forum posts, etc. For bonus points, you could code an app that allows the user to see how the results change based on various hinges. For example, for the SM recommendation, I would have liked to see things like the material below. (Note that some examples are based on posted criticisms of the SM CEA, but the details are not meant to be taken literally.)
X% of the projected impact comes from indirect effects on family members (“spillovers”), for which the available research is limited. We estimate that each family member benefits 38% as much as the person receiving services. See pages ___ of our report for more information. Even a moderate change in this estimate could significantly change our estimate of WELLBYs per $1,000 spent.
In estimating the effect of the SM program, we included the results of two studies conducted by StrongMinds of unknown quality. These studies showed significantly better results than most others, and the result of one study is approaching the limits of plausibility. If we had instead decided to give these two studies zero credence in our model, our estimate of SM’s impact would have decreased by Y%.[2] See pages ___ of our report for more information.
We considered 39 studies in estimating SM’s effect size. There was significantly wider variance than expected in the effects reported by the studies (“heterogeneity”), which makes analysis more difficult. About C% of the reported effect is based on five specific studies. Moreover, there were signs that higher-quality studies showed lower effects. Although we attempted to correct for these issues, it is possible that we did not fully succeed. We subjectively estimate there is at least a 10% chance that our estimate is at least 20% too high due to these effects. See pages ___ of our report for more information.
The data show a moderately pronounced Gandalf effect. There are two generally accepted ways to address a Gandalf effect. We used a Gondor correction for the reasons described at pages ___ of our report. However, using a Rohan correction would have been a reasonable alternative and would have reduced the estimated impact by 11%.
Presumably you would already know where the hinges and critical values were, so listing them in lay-readable form shouldn’t require too much effort. But doing so protects against people getting the impression that the overall conclusion isn’t appropriately caveated, that you didn’t make it clear enough how much of a role study A or factor B played, etc. Of course, this section could list positive factors too (e.g., we used the Rohan correction even though it was a close call and the Gondor correction would have boosted impact 11%).
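The “app” suggestion above could start as something very small: a script that exposes each hinge as a parameter and reports how the headline figure moves when one hinge changes. Here is a minimal sketch in Python, where every number (direct effect, household size, spillover ratio, cost) is invented for illustration and none reflects HLI’s actual model:

```python
# Hypothetical hinge explorer. All parameter values are invented
# placeholders for illustration, not HLI's actual model inputs.

def wellbys_per_1000(direct_effect=2.0, household_size=4,
                     spillover_ratio=0.38, cost_per_person=170.0):
    """Estimate WELLBYs per $1,000, combining the direct effect on the
    recipient with spillover effects on other household members."""
    spillover = (household_size - 1) * spillover_ratio * direct_effect
    total_per_person = direct_effect + spillover
    return total_per_person * (1000.0 / cost_per_person)

baseline = wellbys_per_1000()
halved_spillover = wellbys_per_1000(spillover_ratio=0.19)
print(f"baseline: {baseline:.1f} WELLBYs/$1k")
print(f"spillover halved: {halved_spillover:.1f} "
      f"({100 * (1 - halved_spillover / baseline):.0f}% lower)")
```

Even this toy version makes the spillover hinge legible: halving the hypothetical spillover ratio cuts the headline figure by roughly a quarter, which tells a reader immediately how much rides on that one estimate.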
C: Red-Teaming and Technical Appendix
In my field (law), we’re taught that you do not want the court to learn about unfavorable facts or law only from your opponents’ brief. Displaying up front that you saw an issue rules out two possible unfavorable inferences a reader could draw: that you didn’t see the issue, or that you saw the issue and hoped neither the court nor the other side’s lawyer would notice. Likewise, more explicit recognition of certain statistical information in a separate document may be appropriate, especially in an epistemic-probation situation. I do recognize that this could incur some costs.
I’m not a statistician by any means, but to the extent that you might expect an opposition research team to express significant concern about a finding—such as the pre-registered reports showing much lower effect sizes than the unregistered ones—I think it would be helpful to acknowledge and respond to that concern upfront. I recognize that potentially calls for a degree of mind-reading, and that this approach may not work if the critics dig for more arcane stuff. But even if the critics find something that the red team didn’t, the disclosure of some issues in a technical appendix still legibly communicates a commitment to self-critical analysis.
D: Correction Listing and Policy
For each recommendation, there could be a page for issues, corrections, subsequent developments, and the like. It should be legible to well-educated generalists, and there should be a link to this page on the main recommendation page, in Forum posts, etc. There could also be a policy that explains what sorts of issues will trigger an entry on that page and the timeframe in which information will be added, as well as trigger criteria for conspicuously marking the recommendation/report as under review, withdrawing it pending further review, and so on. The policy should be in effect for as long as there is a recommendation based on the report, or for a minimum of G years (unless the report and any recommendation are formally withdrawn).
The policy would need to include a definition of materiality and clearly specified claims. Claims could be binary (SM cost-effectiveness > GiveDirectly) or quantitative (SM cost-effectiveness = 7.5X GiveDirectly). A change could be defined as material if it changed the probability of a binary claim by more than Y% or changed a quantitative claim by more than Z%. It could provide that any new issue will be added to the issues page within A days of discovery unless it is determined that the issue is not reasonably likely (at least Q% chance) to be material. It could provide that there will be a determination of materiality (and updated credences or estimates as necessary) within B days. The policy could describe which website materials, etc. would need to be corrected based on the degree of materiality.
If for some reason the time limit for full adjudication cannot be met, then all references to that claim on HLI’s website, the Forum, etc. need to be clearly marked as [UNDER REVIEW] or pulled so that the reader won’t be potentially misled by the material. In addition, all materials need to be marked [UNDER REVIEW] if at any time there is a substantial possibility (at least J%) that the claim will ultimately be withdrawn.
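To illustrate, the trigger logic in the last two paragraphs could be pinned down in a few lines of code. This is a hypothetical sketch: the `Claim` fields and the 0.25 and 0.20 thresholds stand in for the Z% and J% placeholders above and are not proposed values:

```python
# Hypothetical sketch of pre-committed correction-policy triggers.
# Thresholds are placeholders for the Z% / J% values discussed above.
from dataclasses import dataclass

@dataclass
class Claim:
    headline_multiple: float       # e.g. "8x GiveDirectly"
    material_change_pct: float     # Z%: relative change that counts as material
    review_withdrawal_prob: float  # J%: withdrawal probability forcing review

def status(claim: Claim, revised_multiple: float, p_withdrawal: float) -> str:
    """Classify a revised estimate under the pre-committed policy."""
    change = abs(revised_multiple - claim.headline_multiple) / claim.headline_multiple
    if p_withdrawal >= claim.review_withdrawal_prob:
        return "UNDER REVIEW"
    if change >= claim.material_change_pct:
        return "correction required"
    return "no action"

claim = Claim(headline_multiple=8.0, material_change_pct=0.25,
              review_withdrawal_prob=0.20)
print(status(claim, revised_multiple=5.5, p_withdrawal=0.05))
```

Writing the thresholds down this concretely is much of the point: anyone can check a revised estimate against the pre-committed triggers without arguing afterward about what “material” means.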
This idea is ultimately intended to be about calibration and clear communication. If an org commits, in advance, to certain clear claims and a materiality definition, then the reader can compare those commitments against the organization’s public-facing statements and read them accordingly. For instance, if the headline number is 8X cash, but the org will only commit to following correction procedures if that dips below 4X cash, that tells the reader something valuable.
This is loosely akin to a manufacturer’s warranty, which is often valuable as much as a signal of the manufacturer’s confidence in the product as for the protection it provides. I recognize that larger orgs will find it easier to make corrections in a timely manner, and the community needs to give HLI more grace (both in terms of timelines and probably materiality thresholds) than it would give a larger organization.
Likewise, a policy stated in advance provides a better way to measure whether the organization is dealing appropriately with issues versus digging in its heels. It can commit the organization to make concrete adjustments to its claims or to affirm a position that any would-be changes do not meet pre-determined criteria. Hopefully, this would avoid—or at least focus—any disputes about whether the organization is inappropriately maintaining its position. Planting the goalposts in advance also cuts off any disputes about whether the org is moving the goalposts in response to criticism.
[two more speculative paragraphs here!] Finally, the policy could provide for an appeal of certain statistical/methodological issues to an independent non-EA expert panel by a challenger who believed HLI had applied its correction policy incorrectly. Costs would be determined by the panel based on its ruling. HLI would update its materials with any adverse finding, and prominently display any finding by the panel that it had made an unreasonable application under its policy (which is not the same as the panel agreeing with the challenger).
This might be easier to financially justify than a bounty program because it only creates exposure if there is a material error, HLI swings and misses on the opportunity to correct it, and the remaining error is clear enough for a challenger to risk money. I am generally skeptical of “put your own money at risk” elements in EA culture for various reasons, but I don’t think the current means of dispute resolution are working well for either HLI or the community.
[1] This is not meant to discourage discussions of any new issues with the recommendation or underlying analysis that may be found.
[2] I think this is the fairest way to report this—because the studies were outliers, they may have been hingier than their level of credence.
This was really helpful, thanks! I’ll discuss it with the team.