How we failed

Here “we” means the broader EA and rationalist communities.


Learning from failures is just as important as learning from successes. This post describes a few mistakes and failures which seem instructive.

Failure to improve US policy in the first wave

Early in the pandemic, the rationality community was right about important things (e.g. masks), often very early on. We won the epistemic fight, and the personal instrumental fight. (Consider our masks discourse, or Microcovid use, or rapid testing to see our friends.)

At the same time, the network distance from the core of the community to e.g. people in the CDC is not that large: OpenPhil is a major funder of the Johns Hopkins Center for Health Security, and the network distance from CHS to top public health institutions is presumably small. As such, we could say the US rationality community failed instrumentally, given its apparently short distance to influence.

Contrast this with the influence of LessWrong on UK policy via a single reader, Dominic Cummings. Or our influence on Czech policy (see What we tried).

Also, contrast this with the #masks4all movement. This movement was founded by the writer and speaker Petr Ludwig (pre-COVID, he was developing something like his own version of rationality and critical thinking, independent of the LessWrong cluster). After the success of grassroots DIY mask-making in Czechia, which led to the whole country starting to use masks within a week or so, he tried to export the “masks work” and “you can make them at home” memes globally. While counterfactuals are hard, this seems a major success, likely speeding up mask uptake in Western countries by weeks (consider the Marginal Revolution praise).

Where was the “microcovid calculator for countries”? (see EpiFor funding and medium-range models.)

Overreaction through 2021

Personal reactions within the community in February 2020 were sometimes exemplary; apparent overshoots (like copper tape on surfaces, postal package quarantines) were reasonable ex ante.

But the community was slow to update its behaviour in response to improved estimates of the infection fatality rate, improved estimates of long COVID risk, and improved knowledge of aerosol and droplet transmission. Anecdotally, the community was slow to return to relatively safe things like outdoor activities, even after full vaccination.

While risk tolerance is a personal input to decision-making rather than something to argue with, my impression is that many people’s COVID behaviour did not match the risk tolerance implicit in their other daily habits.
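As a toy illustration of what “implicit risk tolerance” means here, the sketch below compares an everyday risk budget with the risk of one post-vaccination activity. Every number in it is a placeholder chosen for the example, not an estimate from this post or any cited source.

```python
# Toy comparison of the risk tolerance implicit in an everyday habit
# with the marginal risk of a post-vaccination outdoor activity.
# All numbers are illustrative placeholders.

accepted_driving_micromorts_per_year = 100   # placeholder: yearly risk someone accepts from routine driving
activity_microcovids = 10                    # placeholder: infection chance from one outdoor meetup, in millionths
ifr_after_vaccination = 0.0005               # placeholder: infection-fatality rate for a vaccinated person

# 1 micromort = a one-in-a-million chance of death.
activity_micromorts = activity_microcovids * ifr_after_vaccination

print(f"Activity risk: {activity_micromorts:.4f} micromorts")
print("Fraction of the yearly driving risk budget:",
      f"{activity_micromorts / accepted_driving_micromorts_per_year:.2e}")
```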

Inverse gullibility

Many large institutions had a bad track record during the pandemic, and the heuristic “do not update much from their announcements” served many people well. However, in places the community went beyond this, to non-updates from credible sources and anti-updates from institutions.

Gullibility is believing things excessively: taking the claim “studies show” as certain. Inverse gullibility is disbelieving things excessively: failing to update at all when an update is warranted, or even inverting the update.
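A minimal sketch of the distinction in Bayesian terms (the prior and likelihood ratio are placeholders, not estimates of any real source’s reliability):

```python
# Sketch: proper updating vs. gullibility vs. inverse gullibility.

def posterior(prior, likelihood_ratio):
    """Bayesian update of P(claim) on evidence with the given likelihood ratio."""
    odds = prior / (1 - prior) * likelihood_ratio
    return odds / (1 + odds)

prior = 0.3   # prior credence in the claim
lr = 3.0      # a moderately informative announcement (placeholder)

proper = posterior(prior, lr)          # shift towards the claim
gullible = 1.0                         # "studies show", therefore certain
no_update = prior                      # refusing to update at all
inverted = posterior(prior, 1 / lr)    # updating in the opposite direction

print(proper, gullible, no_update, inverted)
```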

Example: the COVID research summaries of the FDA are often good; it’s in the policy guidance section that the wheels come off. But I often see people write off all FDA material.

My broad guess is that people who run on simple models like “governments are evil and lie to you” are very unlikely to be able to model governments well, and very unlikely to understand the parts of the solution space where governments can do important and useful things.

Failing to persuade reviewers that important methods were valid

More locally to our research: we used skilled forecasters for many of our estimates, as an input for our hybrid of mathematical and judgmental prediction. But relying on forecasters is often frowned upon in academic settings. Stylized dialogue:

Reviewer: “How did you get the parameters?”

EpiFor: “We asked a bunch of people who are good at guessing numbers to tell us!”

Reviewer: “That’s unrigorous, remove it.”
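For concreteness, here is a minimal sketch of the kind of step the dialogue is about: turning several forecasters’ guesses for a parameter into a pooled prior. The lognormal fit and equal-weight mixture are my own illustrative choices, not necessarily EpiFor’s actual procedure, and all values are made up.

```python
import numpy as np

# Hypothetical forecaster medians and 90% intervals for one model parameter.
forecasts = [  # (median, 5th percentile, 95th percentile)
    (0.30, 0.15, 0.55),
    (0.25, 0.10, 0.45),
    (0.40, 0.20, 0.70),
]

rng = np.random.default_rng(0)
samples = []
for median, lo, hi in forecasts:
    mu = np.log(median)
    sigma = (np.log(hi) - np.log(lo)) / (2 * 1.645)  # match the 90% interval width
    samples.append(rng.lognormal(mu, sigma, size=10_000))

pooled = np.concatenate(samples)  # equal-weight mixture over forecasters
print(f"pooled prior: median {np.median(pooled):.2f}, "
      f"90% CI ({np.percentile(pooled, 5):.2f}, {np.percentile(pooled, 95):.2f})")
```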

Similarly: the foremost point of doing NPI (non-pharmaceutical intervention; technical term for interventions including various ‘mandates’ and orders) research was to provide governments with cost-benefit estimates. In this Forum I don’t need to explain why. But academic studies of COVID policies usually lack estimates of the social cost, and instead merely estimate the transmission reduction. Why?

A clue: Our initial NPI preprint included an online survey about relative lockdown preferences, e.g. the self-reported disutility of bars closing for 2 weeks vs. a mask mandate being imposed for 2 months. This allowed us to actually give policy-relevant estimates of the cost (disutility) of NPIs, compared to the benefit (reduction of virus transmission).
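A minimal sketch of the kind of comparison this enables (the transmission reductions and elicited disutilities below are placeholders, not figures from the preprint):

```python
# Toy cost-benefit comparison of two NPIs.
npis = {
    # name: (reduction in R, mean disutility per person-week, weeks applied)
    "close bars for 2 weeks":    (0.05, 0.8, 2),
    "mask mandate for 2 months": (0.10, 0.1, 8),
}

for name, (r_reduction, disutility_per_week, weeks) in npis.items():
    cost = disutility_per_week * weeks
    print(f"{name}: R reduced by {r_reduction:.2f}, "
          f"total disutility {cost:.1f}, "
          f"disutility per 0.01 of R reduction {cost / (r_reduction * 100):.2f}")
```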

It proved hard to get this version published; the apparent subjectivity of the costs, the inclusion of economic methods in an epidemiology paper, and the specific choice of preference elicitation methods, etc, all exposed a large “attack surface” for reviewers. In the end, we just removed the cost-benefit analysis.

Clearly, internal documents of at least some governments will have estimated these costs. But in almost all cases these were not made public. Even then: as far as we know, only economic costs were counted in these private analyses; it is still rare to see estimates of the large direct disutility of lockdown.

EpiFor funding and medium-range models

(Conflict of interest: obvious.)

Epidemic Forecasting was initially funded by Tim Telleen-Lawton, which made a lot of the work we did possible.

Epidemic Forecasting could have been significantly more impactful if someone had given us between $100k and $500k in June 2020, on the basis that we would do interesting things, even though it was hard to explain and justify from an ITN perspective in advance.

Around June 2020, we had all the ingredients needed to deploy reasonable models forecasting the arrival of the second wave: the best NPI estimates, the best seasonality estimates, forecasters to predict the triggers that lead to the adoption of NPIs, a framework to put it all together, and a team of modellers and developers able to turn it into a model. This would likely have been the best medium-range forecast dashboard for Western countries, providing warning of future waves roughly 3 months in advance. We even secured funding from a large non-EA funder. (!)
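A minimal sketch of how such ingredients could be combined into a medium-range R_t path. The functional form and all numbers are illustrative assumptions, not EpiFor’s actual model:

```python
import numpy as np

# Combine baseline transmission, a seasonal multiplier, and NPI effects
# that switch on when forecasters predict governments will trigger them.

R0 = 3.0
npi_effects = {"mask mandate": 0.90, "close bars": 0.95}  # multiplicative reductions in R (placeholders)

def seasonality(day_of_year, amplitude=0.2):
    # Placeholder form: transmission peaks in mid-January in the northern hemisphere.
    return 1 + amplitude * np.cos(2 * np.pi * (day_of_year - 15) / 365)

def forecasted_npis(day_of_year):
    # Stand-in for forecaster-predicted policy triggers:
    # assume mandates return once autumn transmission picks up.
    return ["mask mandate", "close bars"] if day_of_year > 280 else []

for day in range(180, 361, 30):  # late June to mid-December, monthly steps
    rt = R0 * seasonality(day)
    for npi in forecasted_npis(day):
        rt *= npi_effects[npi]
    print(f"day {day}: forecast R_t ~ {rt:.2f}")
```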

Both our host university and this funder displayed a great deal of inflexibility. My institution, Oxford, was unable to accept the funding in a way which would allow us to contract superforecasters in time. Further, when we tried to accept the money via a charity instead, we found that the funder was itself unable to fund non-vetted charities—and that their vetting process would be too long and costly for this “small” a donation ($100k).

No EA funders were willing to step in (despite the project having arguably better impact than any of the COVID projects funded by OpenPhil, and this impact being well-evidenced in the policy effects of the first-wave dashboard). My personal update from this is roughly “while the EA funding infrastructure is much less inadequate than the rest of the world, it is still very far from a situation where I can model it as sensible and reliable”.

IHME forecasts

Taking data-driven charity as the ingroup (and so part of how “we” failed), I note that something was up at the Gates Foundation. The GF has committed at least $400 million to the Institute for Health Metrics and Evaluation, an organisation with a notoriously poor record on COVID.

So what? Well, a rigorous donor with strong vetting empowered a team with apparently deep epistemic problems, and this likely seriously misled policy. (The IHME forecasts were among the most influential in the first wave, second only to Imperial College London.) And this failure occurred despite GF’s empirical, institutional, non-hits-based approach. A systematic review of such failures could produce a useful prior for this “safe” approach.

(I should say that I admire the IHME’s Global Burden of Disease project, and do not mean to impugn the whole organisation. The above is troubling precisely because it drastically underperforms high expectations from IHME, based on their existing track record.)

Ask me anything

A related ‘ask me anything’ session is happening on Thu 24th under the What we tried post.

While this text is written from my (Jan Kulveit’s) personal perspective, I co-wrote the text with Gavin Leech, with input from many others.