One of the key takeaways in the body of the text, which perhaps I should have brought out more in the summary, is that the GiveWell model is basically as reliable as highly professionalised bodies like pharma companies have figured out how to make a cost-effectiveness model. A small number of minor errors is unexceptional for a model of this complexity; even models submitted to pharma regulators, with several million dollars of development behind them, contain errors of this kind.
I would say that while the errors are uninteresting and unexceptional, the unusual model design decisions are worth commenting on. The GiveWell team are admirably transparent with their model, and anybody who wants to review it can access almost everything at the click of a button (some assumptions are gated to GiveWell staff, but these aren't central). Given this, it is remarkable that the EA community didn't manage to surface anyone who knew enough about models to flag to GiveWell that there were design optimisations to be made; the essay above is not really arcane modelling lore, but rather something anyone with a few years' experience in pharma HEOR could have told you. Is this because there are too few quantitative actors in the EA space? Is it because they don't think their contributions would be valued, so they don't speak up? Is it because criticism of GiveWell makes you unemployable in EA spaces, so it is heavily disincentivised? Etc etc. That is to say, I think asking why GiveWell missed the improvements is missing the important point, which is that everyone missed these improvements, so there are probably changes that can be made to expert knowledge synthesis in EA right across the board.
Just to add that I think outreach efforts like the Red Team contest are a really good way of doing this; I wouldn't have heard about the EA Forum had it not been for the plug Scott Alexander gave the contest on Astral Codex Ten (which I read mostly for the stuff on prediction markets).
Ah sorry, I think I might have confused the issue a bit with my footnote. I think I’ve managed to conflate two issues in your mind.
The first is exactly as you say: any intervention worth doing has some effects which are easy to model and some which are difficult (maybe impossible) to model. What GiveWell has done here is completely reasonable: model what it can, then make assumptions about how important the other factors, like track record, are in comparison to the main cost-effectiveness results.
The second issue is the more subtle one that I was driving at. Imagine you are going to buy a new car, and your friend (who knows about cars) says that modern cars are 10x more fuel efficient than the car you currently drive. Speaking very roughly, there are two strategies you could pick from to choose your next car:
1. Completely ignore your friend, and pick the car with the best MPG regardless of any other feature. This would be a good strategy if literally all you care about is fuel efficiency, but a bad strategy otherwise (because it is unlikely the most fuel-efficient car is also the most comfortable to drive, especially if fuel efficiency and comfort trade off against each other).
2. Treat your friend as having offered a useful rule of thumb, and so carry around an idea of what 'good' fuel efficiency looks like. This is a good strategy if cars aren't really directly comparable along a straightforward scale: a Ford F-150 isn't 'better' or 'worse' than a Prius, it is just a different kind of thing.
Both GiveWell (implicitly) and I in my fertility days (explicitly) argued that QALYs are like cars: you can end up in a situation where you are generating different kinds of QALYs, and your best bet is to compare them with a rule of thumb like GiveWell's 10x multiplier. However, I don't think GiveWell is correct to make this assumption about charities. There is in fact a single measure, like MPG, which we want to ruthlessly optimise, and therefore we do actually want to pit the F-150 and the Prius directly against each other.
However, my point in the essay is that GiveWell don't actually have to choose: they can build their model as if they are in the first world and directly compare charities against each other, and then make their final decision as though they are in the second world, where different charities offer different profiles of benefit on top of their cost-effectiveness. This is pretty much the commonsense way of choosing a car too: you would look at MPG and compare cars directly on that measure, but you might then consider other factors. It would be weird to lump all cars together in your head as 'better than 10x my previous efficiency' or 'worse than 10x my previous efficiency'.
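To make the distinction concrete, here is a minimal sketch in Python of the two decision procedures: bucketing charities by whether they clear a 10x-cash-transfers threshold, versus ranking them directly on modelled cost-effectiveness and only then layering on the harder-to-model considerations. The charity names and multipliers are invented purely for illustration and are not GiveWell's actual figures.

```python
# Illustrative sketch only: the charities and multipliers below are invented,
# not GiveWell's actual estimates.

charities = {
    # name: modelled cost-effectiveness, as a multiple of cash transfers
    "Charity A": 18.0,
    "Charity B": 11.0,
    "Charity C": 7.0,
}

THRESHOLD = 10.0  # the '10x cash transfers' rule of thumb

# Strategy 2 (rule of thumb): every charity clearing the bar looks the same.
buckets = {name: ("above 10x" if x >= THRESHOLD else "below 10x")
           for name, x in charities.items()}
print(buckets)
# {'Charity A': 'above 10x', 'Charity B': 'above 10x', 'Charity C': 'below 10x'}

# Strategy 1 (direct comparison): rank on the modelled number first,
# then weigh qualitative factors (track record, funding gaps, etc.)
# as a separate judgement call on top of the ranking.
for name, x in sorted(charities.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {x:.0f}x cash transfers")
```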