Many (many!) charities are too small to measure their own impact


Why I crossposted this: I found it to be an interesting perspective on a not-uncommon assumption within EA (that every charity should be conducting self-evaluation), written by someone with a lot of field experience in charity consulting/evaluation (who is, I think, mostly aligned with EA’s views on the importance of effectiveness).

I don’t necessarily endorse any of this, but reasons 2-4 all seem like good reasons to at least not pursue certain types of self-evaluation. Reason 1 seems like a different kind of problem, but still rings true and points to the need for funders to be strict about what sorts of research they care about, or otherwise aim at changing the incentives in place.


Most charities should not evaluate their own impact. Funders should stop asking them to evaluate themselves. For one thing, asking somebody to mark their own homework was never likely to be a good idea.

This article explains the four very good reasons why most charities should not evaluate themselves, and gives new data about how many of them are too small.

Most operational charities should not (be asked to) evaluate themselves because:

1. They have the wrong incentive.

Their incentive is (obviously!) to make themselves look as great as possible – evaluations are used to compete for funding – so they are incentivised to rig the research to make it flattering and/or to bury research that doesn’t flatter them. I say this having been a charity CEO myself and done both.

2. They lack the necessary skills in evaluation.

Most operational charities are specialists in, say, supporting victims of domestic violence or delivering first aid training or distributing cash in refugee camps. Those are completely different skills from the ones needed for causal research, and one would not expect expertise in these unrelated skills to be co-located.

3. They often lack the funding to do evaluation research properly.

One major problem is that a good experimental evaluation may involve gathering data about a control group which does not get the programme or which gets a different programme, and few operational charities have access to such a set of people.

A good guide is a mantra from evidence-based medicine, that research should “ask an important question and answer it reliably”. If there is not enough money (or sample size) to answer the question reliably, don’t try to answer it at all.

4. They’re too small.

Specifically, their programmes are too small: they do not have a large enough sample for evaluations of just their programmes to produce statistically meaningful results, i.e., to distinguish the effects of the programme from those of other factors or random chance. As a result, the results of self-evaluations by operational charities are quite likely to be simply wrong. For example, when the Institute for Fiscal Studies did a rigorous study of the effects of breakfast clubs, it needed 106 schools in the sample: far more than most operational charities providing breakfast clubs have.
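To make the sample-size point concrete, here is a minimal illustrative power calculation. The reoffending rates, significance level and power below are assumptions chosen purely for illustration, not figures from this article or from any of the studies mentioned; the sketch simply shows how quickly the required numbers grow when the expected effect is modest.

```python
import math
from statistics import NormalDist

def n_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate participants per arm needed to detect a change in a
    proportion from p1 to p2 with a two-sided two-proportion z-test
    (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the chosen significance level
    z_beta = z.inv_cdf(power)           # quantile corresponding to the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Example: detecting a fall in 12-month reoffending from 30% to 25%
print(n_per_group(0.30, 0.25))  # roughly 1,250 participants in each group
```

A charity whose programme serves a few dozen people a year cannot get anywhere near numbers like that, which is precisely the point.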

Giving Evidence has done some proper analysis to corroborate the view that many operational charities’ programmes are too small to evaluate reliably. The UK Ministry of Justice runs a ‘Data Lab’, which any organisation running a programme to reduce reoffending can ask to evaluate that programme: the Justice Data Lab uses the MoJ’s data to compare the reoffending behaviour of participants in the programme with that of a similar (‘propensity score-matched’) set of non-participants. It’s glorious because, for one thing, it shows loads of charities’ programmes all evaluated in the same way, on the same metric (12-month reoffending rate) by the same independent researchers. It is the sole such dataset of which we are aware, anywhere in the world.
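For readers who have not met the method, the sketch below shows the bare idea behind propensity score matching. Everything in it is invented for illustration – the column names, the covariates, the logistic model – and the Justice Data Lab’s actual methodology is considerably more careful than this.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def matched_reoffending_gap(df: pd.DataFrame, covariates: list[str]) -> float:
    """Difference in 12-month reoffending rates between programme participants
    and propensity-score-matched non-participants.

    Assumes (hypothetical) columns: 'participant' (1/0), 'reoffended' (1/0),
    plus whatever covariates are used for matching."""
    # Step 1: model each person's probability of being a participant.
    model = LogisticRegression(max_iter=1000)
    model.fit(df[covariates], df["participant"])
    df = df.assign(pscore=model.predict_proba(df[covariates])[:, 1])

    treated = df[df["participant"] == 1]
    control = df[df["participant"] == 0]

    # Step 2: match each participant to the non-participant with the closest
    # propensity score (nearest neighbour, with replacement).
    distances = np.abs(treated["pscore"].to_numpy()[:, None]
                       - control["pscore"].to_numpy()[None, :])
    matched_control = control.iloc[distances.argmin(axis=1)]

    # Step 3: compare outcomes between participants and their matched comparators.
    return treated["reoffended"].mean() - matched_control["reoffended"].mean()
```

The matching step is what makes the comparison fairer: the comparison group is chosen to look, on observed characteristics, like the people who actually went through the programme, so a difference in reoffending is more plausibly down to the programme itself.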

In the most recent data (all its analyses up to October 2020), the Justice Data Lab had analysed 104 programmes run by charities (‘the voluntary and community sector’), of which fully 62 proved too small to produce conclusive results. In other words, 60% of the charity-run programmes were too small to evaluate reliably.

The analyses also show the case for reliable evaluation, rather than just guessing which charity-run programmes work or assuming that they all do:

a. Some charity-run programmes create harm: they increase reoffending, and

b. Charity-run programmes vary massively in how effective they are.

Hence most charities should not be PRODUCERS of research. But they should be USERS of rigorous, independent research – about where the problems are, why, what works to solve them, and who is doing what about them.