Against cash benchmarking for global development RCTs
Should you fund an expensive program to help people or just send them cash? Using a randomised controlled trial (RCT) to directly compare international development programs to cash transfers was the Cool New Thing a few years back, with Vox calling it a “radical” idea that could be a “sea change in the way that we think about funding development”. I spent 2 years on an RCT of a program that wanted to be radical and sea-changey, so naturally we considered using a cash comparison arm.
We didn’t use one. Throughout the process my mind was changed from “cash comparison arms are awesome” to “cash comparison arms probably rarely make much sense”. This change of perspective surprised me, so I wanted to think out loud here about why this happened.
The program I was evaluating
Asset transfer programs, such as giving people goats or fertiliser, have been found to work well at reducing poverty, but they also cost a lot, like over $1,000 per household. They also often have high staff costs from training and are difficult to massively scale. We wanted to try a super cheap asset transfer program that cost about $100 per household with minimal staffing, to see if we could still achieve meaningful impacts. We designed this as an RCT with 2,000 households in rural Tanzania and we’ll hopefully have a paper out this year. We gave treatment households a bundle of goods including maize fertiliser, seed, chicks, mosquito bed nets, and a load of other things.
Why someone might suggest a cash arm
Much smarter people than me are in favour of directly comparing development programs to cash. There are arguments for it here and here. My favourite argument comes from a financial markets analogy: imagine you are considering investing in a fund. “We consistently make more money than we lose” is… good to hear. But much better would be, “We consistently beat the market”. That’s the role that a cash arm plays: rather than just check if a program is better than doing nothing at all (comparing to a control), we index it against a simple intervention that we know works well: cash. The fairest, most direct way to do this is to simply add an extra arm to your RCT, comparing treatment, control, and cash arms.
Why we wanted a cash arm at first
We really wanted to be compared to cash. Being able to say “give us $100 and we’ll do as much good as a $300 cash transfer” would be a powerful donor pitch. We were pretty confident that our program was better than cash, too. We bought our products at large scale on global markets, which meant our money went way further than our recipients’ money could, and we had access to quality products they couldn’t buy locally.
We even ran a small cash trial with 40 households. They all spent the cash well (mostly on home repairs) but no one seemed able to find investment opportunities as good as the ones in our asset bundle. When we eventually told the cash arm participants that we had given other households assets of the same value, most said they would have preferred the assets: “We don’t have good products to buy here.” We had also originally planned to work in 2 countries but ended up working in just 1, freeing up enough budget to pay for a cash arm.
How my mind changed on cash
In short: Different programs will have impacts over different horizons, so the timing of when you collect your impact measurements will heavily skew whether cash or your program looks better.
Cash impact hits roughly immediately after distribution as households start to spend it
Our program’s impact took much longer to hit:
Our program included chicks that wouldn’t lay eggs or be eaten until they were 6 months old
We also gave maize inputs that would generate income only at harvest time, 9 months after distribution
Our tree seedlings would take years to grow and produce sellable timber or firewood
Household consumption surveys only give a short snapshot of household wellbeing:
Surveys generally only cover household consumption over the past 1-4 weeks
Imagine trying to remember how much you spent on groceries, rent and electricity the week of the 2nd of January and you’ll see why these surveys can only go so far back
Survey timing would then end up as the key driver of which program looked better:
If we timed our survey to be 3 months after distribution, we would likely see pretty good cash impact and almost no program impact.
If we timed the survey to be after 9 months, immediately after harvest, we would see massive program impact, picking up the big influx of income right at harvest, while cash impact would have started to fade away.
Conducting surveys every single month is one solution, but it’s impossibly expensive for most studies. So when is the right time to survey? There probably wasn’t one for our study, not one that made sense for both arms, and there likely wouldn’t be for many other interventions either. We risked massively overstating or understating the impact of our program relative to cash based on when we happened to time our surveys, rather than any true difference in the benefits of either intervention. To continue the financial analogy: it’s like comparing an investment portfolio to the performance of the market, but using data from only one specific day. Eventually I concluded that we simply couldn’t trust any final numbers.
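To make the timing problem concrete, here is a toy simulation. All the numbers are made up for illustration (they are not from our study): cash uplift is assumed to be front-loaded and fading, while the asset bundle’s uplift is assumed to be concentrated around the month-9 harvest, matching the timelines described above.

```python
# Toy model: monthly consumption uplift per household, in dollars, for a
# cash transfer vs. an asset bundle. Illustrative numbers only.
# Cash is spent quickly, so its measurable uplift fades; the bundle's
# uplift arrives late (chicks from month 6, harvest at month 9).
cash_uplift = [30, 25, 15, 10, 5, 3, 2, 1, 1, 0, 0, 0]    # months 1-12
asset_uplift = [0, 0, 0, 0, 0, 5, 8, 8, 60, 20, 10, 10]

def survey_snapshot(month):
    """A consumption survey only sees the uplift inside its short recall
    window -- here, just the single month before the interview."""
    i = month - 1
    return cash_uplift[i], asset_uplift[i]

for m in (3, 9):
    cash, asset = survey_snapshot(m)
    print(f"Survey at month {m}: cash arm +${cash}, asset arm +${asset}")
# Survey at month 3: cash arm +$15, asset arm +$0
# Survey at month 9: cash arm +$1, asset arm +$60
```

Under these assumed profiles, a month-3 survey makes cash look dominant and a month-9 survey makes the program look dominant, even though the cumulative uplifts are broadly comparable. The ranking is an artefact of survey timing, not of either intervention’s true value.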
I am most familiar with our own program but I expect this applies to many other international development programs too: your medicine/training/infrastructure/etc program will very likely deliver benefits over a different timeline from cash, making a direct RCT comparison depend more on survey timing than on intervention efficacy.
So is there any case for cash benchmarking?
Honestly, I’m not sure there is, at least not in an RCT arm, but I’m open to being wrong. In the Vox article at the start of this post, the cash-benchmarked nutrition program failed to improve nutrition at all (cash failed too). You shouldn’t need a cash comparison arm to conclude that that program either needs major improvements or defunding. So even in the case presented by supporters, I don’t see the value of cash benchmarking in RCTs.
I can see a case for cash benchmarking as a way of modelling program impact, rather than as an RCT arm. For example, if a non-profit spent $2,000/household delivering $300 of agricultural equipment (it happens), a pretty simple financial model in Excel should tell you the required rate of return needed on the equipment to be more efficient than just giving $2,000 cash. Then a trial or literature review should tell you if your program can meet that rate of return or not. If you can’t beat cash, then you might want to just give cash.
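That Excel-style calculation is simple enough to sketch in a few lines. This is a deliberately crude model under one big assumption (labelled in the code): that $1 of cash delivers $1 of benefit to the household. The $2,000 cost and $300 equipment value are the figures from the example above.

```python
def required_return(program_cost, asset_value, cash_benefit_per_dollar=1.0):
    """Rate of return the delivered assets must earn for the program to
    beat simply handing recipients the full program cost as cash.

    cash_benefit_per_dollar is an assumption: the welfare benefit of $1
    of cash (1.0 means recipients capture full face value)."""
    cash_benefit = program_cost * cash_benefit_per_dollar
    multiple = cash_benefit / asset_value   # benefit multiple assets must hit
    return multiple - 1                     # expressed as a rate of return

# The $2,000 program delivering $300 of agricultural equipment:
r = required_return(2000, 300)
print(f"Equipment must return {r:.0%} to match a $2,000 cash transfer")
# Equipment must return 567% to match a $2,000 cash transfer
```

If your trial or a literature review suggests agricultural equipment plausibly returns ~567% to recipients, the program might clear the bar; if not, the model says give cash, and no extra RCT arm was needed to reach that conclusion.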
GiveWell uses cash as a benchmark for health interventions in a way that seems sensible: using “moral weights” to convert lives saved into equivalent financial benefits for poor households. Because they are comparing separate studies that were designed to get the most accurate measure for their specific intervention, this conversion approach seems fair in principle, although I’m not familiar enough with the particular studies they use. They can then compare the value of doubling a person’s consumption for a year to the value of saving a life in a way that wouldn’t really make sense in a single RCT.
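The mechanics of that conversion can be sketched as follows. The moral weight below is a made-up placeholder, not GiveWell’s actual figure, and the two example programs are hypothetical; the point is only to show how a weight puts lives saved and consumption gains on one scale.

```python
# Hypothetical moral weight: 1 life saved is valued the same as doubling
# one person's consumption for 100 years. Placeholder only -- not
# GiveWell's actual number.
MORAL_WEIGHT_LIFE = 100

def cost_per_doubling_equiv(cost, lives_saved=0, doubling_years=0):
    """Cost per 'consumption-doubling-year equivalent', so a health
    program and a cash program can be compared on one scale."""
    equiv_units = lives_saved * MORAL_WEIGHT_LIFE + doubling_years
    return cost / equiv_units

# Two hypothetical programs, each spending $500,000:
health = cost_per_doubling_equiv(500_000, lives_saved=100)       # $50/unit
cash = cost_per_doubling_equiv(500_000, doubling_years=5_000)    # $100/unit
print(f"Health: ${health:.0f}/unit, Cash: ${cash:.0f}/unit")
```

Because each input estimate can come from whichever study measured that outcome best, this sidesteps the survey-timing problem that plagues a single multi-arm RCT.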
What I learned from this experience
My views on cash benchmarking changed entirely when I actually tried to write down a hypothetical results table, forcing me to clarify when our results would be collected. This has been a useful technique to take forward to future projects: I find that actually sketching out a potential table of results, ideally under a few scenarios, really sharpens my thinking.
Some reasons why I might be wrong
There might be programs that have similar impact timelines to cash, making comparisons fair
There might be some important impact measures where there is a clear optimal measurement time e.g. “Not pregnant before age 16”, “Votes in the 2025 election”
There might be use cases for cash-benchmarking RCTs other than “this program was 2x better than cash” that I haven’t considered
Cash benchmarking might make sense over very long timelines (e.g. 5-10+ years), at which point the differences in initial impact timing might be washed out
… and maybe many more. This is my first Forum post, I’m very open to feedback on what I’ve missed!