Make RCTs cheaper: smaller treatment, bigger control groups
Epistemic status: I think this is a statistical “fact” but I feel a bit cautious since so few people seem to take advantage of it
Summary
It is not always optimal, for cost or statistical power, to have equal-sized treatment and control groups in a study. When your intervention is expensive relative to data collection, you can maximise statistical power or save costs by using a larger control group and a smaller treatment group. The optimal ratio of control sample to treatment sample is just the square root of the cost per treatment participant divided by the square root of the cost per control participant.
Why larger control groups seem better
Studies generally have equal numbers of treatment and control participants. This makes intuitive sense: a study with 500 treatment and 500 control will be more powerful than one with 499 treatment and 501 control, for example. This is because of the diminishing power returns to sample size: the person removed from one arm hurts your power more than the person added to the other arm helps it.
But what if your intervention is expensive relative to data collection? Perhaps you are studying a $720 cash transfer and it costs $80 to complete each survey, for a total cost of $800 per treatment participant ($720 + $80) and $80 per control. Now, for the same cost as 500 treatment and 500 control, you could have 499 treatment and 510 control, or 450 treatment and 1,000 control: up to a point, the loss in precision from the smaller treatment group is more than offset by the ten control participants you can add for each treatment participant you drop, resulting in a more powerful study overall. In other words: when your treatment is expensive, it is generally more powerful to have a larger control group, because it is just so much cheaper to add control participants.
How much larger? The exact ratio of treatment:control that optimises statistical power is surprisingly simple: it’s just the ratio of the square roots of the costs of adding to each arm, i.e. sqrt(control_cost) : sqrt(treatment_cost) (see Appendix for justification). For example, if adding an extra treatment participant costs 16x more than adding a control participant, you should optimally have sqrt(16/1) = 4x as many control as treatment participants.
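To make this concrete, here is a minimal Python sketch (my own illustration, not code from the post; the function name is made up) that turns per-participant costs and a total budget into an allocation using this square-root rule:

```python
import math

def optimal_allocation(budget, treatment_cost, control_cost):
    """Split a budget across arms using the square-root-of-costs rule:
    n_control / n_treatment = sqrt(treatment_cost / control_cost)."""
    ratio = math.sqrt(treatment_cost / control_cost)   # control participants per treatment participant
    block_cost = treatment_cost + ratio * control_cost  # cost of 1 treatment + `ratio` control participants
    n_treatment = budget / block_cost
    n_control = ratio * n_treatment
    return round(n_treatment), round(n_control), ratio

# Hypothetical cash-transfer example from above: $800 per treatment participant,
# $80 per control, and the $440,000 it would cost to run 500 + 500.
print(optimal_allocation(440_000, 800, 80))  # roughly (418, 1321, 3.16)
```

The rounding in the last step is just for readability; in practice you would round to whole participants (or whole clusters) and re-check the power calculation.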
Quantifying the benefits
With this approach, you either get extra power for the same money or save money without losing power. For example, let’s look at the hypothetical cash transfer study above, with treatment participants costing $800 and control participants $80. The optimal ratio of control to treatment is then sqrt(800/80) = 3.2:1, resulting in either:
Saving money without losing power: with 500 treatment and 500 control, the study is powered to detect an effect of 0.175 SD and costs $440,000. With a 3.2:1 ratio (*types furiously in Stata*) you could achieve the same power with a sample of 337 treatment and 1,079 control, which would cost $356,000: saving you a cool $84k without any loss of statistical power.
Getting extra power for the same budget: alternatively, if you still want to spend the full $440k, you could afford 416 treatment and 1,331 control, cutting your minimum detectable effect from 0.175 SD to 0.155 SD at no extra cost (a rough check of these numbers follows below).
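For anyone who wants to sanity-check those figures outside Stata, here is a rough reproduction in Python using statsmodels (my own sketch, assuming a two-sided test at alpha = 0.05 with 80% power; small differences from the sampsi output are expected because the post rounds the ratio to 3.2):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

def mde(n_treatment, n_control):
    """Minimum detectable effect (in SD units) for the given arm sizes."""
    return analysis.solve_power(nobs1=n_treatment, ratio=n_control / n_treatment,
                                alpha=0.05, power=0.8)

def cost(n_treatment, n_control, treatment_cost=800, control_cost=80):
    return n_treatment * treatment_cost + n_control * control_cost

print(mde(500, 500), cost(500, 500))    # ~0.177 SD for $440,000
print(mde(337, 1079), cost(337, 1079))  # ~0.175 SD for $355,920: same power, ~$84k cheaper
print(mde(416, 1331), cost(416, 1331))  # ~0.157 SD for $439,280: more power, same budget
```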
Caveats
Ethics: there may be ethical reasons for not wanting a larger control group, for example in a medical trial where you would be denying potentially life-saving treatments to sick patients. Even outside of medicine, control participants’ time is important and you may wish to avoid “wasting” it on participating in your study (although you could use some of the savings to compensate control participants, if that won’t mess with your study).
Necessarily limited samples: obviously if there is a practical limit to increasing your control group size, such as only being able to operate in a limited geography, this may not be an option.
Natural skepticism? This isn’t a common technique, so you might simply trust that the market for ideas is efficient and conclude that, if this really worked, you would have heard about it from somewhere else by now. It kind of blows my mind that this isn’t done more often, which makes me both want to tell people about it and feel skeptical. We used this approach for a pretty large RCT I worked on in Tanzania, and no one complained.
Conclusion
If your treatment is quite expensive relative to data collection costs, consider using a larger control group, with a control:treatment ratio of sqrt(treatment_cost/control_cost), and enjoy the spare money or additional statistical power.
Appendix
I am not claiming to have discovered this myself. I first read this equation in Running Randomized Evaluations and was able to derive the same result myself here.
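For readers who don’t want to click through, here is a sketch of that derivation (the standard constrained-minimisation argument, in my notation): choose arm sizes to minimise the variance of the estimated effect for a fixed budget.

```latex
% Minimise the variance of the difference in means for a fixed budget B,
% with per-participant costs c_t (treatment) and c_c (control):
\min_{n_t,\, n_c} \; \frac{\sigma^2}{n_t} + \frac{\sigma^2}{n_c}
\qquad \text{s.t.} \qquad c_t n_t + c_c n_c = B
% The first-order conditions of the Lagrangian are
%   \sigma^2 / n_t^2 = \lambda c_t   and   \sigma^2 / n_c^2 = \lambda c_c,
% and dividing one by the other gives
%   n_c / n_t = \sqrt{c_t / c_c},
% i.e. the square-root-of-costs rule in the post.
```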
I believe this holds for cluster RCTs; just remember that the increased control sample here would come in the form of additional control clusters, rather than larger clusters.
If you are doing power calculations in Stata and want to factor in different treatment/control group sizes, you just add ratio(X) to the sampsi command, where “X” is the treatment/control ratio. For a cluster RCT using clustersampsi you… need to do something involving harmonic means, I forget exactly, but poke me on the Forum and I’ll happily dig through some old code.
The idea makes a lot of sense, but my guess is that the circumstance where the cost is driven by the intervention itself isn’t that common: In the context of charities, we’re thinking about applying RCTs to test whether an intervention works. Generally the intervention is happening anyway. The cost of RCTs then doesn’t come from applying the intervention to the treatment group—it comes from establishing the experimental conditions where you have a randomised group of participants and the ability to collect data on them.
Hey Aidan—that’s a good point. I think it will probably apply to different extents for different cases, but probably not to all cases. Some scenarios I can imagine:
1) A charity uses its own funds to run an RCT of a program it already runs at scale:
In this case, you are right that treatment is happening “anyway” and in a sense the $ saved in having a smaller treatment group will just end up being spent on more “treatment”, just not in the RCT.
Even in this case I think the charity would prefer to fund its intervention in a non-RCT context: providing an intervention in an RCT context is inherently costlier than doing it under more normal circumstances. For example, if you are delivering assets, your trucks have to drive past control villages to get to treatment ones, increasing delivery costs.
That’s pretty small, though; I agree that otherwise the intervention is basically “already happening” and the effective savings are smaller than implied in my post
That said, if the charity has good reason to think their intervention works and so spending more on treatment is “good”, the value of the RCT in the first place seems lower to me
2) A charity uses its own funds to run an RCT of a trial program it doesn’t operate at scale:
In this case, the charity is running the RCT because it isn’t sure the intervention is a good one
Reducing the RCT treatment group frees up funds for the charity to spend on the programs that it does know work, with overall higher EV
3) A donor wants to fund RCTs to generate more evidence:
The donor is funding the RCT because they aren’t sure the intervention works
Keeping RCT costs lower means they can fund more RCTs, or more proven interventions
4) A charity applies for donor funds for an RCT of a new program:
In this case, the cheaper study is more likely to get funded, so the larger control/smaller treatment is a better option for the charity
Overall, I think cases 2/3/4 benefit from the cheaper study. Scenario 1 seems more like what you have in mind and is a good point; I just think there will be enough scenarios where the cheaper trial is useful, and in those cases the charity might consider this treatment/control optimisation.
Thanks Rory - I think your general idea is good, and in some cases could be a good option!
I could be wrong, but from my experience working in the development world these 4 scenarios aren’t really how RCTs generally happen. Usually there will be a partnership with an RCT-running NGO (like IPA) or a university department (J-PAL at MIT) where the partner organisation pays for and organises everything.
Sometimes scenario 4 could happen as part of a grant application
This doesn’t change the existence of a budget constraint, though. The partner organization, especially a grant funder like JPAL/IPA, will grant you a certain amount of their resources to use. I don’t see why you wouldn’t want to optimize the use of their resources.
100%, the original post stands: in any scenario we would want to optimise the use of resources. I don’t think JPAL/IPA is generally a funder though; they do the research themselves, so they are the ones to convince ;).
Ah, that’s helpful data. My experience in RCTs mostly comes from One Acre Fund, where we ran lots of RCTs internally on experimental programs, or just A/B tests, but that might not be very typical!
Would be super interested to see the results of some of these RCTs / AB tests. Were any of them published apart from the Lime SMS study? We’re looking for great examples of learning orgs that do this and some studies from 1AF would be a great motivator/example.
Great suggestion, particularly as you say for trials with a super expensive treatment relative to control.
In defense of current practice, I’d like to add that a major difficulty when running medical trials for new therapeutics is simply recruiting patients to the trial. Many patients enroll in the trial with the aim of getting the experimental treatment, so it’s a lot easier to recruit people when your trial has a 50% or 75% chance of assignment to the therapeutic arm.
Some other important strategies that are hot right now:
Platform trials: One giant trial that has one control arm and maybe three to four treatment arms. Hard to do as it requires a lot of people to work together but amazing when you pull them off (e.g. we did many of these for COVID)
Use of historical or shared control data: Why recruit as many controls if you can integrate existing data in a statistically principled, unbiased way (easier said than done of course).
This is a really helpful post—thank you! It does blow my mind slightly that this isn’t more broadly practiced, if the argument holds, but I think it holds!
I don’t know enough about the market for academic papers, but I wonder if you’d be interested in writing this up for a more academic audience? You could look at some set of recent RCTs and estimate the potential savings (or, more ambitiously, the increase in power and associated improvement in detecting results)
Given that the argument is statistical rather than practical in any way that is specific to economics or development, do you know if this happens in biomedicine? Many trials often involve pitting newer, more expensive interventions against an existing standard of care.
Thanks Chris, that’s a cool idea. I will give it a go (in a few days, I have an EAG to recover from...)
One thing I should note is that other comments on this post are suggesting this is well known and applied, which doesn’t knock the idea but would reduce the value of doing more promotion. Conversely, my super quick, low-N look into cash RCTs (in my reply below to David Reinstein) suggests it is not so common. Since the approach you suggest would partly involve listing a bunch of RCTs and their treatment/control sizes (so we can see whether they are cost-optimised), it could also serve as a nice check of just how often this adjustment is/isn’t applied in RCTs
For bio, that’s way outside of my field, so I defer to Joshua’s comment here on limited participant numbers, which makes sense. Though in a situation like early COVID vaccine trials, where perhaps you had limited treatment doses and potentially lots of willing volunteers, perhaps it would be more applicable? I guess pharma companies are heavily incentivised to optimise trial costs though; if they don’t do it, there’ll be a reason!
Often recruiting is the bottleneck in biomedicine so you want to maximise the power for a given number of participants
You’re completely correct! However, it’s worth noting this is standard practice (when the treatment makes up most of the cost, which it usually doesn’t). Most statisticians will be able to tell you about this.
So I think I have two comments:
It’s actually pretty neat you figured this out by yourself, and shows you have a decent intuition for the subject.
However, if you’re a researcher at any kind of research institution, and you run or design RCTs, this suggests an organizational problem. You’re reinventing the wheel, and need to consult with a statistician. It’s very, very difficult to do good research without a statistician, no matter how clever you are. (If you’d like, I’m happy to help if you send me a DM.)
Actually, maybe I should clarify this. This is standard practice when you hire a decent statistician. We’ve known this since like… the 1940s, maybe?
But a lot of organizations and clinical trials don’t do this because they don’t consult with a statistician early enough. I’ve had people come to me and say “hey, here’s a pile of data, can you calculate a p-value?” too many times to count. Yes, I calculated a p-value, it’s like 0.06, and if you’d come to me at the start of the experiment we could’ve avoided the million-dollar boondoggle that you just created.
I assumed more people were aware of this. I’m using it in a trial we’re about to start. But as others have said, in many trials the treatment is not particularly more costly. It is probably a factor for costly interventions in poverty and health in poor countries, though. Have you looked into how many studies in development economics and GH&D with costly interventions do this?
As a quick data point I just checked the 6 RCTs GiveDirectly list on their website. I figure cash is pretty expensive so it’s the kind of intervention where this makes sense.
It looks like most cash studies, certainly those with just one treatment arm, aren’t optimising for cost:
- Against cash: evidence from Rwanda (100 cash)
- Farming communities in Uganda
- USAID Workforce Readiness Program (762 cash, 203 cash + NGO)
- Experimental evidence from Kenya (80 short-term UBI, 71 lump sum)
Suggests either 1) there’s some value in sharing this idea more or 2) there’s a good reason these economists aren’t making this adjustment. Someone on Twitter suggested “problems caused by unbalanced samples and heteroskedasticity” but that was beyond my poor epidemiologist’s understanding and they didn’t clarify further.
The “problems caused by unbalanced samples” doesn’t seem coherent to me; I’m not sure what they are talking about.
If the underlying variance is different between the treatment and the control group:
That might justify a larger sample for the group with larger variance
But I would expect the variance to be larger for the treatment group in many/most relevant cases
Overall, there will still tend to be some efficiency advantage of having more of the less-costly group, generally the control group
Unbalanced samples are not a problem per se. You can run into a problem of representation/generalization for the smaller sample but this argument is independent of balancing and only has to do with small sample sizes.
@david_reinstein made an excellent point about heteroscedasticity / variance. To factor this into your original post: You want to optimize the cost-effectiveness of the precision of your group-level difference score. This is achieved by minimizing the standard errors (SE) of the group-level estimates of each sample, which are just the standard deviations (SD) divided by the square root of the respective observations. So your term would expand to:
Control-to-treatment ratio = sqrt(treatment_cost/control_cost) * (control_SD/treatment_SD).
The problem, in practice, is that you usually know the costs a priori but not the SDs. If variances are not equal, however, I would agree with @david_reinstein that the treatment group will more likely show greater variance on your outcome variable (if control group has more variance, I would rather reconsider the choice of the outcome variable).
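To make that adjustment concrete, here is a small Python sketch (my own illustration; the 25% larger treatment SD is a made-up number, not something from this thread):

```python
import math

def control_to_treatment_ratio(treatment_cost, control_cost,
                               treatment_sd=1.0, control_sd=1.0):
    """Cost-optimal control:treatment ratio allowing for unequal outcome SDs.
    With equal SDs this reduces to sqrt(treatment_cost / control_cost)."""
    return math.sqrt(treatment_cost / control_cost) * (control_sd / treatment_sd)

# Cash-transfer example: equal SDs recover the ~3.2:1 ratio from the post.
print(control_to_treatment_ratio(800, 80))                     # ~3.16
# If the treatment arm's outcome SD were, say, 25% larger, you would want
# relatively fewer extra controls.
print(control_to_treatment_ratio(800, 80, treatment_sd=1.25))  # ~2.53
```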
If you want to read more about the concept of precision and its relation to statistical power (also cf. the paper that @Karthik Tadepalli cited), we just put together a preprint here that is supposed to double as a teaching resource: https://doi.org/10.31234/osf.io/m8c4k (the introduction and discussion will suffice, since the middle part focusses on biological/neuroscientific measurements that have vastly different properties than, e.g., questionnaire data).
Here is the glossary that is mentioned in the paper: https://osf.io/2wjc4
And here is the associated Twitter post with a digest of the most important insights: https://twitter.com/bioDGPs_DGPA/status/1616014732254756865
Great argument. My guess for why this isn’t common based on a little experience is that the decision is usually sequential. First you calculate a sample size based on power requirements, and then you fundraise for that budget (and usually the grantmaker asks for your power calculations, so it does have to be sequential). This doesn’t inherently prevent you from factoring intervention cost into the power calculations, but it does mean the budget constraint is not salient.
I wouldn’t be too surprised ex ante if there are inefficiencies in how we do randomization. This is an area with quite active research, such as this 2022 paper which proposes a really basic shift in randomization procedures and yet shows its power benefits.
I’m confused why the process being sequential is a reason that this isn’t occurring. Suppose someone was writing an RCT grant proposal and knew in advance how expensive the treatment was compared to the control. They find the optimal ratio of treatment to control, based on the post above. Then they ask for however much money they need to get a certain amount of power (which would be less than they would have needed to ask for without doing this).
Or alternatively, run the sample size calculation as you suggest. Convert that into a $ figure, then use the information in the post above to get more power for that same amount of money and show the grant-maker the second version of one’s power calculations.
I’m surprised you retracted the comment because I agree with it and I’m not 100% sure what I meant. It is still a salience issue but I don’t think the sequential process really matters for that
To explain why I retracted: I re-read your original post and noticed that you were talking about salience, and I think you’re probably right that this isn’t a very salient aspect of the process. At first, I thought you were saying something like ‘the steps occur sequentially, so the suggestion of the post can’t be implemented’ which seems wrong. But ‘the steps occur sequentially, so it might not occur to someone to back-track in their thinking and revise the result they got in the first step afterwards’ seems probably right, although I have no idea how big of an explanation that is compared to other reasons the OP’s suggestion isn’t very common.
You seem to assume that there’s a linear relationship between the intervention and the effect. This might be the case for cash transfers but it’s not the case for many other interventions.
If you give someone half of a bednet they are not 50% as protected.
When it comes to medical treatments it might be that certain side effects only appear at a given dose and as a result you have to do your clinical trial for the dose that you actually want to put into the pill that you sell.
Hi Christian - agreed, but my argument here is really for fewer treatment participants, not smaller treatment doses.