Reality is often underpowered

Introduction

When I worked as a doctor, we had a lecture by a paediatric haematologist on a condition called Acute Lymphoblastic Leukaemia. I remember being impressed that very large proportions of patients were being offered trials randomising them between different treatment regimens, currently in clinical equipoise, to establish which had the edge. At the time, one of the areas of interest was, given the disease tended to have a good prognosis, whether one could reduce treatment intensity to reduce the long-term side-effects of the treatment whilst not adversely affecting survival.

On a later rotation I worked in adult medicine, and one of the patients admitted to my team had an extremely rare cancer,[1] with a (recognised) incidence of a handful of cases worldwide per year. It happened that the world authority on this condition worked as a professor of medicine in London, and she came down to see them. She explained to me that treatment for this disease was almost entirely based on first principles, informed by a smattering of case reports. The disease unfortunately had a bleak prognosis, although she was uncertain whether this was because it was an aggressive cancer to which current medical science has no answer, or whether there was an effective treatment out there if only it could be found.

I aver that many problems EA concerns itself with are closer to the second story than the first: that in many cases, sufficient data is not only absent in practice but impossible to obtain in principle. Reality is often underpowered for us to wring from it the answers we desire.

Big units of analysis, small samples

The main driver of this problem for 'EA topics' is that the outcomes of interest have units of analysis for which the whole population (let alone any sample from it) is small-n: e.g. outcomes at the level of a whole company, a whole state, or whole populations. For these big-unit-of-analysis/small-sample problems, RCTs face formidable in-principle challenges:

  1. Even if by magic you could get (e.g.) all countries on earth to agree to randomly allocate themselves to policy X or Y, this is merely a sample size of ~200. If you're looking at companies relevant to cage-free campaigns, or administrative regions within a given state, this can easily fall another order of magnitude.

  2. These units of analysis tend to be highly heterogeneous, almost certainly in ways that affect the outcome of interest. Although the key 'selling point' of the RCT is that it implicitly controls for all confounders (even ones you don't know about), this statistical control is a (convex) function of sample size, and isn't hugely impressive at ~100 per arm: it is well within the realms of possibility for the randomisation to happen to give arms with an unbalanced allocation of any given confounding factor (the toy simulation after this list illustrates the point).

  3. 'Roughly' (in expectation) balanced intervention arms are unlikely to be good enough in cases where the intervention is expected to have much less effect on the outcome than other factors (e.g. wealth, education, size, whatever); thus an effect size that favours one arm or the other can alternatively be attributed to one of these.

  4. Supplementing this raw randomisation by explicitly controlling for confounders you suspect (cf. block randomisation, propensity matching, etc.) has limited value when we don't know all the factors which plausibly 'swamp' the likely intervention effect (i.e. we don't have a good predictive model for the outcome but-for the intervention tested). In any case, these techniques tend to trade off against the already scarce resource of sample size.
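To make points 2 and 3 concrete, here is a minimal simulation sketch (Python, assuming only numpy; every number is invented for illustration rather than taken from any real study). It checks two things at ~100 units per arm: how often a modest intervention effect is detected at all, and how often chance randomisation leaves the arms unbalanced on an unmeasured unit-level characteristic by more than the true effect itself.

    import numpy as np

    rng = np.random.default_rng(0)

    n_per_arm = 100      # ~200 units in total, e.g. every country on earth
    effect = 0.2         # hypothetical intervention effect (outcome units)
    hetero_sd = 1.0      # unit-level heterogeneity that swamps the intervention
    n_sims = 10_000

    detected = 0
    imbalanced = 0
    for _ in range(n_sims):
        # each unit's outcome is dominated by its own (unmeasured) characteristics
        baseline = rng.normal(0.0, hetero_sd, size=2 * n_per_arm)
        arm = rng.permutation(np.r_[np.zeros(n_per_arm), np.ones(n_per_arm)])
        outcome = baseline + effect * arm + rng.normal(0.0, 1.0, size=2 * n_per_arm)

        treat, control = outcome[arm == 1], outcome[arm == 0]
        se = np.sqrt(treat.var(ddof=1) / n_per_arm + control.var(ddof=1) / n_per_arm)
        detected += abs(treat.mean() - control.mean()) / se > 1.96  # ~5% two-sided test

        # does chance imbalance on the unmeasured characteristic exceed the true effect?
        imbalance = abs(baseline[arm == 1].mean() - baseline[arm == 0].mean())
        imbalanced += imbalance > effect

    print(f"power to detect the effect: {detected / n_sims:.0%}")
    print(f"runs where chance imbalance exceeds the effect: {imbalanced / n_sims:.0%}")

The set-up is deliberately generous to the RCT (a clean two-arm comparison, no attrition, no measurement error); the trouble comes purely from the small number of large, heterogeneous units.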

These 'small sample' problems aren't peculiar to RCTs, but endemic to all other empirical approaches. The wealth of econometric and quasi-experimental methods (e.g. IVs, regression discontinuity analysis) still runs up against hard data limits, as well as against problems owed to whatever respects in which these methods fall short of the 'ideal' RCT set-up (e.g. imperfect instrumentation, omitted variable bias, nagging concerns about reverse causation). Qualitative work (case studies, etc.) has the same problems, even if other ones (e.g. selection) loom larger.

Value of information and the margin of common sense

None of this means such work has zero value: big enough effect sizes can still be reliably detected, and even underpowered studies still give us information. But we may learn very little on the margin of common sense. Suppose we are interested in 'what makes social movements succeed or fail?' and we retrospectively assess a (somehow) representative sample of social movements. It seems plausible that the big (and plausibly generalisable) hits from this investigation would prove commonsensical (e.g. "Social movements are more likely to grow if members talk to other people about the social movement"), whilst the 'new lessons' remain equivocal and uncertain.

We should expect to see this if we believe the distribution of relevant effect sizes is heavy-tailed, with most of the variance in (say) social movement success owed to a small number of factors, and the rest comprised of a large multitude of smaller effects. In such cases, modest increases in information (e.g. from small-sample data) may bring even more modest increases in either explaining the outcome or identifying what contributes to it:

[Figure: toy example described in the caption below]

Toy example, where we propose a roughly Pareto distribution of effect sizes among contributory factors. The largest factors (which nonetheless explain a minority of the variance) may prove obvious to the naked eye (blue). Adding in the accessible data may only slightly lower the detection threshold, with modest impacts on identifying further factors (green) and on overall accuracy. The great bulk of the variance remains in virtue of a large ensemble of small factors which cannot be identified (red). Note that the detection threshold tends to have diminishing returns with sample size.
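A minimal numerical sketch of the same toy model (Python, assuming numpy; the Pareto shape, the number of factors, and the two detection thresholds are all invented for illustration). Lowering the detection threshold from 'naked eye' to 'with accessible data' identifies a few more factors, but the bulk of the total effect stays in the undetected tail.

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy model: many contributory factors with roughly Pareto-distributed effect
    # sizes, so a handful of factors dominate and a long tail carries the rest.
    n_factors = 500
    effects = rng.pareto(a=2.0, size=n_factors) + 1.0  # hypothetical effect sizes

    def share_detected(threshold):
        """Factors above the detection threshold, and their share of the total effect."""
        detectable = effects[effects >= threshold]
        return len(detectable), detectable.sum() / effects.sum()

    # invented thresholds: 'naked eye' sees only the top ~1%, better data the top ~5%
    for label, q in [("naked eye", 0.99), ("with accessible data", 0.95)]:
        k, share = share_detected(np.quantile(effects, q))
        print(f"{label}: {k} factors detected, explaining {share:.0%} of the total effect")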

The scientific revolution for doing good?

The foregoing should not be read as general scepticism about using data. The triumphs of evidence-based medicine, although not unalloyed, have been substantial, and considerable gains remain on the table (e.g. leveraging routine clinical practice). The 'randomista' trend in international development is generally one to celebrate, especially as (I understand) it increasingly aims to isolate factors that have credible external validity. The people who run cluster-randomised, stepped-wedge, and other study designs with big units of analysis are not ignorant of their limitations, and can deploy these designs judiciously.

But it should temper our enthusiasm about how many insights we can glean by getting some data and doing something sciency to it.[2] The early successes of EA in global health owe a lot to this being one of the easier areas in which to get crisp, intersubjective, and legible answers from a wealth of available data. For many, if not most, other issues, data-driven demonstration of 'what really works' will never be possible.

We see that people do better than chance (or better than others) in terms of prediction and strategic judgement. Yet, at least judging by the superforecasters (this writeup by AI Impacts is an excellent overview), how they do so is much more indirectly data-driven: one may have to weigh between several facially-relevant 'base rates', adjusting these rates by factors whose coefficients may be estimated by their role in loosely analogous cases, and so forth.[3] Although this process may be informed by statistical and numerical literacy (e.g. decomposition, 'fermi-ization'), it seems to me the main action going on 'under the hood' is developing a large (and implicit, and mostly illegible) set of gestalts and impressions to determine how to 'weigh' relevant data that is nonetheless fairly remote from the question at issue.[4]
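As a stylised sketch of the kind of 'fermi-ization' gestured at above (Python; the reference classes, rates, weights, and adjustments are all invented for the example rather than taken from any real forecast):

    # Hypothetical worked example: blend several facially-relevant base rates,
    # weighted by a (subjective, largely illegible) judgement of how analogous
    # each reference class is, then adjust the odds for case-specific factors.
    base_rates = {
        "reference class A": (0.10, 0.5),  # (base rate, subjective weight)
        "reference class B": (0.30, 0.3),
        "reference class C": (0.05, 0.2),
    }
    prior = sum(rate * weight for rate, weight in base_rates.values())

    # case-specific adjustments expressed as rough likelihood ratios on the odds
    adjustments = [1.5, 0.8]  # e.g. 'stronger incentives than usual', 'less funding'
    odds = prior / (1 - prior)
    for lr in adjustments:
        odds *= lr

    forecast = odds / (1 + odds)
    print(f"blended prior: {prior:.2f}, adjusted forecast: {forecast:.2f}")

The arithmetic here is trivial; the hard and illegible part is choosing the reference classes, weights, and adjustments, which is where the 'gestalts and impressions' described above do the real work.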

Three final EA takeaways:

  1. Most who (e.g.) write up a case study or a small-sample analysis tend to be well aware of the limitations of their work. Nonetheless I think it is worth paying more attention to how these limitations bear on the overall value of information before one embarks on these pieces of work. Small nuggets of information may not be worth the time to excavate even when the audience are ideal reasoners. As they aren't, one risks them (or yourself) over-weighting their value when considering problems which demand tricky aggregation of a multitude of data sources.

  2. There can be good reasons why expert communities in some areas haven't tried to use data explicitly to answer problems in their field. In these cases, the 'calling card' of EA-style analysis (doing this anyway) can be less a disruptive breakthrough and more a stigma of intellectual naivete.

  3. In areas where 'being driven by the data' isn't a huge advantage, it can be hard to identify an 'edge' that the EA community has. There are other candidates: investigating topics neglected by existing work, better-aligned incentives, etc. We should be sceptical of stories which boil down to a generalised 'EA exceptionalism'.


  1. Its name escapes me, although arguably including it would risk deductive disclosure. To play it safe I've obfuscated some details. ↩︎

  2. And statistics and study design generally prove hard enough that experts often go wrong. Given the EA community's general lack of cultural competence in these areas, I think its (generally amateur) efforts at the same have tended to fare worse. ↩︎

  3. I take as supportive evidence the fact that a common feature among superforecasters is that they read a lot, not just in areas closely relevant to their forecasts, but more broadly across history, politics, etc. ↩︎

  4. Something analogous happens in other areas of 'expert judgement', whereby experts may not be able to explain why they made a given determination. We know that this implicit expert judgement can be outperformed by simple 'reasoned rules'. My suspicion, however, is that it still performs better than chance (or inexpert judgement) when such rules are not available. ↩︎