# [Question] Sample size and clustering advice needed

[July 30 update]: We have an answer regarding sample size if the intra-cluster correlation coefficient is assumed to be zero. This sample size calculator can be used.

EA Cameroon needs advice on sample size and cluster inclusion for a statistical impact evaluation of their COVID-19 project. The project should ideally start toward the end of the week.

Data should be gathered before and after the main part of the project (after one month).

The idea is to count how many persons out of a certain number wear a face covering, and to record how long this counting took. This information can be used as a proxy for preventive measures and social distancing.

I would like to ask about the sample size and the inclusion of clusters. There are 180,000 persons in the campaign area, spread across 6 villages/parts. Volunteers would prefer not to travel to all 6 campaign villages, and even less so to an equal number of non-campaign villages, as the non-intervention communities are distant.

Different languages are spoken in the 6 parts, but the campaigning will include all of these languages. Otherwise, the parts are similar. Since little information is currently broadcast, the campaign may increase the share of persons wearing a face covering from 50% to at least 60% (or by an equivalent relative increase of 20% from another baseline). Can only, e.g., 3+3 villages be included? Or 6 intervention + 3 non-intervention? How important, in terms of statistical power, is it to include all clusters and an equal number of non-intervention clusters? How many persons should be observed at each place?

I will appreciate any replies.

• Hey, thank you for the work you are doing! Here are my thoughts (I’m an economist at IDinsight and work on this type of research):

• If you want to understand the impact of your program, I don’t recommend doing an RCT at this stage. This seems like a very small pilot and you won’t have enough power / sample size to detect an effect (see below for more). You should only consider running an RCT if and when you plan to scale this up later to a sufficient scale.

• Instead, what I advise is trying to understand and improve your impact by doing a small-sample survey plus qualitative research. E.g., when you go to a village, talk to locals to understand their current knowledge, attitudes, and behavior around COVID (what knowledge they lack, what attitudes need to change, what rumors are around, etc.) so you can better design your messages. Ideally capture a good representation of different types of people in the community, not just leaders but also relatively marginalized groups; you could do rigorous sampling, but I’m not sure that’s realistic or worthwhile at this stage given the trouble it involves. Also ask them what kind of information campaign would engage them, and after you run your program ask how they felt: whether they liked it, whether they found it useful, what they learned, what they’d do differently, etc. You can also contact them some time later to see if they observe any behavioral change among people in the community (better than asking what they themselves do, due to social desirability bias).

More technical details:

Since you’re doing a clustered RCT (treatment is at the village level and the outcomes of people within a village are likely positively correlated), you’ll need a larger sample size than if you were doing an individual-level RCT (for the math, see section 4.2 of this, generally a great resource for RCT design). You can do a power calculation for a clustered randomized controlled trial, e.g. using Stata’s “power twomeans” command. One parameter that’s missing is the intraclass correlation (the correlation among individuals within a treatment unit). However, since your number of clusters is SO small (3 and 3), when I try to do this calculation in Stata with any reasonable assumption, Stata says you cannot have enough power (assuming you want all the standard values: 80% power, 5% significance level, etc.). That’s why I recommend not doing an RCT unless you have a program at scale.
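To make the clustering penalty concrete, here is a rough sketch of the textbook calculation: the per-arm sample size for comparing two proportions, inflated by the design effect 1 + (m − 1)ρ. It is not the Stata calculation mentioned above. The 50% and 60% shares come from the question; the 200 people counted per village and the ICC values are purely illustrative assumptions, and the function names are made up for this example.

```python
import math
from statistics import NormalDist


def n_per_arm_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Individuals per arm for a two-sided test of two proportions,
    normal approximation, ignoring clustering."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2


def design_effect(m, icc):
    """Variance inflation from observing m people per cluster
    when outcomes within a cluster have intraclass correlation icc."""
    return 1 + (m - 1) * icc


def clusters_per_arm(p1, p2, m, icc, alpha=0.05, power=0.80):
    """Clusters needed per arm when m people are observed in each cluster."""
    n_independent = n_per_arm_two_proportions(p1, p2, alpha, power)
    return math.ceil(n_independent * design_effect(m, icc) / m)


# Assumed inputs: 50% -> 60% (from the question), 200 people counted per village.
print("individuals per arm, no clustering:",
      math.ceil(n_per_arm_two_proportions(0.5, 0.6)))
for icc in (0.0, 0.01, 0.05):  # illustrative ICC values, not estimates
    print(f"ICC={icc}: clusters per arm =",
          clusters_per_arm(0.5, 0.6, m=200, icc=icc))
```

Under these assumptions the sketch gives about 385 people per arm without clustering, and 2, 6, and 22 clusters per arm for ICCs of 0, 0.01, and 0.05. The normal approximation is optimistic when there are only a handful of clusters, so real requirements would be somewhat higher.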

• Hello Sindy,

Thank you so much. This answers my question. Yes, there will be a before-and-after qualitative survey asking about own and others’ behavior, which may need to be shortened in order to speak with more different groups. Then, the face covering data can be used to complement the survey information.

• If you don’t already have it, I would strongly recommend getting a copy of Gerber & Green’s Field Experiments. I would also very strongly recommend that you (or EA Cameroon) engage an experimental methodology expert for this project, rather than pose the question on the forum (I am not such an expert).

It is very difficult to address all of these questions in a broad way, since the answers depend on:

• The smallest effect size you would hope to observe

• The population within each cluster

• The total population

I’m a little confused about the setup. You say that there are 6 groups, so how would it be possible to have “6 intervention + 3 non-intervention”? Sorry if I’m misunderstanding.

In general, and particularly in this context, it makes sense to split your clusters evenly between treatment and control. For a fixed total number of clusters, this is the setup that minimizes the standard error of the difference between groups. When that standard error is larger, small effects are harder to detect. The smaller the number of clusters in your control group, for example, the larger the effect would have to be in order to make a statistically defensible claim.
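A quick way to see the allocation point is to compare the variance of the difference between arm means, which is proportional to 1/k_treat + 1/k_control if we assume equal-sized clusters and the same variance of cluster means in both arms (both are simplifying assumptions, and the function name is just for illustration):

```python
def relative_variance(k_treat, k_control):
    """Variance of the difference in arm means, expressed in units of
    the variance of a single cluster mean (assumed equal across arms)."""
    return 1 / k_treat + 1 / k_control


# Allocations discussed in the thread, plus an uneven split of 12 clusters.
for k_t, k_c in [(3, 3), (6, 3), (6, 6), (9, 3)]:
    se = relative_variance(k_t, k_c) ** 0.5
    print(f"{k_t}+{k_c}: relative SE of the difference = {se:.3f}")
```

With 12 clusters in total, 6+6 gives a smaller standard error than 9+3. Note that 6+3 still beats 3+3, simply because it uses more clusters overall; the even split is best only for a given total.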

With such a small number of clusters, effect sizes would have to be very large in order to be statistically distinguishable from zero. If indeed 50% of the population in these groups is already masked, 6 clusters may not be enough to see an effect.

Can we get some clarification on some of your questions? Particularly:

“How important, in terms of statistical power is to include all clusters”

If you have only 6 to choose from, then the answer is “very important.” But I’m not sure this is the sense in which you mean this.

“How many persons should be observed at each place?”

My inclination here is to say “as many as possible.” But this is constrained by your resources and your method of observation. Can you say more about the data collection plan?

• Thank you. I was not able to get (a pdf of) Field Experiments, but downloaded “Field Experimental Designs for the Study of Media Effects,” also co-authored by Green. They point to “robust cluster standard errors” for estimating the “individual-level average treatment effect” (172).

• The smallest effect size you would hope to observe

• 20%. From 5/10 to 6/10 or an equivalent % increase

• Researchers will be in all of the campaign clusters and some of the non-campaign ones. They can count whether, e.g., a few hundred individuals wear a face covering

• The population within each cluster

• Different, with an average of 180,000/6 = 30,000.

• The total population

• Since we are just looking to estimate the impact of the 180,000-person campaign and not to generalize it, this should be 180,000 × 2 (180,000 participating and an equal number of non-participants who are the nearest geographically and in characteristics).

• Probit, logit or simple linear regression, but open to suggestions

I meant 6 groups in the intervention area, and some number of groups (e.g. 3 or 6) in the non-intervention area.

OK. So 3 intervention clusters and 3 non-intervention clusters are better than 6 intervention clusters and 3 non-intervention clusters, but 6+6 may be necessary? Would the answer depend on the intra-cluster correlation coefficient (ρ)? Perhaps the texts that generally talk about clustering assume relatively large between-cluster variability and low within-cluster variability (so a high ρ). However, in this study, how people respond to the messaging may not depend much on their ‘cluster assignment,’ but much more on their individual characteristics, which, on average, may be comparable across the clusters and the studied population.
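On whether the answer depends on ρ: a back-of-the-envelope way to check is the effective sample size, k·m / (1 + (m − 1)ρ), i.e. how many independent observations k clusters of m people are worth. The sketch below assumes 200 people counted per village and a few illustrative ρ values (neither is from the project); for comparison, a 50% versus 60% difference needs roughly 385 independent observations per arm at 80% power and 5% significance under the usual normal approximation.

```python
def effective_n_per_arm(k, m, icc):
    """Independent-observation equivalent of k clusters with m people each,
    given intraclass correlation icc."""
    return k * m / (1 + (m - 1) * icc)


M_PER_VILLAGE = 200  # assumed number of people counted per village

for icc in (0.0, 0.01, 0.05):  # illustrative values of rho
    for k in (3, 6):
        n_eff = effective_n_per_arm(k, M_PER_VILLAGE, icc)
        print(f"rho={icc}, {k} clusters per arm: effective n = {n_eff:.0f}")
```

Under these assumptions, 3 clusters per arm are worth about 600 independent observations when ρ = 0 but only about 55 when ρ = 0.05, and counting more people per village cannot push the figure above k/ρ. So even a small ρ matters a great deal when there are only a few clusters.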

I should ask EA Cameroon about the possibility of different average responses in different villages.

Do you know of any online sample size calculator that includes clusters?

• I refer you to Sindy’s comment (she is actually an expert), but I want to note and verify that it sounds as if you may not actually be thinking of collecting individual-level data, and that you’re thinking of making observations at the village level (e.g. what % of people in this village wear masks?). So it’s not just the case that you wouldn’t have enough clusters to make a statistical claim, but you may actually be talking about doing an experiment in which the units are villages… so n = 6 to 12. Then of course you’d have considerable error in the village-level estimate, and uncertainty about the representativeness of the sample within each village. I agree with Sindy that you probably don’t want an RCT here.

• OK, thank you.