I don’t see any discussion in here of power. What kind of effect size do you expect, and given how much noise there is, will you be able to get a tight enough confidence interval on it for your study to be informative?
Very good point. As I mentioned in another comment, I think we will have strong statistics for our baseline numbers, because we will be mining over a year of Google Analytics data to generate them. So we should be able to tell whether the individual distributions deviate significantly from the baseline. The way we have things planned now, we will be handing out a large number of pamphlets on a small number of days. In the best case, we will get really large deviations from the baseline, so even if we’re not able to home in on the true mean and standard deviation for the distribution days, we’ll be confident that the pamphlets have a big effect. If we only get small deviations, we will have to make the call to continue the program based on other possible metrics (or just gut feeling); if we continue, we can keep collecting data and refine these numbers. I had considered handing out small numbers of pamphlets on more days to get a larger sample, but due to volunteer time limitations that’s just not feasible for the pilot.
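For concreteness, a minimal sketch of the deviation check described above, in Python. The file name, visit counts, and the normal approximation are all placeholders of mine, not TLYCS data or code:

```python
import numpy as np

# Placeholder daily visit counts: a year-plus of baseline days exported
# from Google Analytics, plus a handful of distribution days.
baseline_visits = np.loadtxt("baseline_daily_visits.csv")  # hypothetical export
distribution_visits = np.array([410, 388, 452])            # made-up numbers

mu = baseline_visits.mean()
sigma = baseline_visits.std(ddof=1)

# How far each distribution day sits from the baseline, in baseline SDs.
z = (distribution_visits - mu) / sigma
print(f"baseline mean {mu:.1f}, sd {sigma:.1f}")
print("distribution-day z-scores:", z)
```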
Sorry for not making this clear in my original post, but you can explicitly calculate your power to detect an effect of a given size once you know your study design and the tests you’ll be performing (see e.g. here for a basic calculator, though whether you can use it depends on your model). This is much better than trying to eyeball whether you have enough power, unless you have a very good intuition for statistical significance.
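For concreteness, here is what such a calculation might look like using statsmodels. This assumes a generic two-sample comparison; the effect size, day counts, and significance level are placeholders rather than anything from the actual plan:

```python
# Generic power calculation for a two-sample comparison. Whether a
# t-test actually matches the study design is a separate question
# for whoever does the analysis.
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower().solve_power(
    effect_size=0.5,  # standardized (Cohen's d) effect you hope to detect
    nobs1=5,          # e.g. 5 distribution days
    ratio=365 / 5,    # baseline days per distribution day
    alpha=0.05,
)
print(f"power to detect d=0.5: {power:.2f}")
```

With only a handful of distribution days, power for moderate effects will typically be low, which is exactly the kind of thing worth knowing before printing the leaflets.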
Thanks Ben. Yes, I know this can be quantified, and it’s something we may look into in the future. We decided against pursuing it right now because (1) none of us knows how to work this problem and we would have to sink some time into learning, (2) we’re pretty sure we know the gist of the answer (see my previous comment), and (3) we’re not really in a position to change our strategy based on the results anyway. I’m hoping to publish our actual data after we run the pilot, so if there are any enterprising statistical EAs out there who want to sink their teeth into it, we’d be delighted.
Well done on the work so far.
I have some experience in study design (including formal postgraduate qualifications in it), and I think it’s vital that you consult someone with statistical experience before you proceed. I can’t express it strongly enough: it’s absolutely critical. I understand you are concerned about resources, but you are prepared to print 6,500 leaflets and spend days distributing them in the hope of generating data you will then analyse. You shouldn’t proceed unless you’re also prepared to do this.
It’s not simply a matter of whether you need to tweak the intervention. You need to make sure your control is properly defined (it’s currently very vague and susceptible to data mining). You also need a clear statement of what kind of effect you are looking for, what tests you’ll apply to find it, and how big the effect would need to be for this study to detect it.
It’s really laudable to seek good evidence to guide recruitment interventions, but it’s counterproductive to produce bad data, and it undermines our evidence-based message if we can’t get the essentials right. This stuff is really hard.
Thanks Bernadette! Other people have suggested consulting a statistician, but it hasn’t been clear to me precisely what one would do. I went to some lengths to be as specific as possible in our plan about what our data is, what we expect to see, and how we plan to calculate our success criteria (e.g. creating dummy numbers and producing working code that runs through our calculations). Can you maybe poke some holes in our approach so that I get a better idea of what a statistician would bring to the table?
Also, do you know how one goes about finding a statistician? I assume there aren’t just statisticians around for hire, like lawyers or accountants. This came up in a previous thread, and someone mentioned Statistics Without Borders and Charity Science, but it didn’t seem like these organizations offered this as a service; I would just be cold-calling and asking for their help. If you have someone particular in mind who would be qualified to do this, I’d love to get their contact info; at least then I could get a cost estimate to take back to TLYCS.
You’re right, there are definitely statisticians for hire. All my experience is in health research, and our stats people are pretty oversubscribed, I’m afraid. If you or anyone doing this has an academic affiliation, I would pursue your institution for possibilities.
The first hole I would poke is that your control is not defined. On page 5 you say you can ‘vary the strategy’ for defining the baseline, and while I understand the reason is to avoid using an unrepresentative baseline if something weird is happening, varying the control = data mining. From the outset you are giving yourself multiple (non-independent) tests with which to find an effect. I would suggest you define your baseline robustly as the corresponding days of the week in a lead-in period. Run that analysis now, and if you find the result varies wildly with how long a lead-in period you choose, that is something to discuss with TLYCS (are there confounding factors? are there time periods to exclude?) and with your statistician.
(Also, regarding the baseline, forgive me if I missed it, but can you use IP-based location to limit this to visits from California? That would seem to be a good way to exclude a lot of noise. Increases in non-Californian visits after distribution might also help flag other confounders, like a media story.)
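To make the suggested control concrete, here is a sketch of how the fixed day-of-week baseline plus the geographic filter might look in pandas. The file name, column names, and cutoff date are hypothetical, not the real Google Analytics schema:

```python
import pandas as pd

# Hypothetical Google Analytics export: one row per day per region with
# a visit count. File and column names are assumptions.
df = pd.read_csv("ga_daily_visits.csv", parse_dates=["date"])
df = df[df["region"] == "California"]  # the geographic filter suggested above

# Fix the control up front: corresponding days of the week in a lead-in
# period chosen before looking at the distribution-day data.
lead_in = df[df["date"] < "2015-03-01"].copy()  # placeholder cutoff
lead_in["weekday"] = lead_in["date"].dt.day_name()

baseline = lead_in.groupby("weekday")["visits"].agg(["mean", "std"])
print(baseline)
```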
The second is that you need a value (even an estimated one) for your measures at baseline, and you need a power calculation. If you find a null result, is that a mission killer? If you’re not powered to find effects you would care about, then you need to consider a bigger pilot or a further study.
You make an oblique reference to power when you state that you don’t think the number of donors/pledgers will be large enough to be detectable, hence a composite measure for donors. This is where I think you get a bit vague: the conversion ratios you choose will have a massive effect on that measure, and the plausible values differ by orders of magnitude. They could be the topic of a whole empirical study themselves. How are you generating them? Do you have any data to support them? Will you put confidence intervals on them?
I don’t think you’ve clearly defined a success measure (is it a lower-bound multiplier greater than 1?). What sort of result would cause you to cancel further distributions?
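One way to address both the conversion-ratio uncertainty and the ‘lower bound > 1’ criterion at once is to push sampled ratios through to the success measure by simulation. A minimal sketch, in which every number (the spikes, the ratio ranges, the cost) is a made-up placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims = 100_000

# Placeholder spikes above baseline (not TLYCS data).
extra_visits, extra_pledges = 300, 4

# Conversion ratios are uncertain by orders of magnitude, so sample them
# on a log scale rather than fixing point values.
dollars_per_visit = 10 ** rng.uniform(-2, 0, n_sims)   # $0.01 to $1
dollars_per_pledge = 10 ** rng.uniform(1, 3, n_sims)   # $10 to $1000

cost = 500.0  # placeholder printing + distribution cost
multiplier = (extra_visits * dollars_per_visit
              + extra_pledges * dollars_per_pledge) / cost

lower_bound = np.percentile(multiplier, 5)
print(f"5th-percentile multiplier: {lower_bound:.2f}")
print("passes 'lower bound > 1' criterion:", lower_bound > 1)
```

If the lower bound stays above 1 across any reasonable choice of ratio ranges, the success criterion is robust to the conversion-ratio uncertainty; if it doesn’t, the ratios are doing most of the work.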
My final caveat is that I’m not a statistician. So if you answer all my questions, that’s a good first pass, but my critique isn’t as rigorous as theirs would be.
Awesome, thanks for diving to this level of detail. You raise a lot of good points, some of which we’ve thought of, some not. I’ve started emailing statistical consulting companies; we’ll see what comes back.
I do want to pose this question another way that reflects my doubts about the necessity of a statistician more accurately. I definitely agree that having someone on board with that skill set would be nice … so would having a world-class ad agency design the pamphlet, a small army of volunteers to hand them out, etc. But is it necessary? Let me frame the question this way. Say we run this study and afterwards publish a report. We say: this is how we calculated our baseline values (and give the data), these are the resulting spikes in our tracked metrics (and give the data), these are the assumptions we used in calculating our success criteria, and these are the conclusions we’ve drawn. How can this possibly be bad or counterproductive? Would you look at the data and say, “well, they didn’t use the best possible calculation for the baseline, so I’m throwing it all out”? Do you follow what I’m asking? I just fail to see how collecting the data and doing the calculations we proposed, even if they’re not perfect, could be bad or counterproductive. Maybe we’re leaving some value on the table by not consulting a statistician, but I don’t understand the mode in which our entire effort fails without one.
(Sorry for taking so long to reply)
A statistically sound study design is important for two major reasons I can see. Firstly, it will maximise your chance of answering the question you are trying to answer (i.e. being adequately powered, having robust confidence intervals, etc.). Secondly, it will help make sure you are studying what you think you are studying. Giving adequate consideration to sampling, randomisation, controls, etc. is key, as is using the correct tests to measure your results; these are all things a good stats person will help with. Having a ‘precise’ result is no good if you didn’t study what you thought you were studying, and a small p-value is meaningless if you didn’t make the right comparison.
Regarding why I think bad data is worse than no data, I think it comes down to human psychology. We love numbers and measurement. It’s very hard for us to unhear a result even when we find out later it was exaggerated or incorrect (for example, the MMR vaccine scare and Wakefield’s discredited paper). Nick Bostrom refers to ‘data fumes’: unreliable bits of information that permeate our ideas and to which we give excessive attention.
Actually, it does appear you can hire a statistician like a lawyer or accountant; I’ll be damned, lol. I just typed “statistical consultant” into Google and got like a million hits. I would still love a personal recommendation if you have one, though.