It’s great to see research like this being done. I strongly agree that more resources should be spent on efforts like this, and that independent evaluation seems particularly useful. I also strongly concur that self-report measures of the usefulness of EA services are often likely to be significantly inflated by social desirability / demand effects, and that efforts like this, which try to assess objective differences in behaviour or attitudes, seem neglected.
One minor point is that I didn’t follow the reasoning here:
The statistical significance and internal coherence of these initial differences between treatment and control groups yield two key lessons. First, the results provide a partial validation of the study design, showing that it is possible to detect meaningful and interpretable differences in EA behaviours and attitudes even with a fairly small sample size (n=78). This provides evidence that the survey design was capable of detecting impacts of the conference on attendees if these impacts existed.
If I’m reading this correctly, it seems like the first tests you’re referring to were looking at the large baseline differences between the treatment and control groups (e.g. one group being older, containing more professionals, etc.). But I don’t follow how this tells us about the power to detect differences in changes in the outcome variables between the two groups. It seems plausible to me that the effect size for differences in EA behaviours/attitudes as a result of the conference would be smaller than the initial difference in composition between conference attendees and non-attendees. Looking at the error bars on the plots for the key outcomes, the results don’t seem to be super tightly bounded. And various methods one might use to account for possible differential attrition affecting these analyses would likely further reduce power. But either way, why not just look directly at the power analyses for these latter tests, rather than at the first set of initial tests?
To be clear, I doubt this would change the substantive conclusions. And it further speaks to the need for more investment in this area so we can run larger samples.
Hi David,

The point I was trying to communicate here was simply that our design was able to find a pattern of differences between the control and treatment groups which is interpretable (i.e. in terms of different ages and career stages). I think this provides some validation of the design, in that if large enough differences exist, our measures pick up these differences and we can statistically measure them. We don’t, for instance, see an unintelligible mess of results that would cast doubt on the validity of our measures or the design itself. Of course, if, as you point out, the effect size for attending the conference is smaller, then we won’t be able to detect it given our sample size. For most of our measures, the smallest effect we could detect was around 15-20%. But given that we were able to measure sufficiently large effects using this design, I think this provides justification for thinking that a large enough sample size, using a similar study design, would be able to detect smaller effects if they existed. Hope that clarifies a bit.
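To illustrate the general point about sample size and detectable effects (my own rough numbers, not figures from the study), a standard normal-approximation power calculation shows how, with roughly 39 people per group (n=78 total), large standardized effects are comfortably detectable while small ones are not. Note this sketch uses Cohen's d as the effect-size metric, which is an assumption on my part and not necessarily the 15-20% metric used for the study's measures:

```python
from statistics import NormalDist

def two_sample_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test for a
    standardized effect size d (Cohen's d), using the normal approximation."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)          # critical value, e.g. ~1.96 for alpha=0.05
    shift = d * (n_per_group / 2) ** 0.5        # noncentrality of the test statistic
    # Probability the statistic lands in either rejection region
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)

# Hypothetical illustration at n=39 per group (78 total):
print(round(two_sample_power(0.3, 39), 2))   # a smallish effect: power ~0.26
print(round(two_sample_power(0.8, 39), 2))   # a large effect: power ~0.94
```

On these illustrative numbers, a design like this would detect a large baseline compositional difference with high probability, but would have only around a one-in-four chance of detecting a smaller treatment effect, which is consistent with the point about needing larger samples.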