Out of 55 two-sample t-tests, we would expect about 3 (2.75) to come out “statistically significant” due to random chance alone, but I found 10, so we can expect most of these to point to actually meaningful differences in the survey data.
Is there a more rigorous form of this argument?
There are lots of different ways to control for multiple comparisons: https://en.wikipedia.org/wiki/Multiple_comparisons_problem#Controlling_procedures
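For instance, the simplest of these, the Bonferroni correction, just divides the significance threshold by the number of tests; a minimal sketch in R, assuming the 5% threshold and 55 tests from this analysis:

```r
# Bonferroni correction: each of the 55 tests must clear alpha / 55
# to keep the chance of any false positive across all tests below 5%
alpha <- 0.05
n_tests <- 55
alpha / n_tests  # per-test threshold, ~0.00091
```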
I second this question. Intuitively, your argument makes sense and you have something here.
But I would have more confidence in the conclusion if a false discovery rate (FDR) correction was applied; the most common controlling procedure for this is the Benjamini-Hochberg procedure (https://en.wikipedia.org/wiki/False_discovery_rate#Controlling_procedures).
In R, the stats package makes it very easy to apply the false discovery rate correction to your p-values (see https://stat.ethz.ch/R-manual/R-devel/library/stats/html/p.adjust.html). You would do something like:
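```r
# p is the vector of your 55 uncorrected p-values (see below);
# "fdr" is an alias for the Benjamini-Hochberg method ("BH")
p_adjusted <- p.adjust(p, method = "fdr")
```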
where p is a vector/list of all 55 of your uncorrected p-values from your t-tests.
Not sure if that is what you asked for, but here is my attempt to spell this out, almost more to order my own thoughts:

- assuming the null hypothesis “There is no difference in [personality trait Y] between people prioritizing vs. not prioritizing [cause area X]”, the false-positive rate of the t-test is designed to be 5%
- i.e. even if there is no difference in reality, random variation will produce differences in the sample averages anyway, and we only want to decide “There is a difference!” if the observed difference is big enough, i.e. unlikely enough under the assumption of no real difference
- we decide to call a difference “significant” if a difference at least that large would occur with less than 5% probability from random variation alone
- so, if we do one hundred t-tests where there is actually no difference in reality, then by random variation alone we expect about 5% of them to show significant differences in the sample averages
- the same goes for 55 t-tests, where we expect 55 × 5% = 2.75 significant results if there is no difference in real life
- so seeing 10 significant results instead is very unlikely if we assume the null hypothesis
- how unlikely can be calculated with the cumulative distribution function of the binomial distribution: 55 repetitions with p = 5% give a probability of 0.04% that 10 or more tests come out significant due to random chance alone (a quick sketch of this calculation is below)
- therefore, given the assumptions of the t-test, there is a 99.96% probability that the observed personality differences are not all due to random variation
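As a quick check, here is a minimal sketch of that binomial calculation in R (it assumes the 55 tests are independent):

```r
# P(10 or more significant results out of 55 tests | all null
# hypotheses true) = upper tail of a Binomial(55, 0.05) distribution
pbinom(9, size = 55, prob = 0.05, lower.tail = FALSE)  # ~0.0004, i.e. ~0.04%
```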
It seems like you are calculating the chance that NONE of these results are significant, not the chance that MOST of them ARE (?)
Hmm, do you maybe mean “based on a real effect” when you say significant? Because we already know that 10 of the 55 tests came out significant, so I don’t understand why we would want to calculate the probability of these results being significant. I was calculating the probability of seeing the 10 significant differences that we saw, assuming all the differences we observed are based not on real effects but on random variation, or basically
p(observing differences in the comparisons so large that the t-test with a 5% threshold says “significant” in 10 out of 55 cases | the differences we saw are all just based on random variation in the data).
In case you find this confusing, that is totally on me. I find significance testing very unintuitive and maybe shouldn’t even have tried to explain it. :’) Just in case, chapter 11 in Doing Bayesian Data Analysis introduces the topic from a Bayesian perspective and was really useful for me.
I think I understand what you are doing, and disagree that it meaningfully addresses my concern.