I think the riders attached to significant results, along the lines of “being wrong 5% of the time in the long run”, sometimes don’t make sense. Compare:
How substantial are these (likely overestimated) associations? We highlight here only the largest detected effects in our data (odds ratio close to or above 2 times greater) that would be surprising to see, if there were no associations in reality and we accepted being wrong 5% of the time in the long run.
To:
Welch t-tests of gender against these scaled cause ratings have p-values of 0.003 or lower, so we can act as if the null hypothesis of no difference between genders is false, and we would not be wrong more than 5% of the time in the long run.
Although the significance threshold is commonly equated with the ‘type 1 error rate’, which in turn is equated with ‘the chance of falsely rejecting the null hypothesis’, this is mistaken (1). P values are not estimates of the probability of the null hypothesis, but of the observation (or one more extreme) conditional on that hypothesis. P(Null|significant result) requires one to specify a prior. Likewise, T1 errors are best thought of as the ‘risk’ of the test giving the wrong indication, rather than the risk of you making the wrong judgement. (There are also some remarks on family-wise versus false discovery rates which can be neglected.)
So the first quote is sort-of right (although assuming the null then talking about the probability of being wrong may confuse rather than clarify), but the second one isn’t: you may (following standard statistical practice) reject the null hypothesis given P < 0.05, but this doesn’t tell you there is a 5% chance of the null being true when you do so.
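To spell out why the prior is needed, here is Bayes’ rule for the quantity the second quote seems to describe. The symbols are shorthand introduced for this sketch, not taken from the report: π is the prior probability that the null is true, α is the significance threshold, and 1 − β is the power of the test against a single assumed alternative (a simplification, since real alternatives are rarely a single point).

```latex
P(H_0 \mid p < \alpha)
  = \frac{P(p < \alpha \mid H_0)\, P(H_0)}
         {P(p < \alpha \mid H_0)\, P(H_0) + P(p < \alpha \mid H_1)\, P(H_1)}
  = \frac{\alpha\, \pi}{\alpha\, \pi + (1 - \beta)(1 - \pi)}
```

Under these assumptions, with α = 0.05 and power 0.8, a prior of π = 0.5 gives roughly 0.06, while π = 0.9 gives 0.36. The threshold alone fixes neither figure.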
I agree that “If I have observed a p < .05, what is the probability that the null hypothesis is true?” is a different question from “If the null hypothesis is true, what is the probability of observing this (or more extreme) data?”. Only the latter question is answered by a p-value (the former needing some Bayesian-style subjective prior). I haven’t yet seen a clear consensus on how to report this in a way that is easy for the lay reader.
The phrases I employed (highlighted in your comment) were suggested in writing by Daniel Lakens, although I added a caveat about the null in the second quote which is perhaps incorrect. His defence of the phrase “we can act as if the null hypothesis is false, and we would not be wrong more than 5% of the time in the long run” is the specific use of the word ‘act’, “which does not imply anything about whether this specific hypothesis is true or false, but merely states that if we act as if the null-hypothesis is false any time we observe p < alpha, we will not make an error more than alpha percent of the time”. I would be very interested if you have suggestions of a similar standard phrasing which captures both the probability of observing data (not a hypothesis) and is somewhat easy for a non-stats reader to grasp.
As an aside, what is your opinion on reporting p values greater than the relevant alpha level? I’ve read Daniel Lakens suggesting that if you have p > .05 one could write something like “because given our sample size of 50 per group, and our alpha level of 0.05, only observed differences more extreme than 0.4 could be statistically significant, and our observed mean difference was 0.35, we could not reject the null hypothesis.” This seems a bit wordy for any lay reader, but would it be worth including, even in a footnote?
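For what it’s worth, the “0.4” in that example can be roughly reproduced. The sketch below assumes a two-sided test, equal group sizes, and a common standard deviation of 1 (so differences are in standardised units); the quoted text does not state the standard deviation, so that value is an assumption, as is the function name `critical_difference`. It uses a normal approximation from the standard library rather than the t distribution.

```python
from statistics import NormalDist


def critical_difference(n_per_group, alpha=0.05, sd=1.0):
    """Smallest absolute mean difference that reaches two-sided significance.

    Normal approximation for two independent groups of equal size with a
    common standard deviation `sd` (sd=1.0 is an assumption, not a value
    given in the quoted example).
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)      # about 1.96 for alpha = 0.05
    standard_error = sd * (2 / n_per_group) ** 0.5    # SE of a difference in means
    return z_crit * standard_error


crit = critical_difference(50)
observed = 0.35
print(f"critical difference: {crit:.2f}")
print("reject null" if abs(observed) > crit else "cannot reject null")
```

Under these assumptions the critical difference comes out at about 0.39; using the t distribution instead would nudge it up slightly, towards the 0.4 quoted. An observed difference of 0.35 falls short either way.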
It was commendable to seek advice, but I fear in this case the recommendation you got doesn’t hit the mark.
I don’t see the use of ‘act (as if)’ as helping much. Firstly, it is not clear what it means to be ‘wrong about’ ‘acting as if the null hypothesis is false’, but I don’t think however one cashes this out it avoids the problem of the absent prior. Even if we say “We will follow the policy of rejecting the null whenever p < alpha”, knowing the error rate of this policy overall still demands a ‘grand prior’ of something like “how likely is a given (/randomly selected?) null hypothesis we are considering to be true?”
Perhaps what Lakens has in mind is that as we expand the set of null hypotheses we are testing to some very large set, the prior becomes maximally uninformative (and so the policy’s error rate converges to alpha, the significance threshold), but this is deeply uncertain to me. Besides, we want to know (and a reader might reasonably interpret the rider as being about) the likelihood of this policy getting the wrong result for the particular null hypothesis under discussion.
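One way to make the ‘grand prior’ concrete is to simulate the policy of rejecting whenever p < alpha over many tests, varying the share of nulls that are actually true. This is only a sketch under stated assumptions: each test is reduced to a single z statistic, every false null has the same effect (`effect_z=2.8`, chosen to give roughly 80% power), and `simulate_policy` and `share_true_nulls` are names invented here.

```python
import random
from statistics import NormalDist


def simulate_policy(share_true_nulls, effect_z=2.8, alpha=0.05,
                    n_tests=100_000, seed=0):
    """Apply 'reject whenever p < alpha' to many simulated two-sided z-tests.

    Returns two long-run error rates:
      - false rejections as a share of ALL tests run
      - false rejections as a share of the tests where we REJECTED
    `effect_z` is the assumed mean of the z statistic when the null is false.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    false_rejections = 0
    for _ in range(n_tests):
        null_is_true = rng.random() < share_true_nulls
        z = rng.gauss(0.0 if null_is_true else effect_z, 1.0)
        if abs(z) > z_crit:                 # equivalent to p < alpha
            rejections += 1
            if null_is_true:
                false_rejections += 1
    among_all = false_rejections / n_tests
    among_rejections = false_rejections / rejections if rejections else float("nan")
    return among_all, among_rejections


for share in (0.1, 0.5, 0.9):
    among_all, among_rejections = simulate_policy(share)
    print(f"true nulls {share:.0%}: wrong in {among_all:.3f} of all tests, "
          f"{among_rejections:.3f} of rejections")
```

Both figures move with the assumed share of true nulls. The second, the share of rejections that are mistaken, is the one a reader is likely to take the rider to be about, and it cannot be read off alpha alone.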
--
As I fear this thread demonstrates, p values are a subject which tends to get more opaque the more one tries to make them clear (only typically rivalled by ‘confidence interval’). They’re also generally much lower yield than most other bits of statistical information (i.e. we generally care a lot more about narrowing down the universe of possible hypotheses by effect size etc. rather than simply excluding one). The write-up should be welcomed for providing higher yield bits of information (e.g. effect sizes with CIs, regression coefficients, etc.) where it can.
Most statistical work never bothers to crisply explain exactly what it means by ‘significantly different (P = 0.03)’ or similar, and I think it is defensible to leave it at that rather than wading into the treacherous territory of trying to give a clear explanation (notwithstanding the fact the typical reader will misunderstand what this means). My attempt would be not to provide an ‘in-line explanation’, but offer an explanatory footnote (maybe after the first p value), something like this:
Our data suggests a trend/association between X and Y. Yet we could also explain this as a matter of luck: even though in reality X and Y are not correlated [or whatever], it may be that we just happened to sample people where those high in X also tended to be high in Y, in the same way a fair coin might happen to give more heads than tails when we flip it a number of times.
A p-value tells us how surprising our results would be if they really were just a matter of luck: strictly, it is the probability of our study giving results as or more unusual than our data if the ‘null hypothesis’ (in this case, there is no correlation between X and Y) was true. So a p-value of 0.01 means our data is in the top 1% of unusual results, a p-value of 0.5 means our data is in the top half of unusual results, and so on.
A p-value doesn’t say all that much by itself: crucially, it doesn’t tell us the probability of the null hypothesis itself being true. For example, a p-value of 0.01 doesn’t mean there’s a 99% probability the null hypothesis is false. A coin being flipped 10 times and landing heads on all of them is in the top percentile (indeed, roughly the top 0.1%) of unusual results presuming the coin is fair (the ‘null hypothesis’), but we might have reasons, even after seeing only heads in 10 flips, to believe it is probably fair anyway (maybe we made it ourselves with fastidious care, maybe it’s being simulated on a computer and we’ve audited the code, or whatever). At the other extreme, a p-value of 1.0 doesn’t mean we know for sure the null hypothesis is true: although seeing 5 heads and 5 tails from 10 flips is the least unusual result given the null hypothesis (and so all possible results are ‘as or more unusual’ than what we’ve seen), it could be that the coin is unfair and we just didn’t see it.
What we can use a p-value for is as a rule of thumb for which apparent trends are worth considering further. If the p-value is high, the ‘just a matter of luck’ explanation for the trend between X and Y is credible enough that we shouldn’t over-interpret it. On the other hand, a low p-value means the apparent trend between X and Y would be an unusual result if it really were just a matter of luck, and so we might consider alternative explanations (e.g. our data wouldn’t be such an unusual finding if there really was some factor that causes those high in X to also be high in Y).
‘High’ and ‘low’ are matters of degree, but one usually sets a ‘significance threshold’ to make the rule of thumb concrete: when a p-value is higher than this threshold, we dismiss an apparent trend as just a matter of luck; if it is lower, we deem it significant. The standard convention is for this threshold to be p = 0.05.
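The coin examples in the footnote above can be checked with a few lines of exact arithmetic. The sketch below uses only the standard library. The posterior calculation is illustrative only: it assumes a 99.99% prior that the coin is fair and a single alternative, a two-headed coin, and both of those numbers are made up for the example, as are the function names.

```python
from math import comb


def outcome_prob(heads, flips, p_heads=0.5):
    """Binomial probability of seeing exactly `heads` heads in `flips` flips."""
    return comb(flips, heads) * p_heads**heads * (1 - p_heads)**(flips - heads)


def one_sided_p_value(heads, flips):
    """Probability, under a fair coin, of at least this many heads."""
    return sum(outcome_prob(k, flips) for k in range(heads, flips + 1))


def two_sided_p_value(heads, flips):
    """Probability, under a fair coin, of any outcome no more likely than the one seen."""
    observed = outcome_prob(heads, flips)
    return sum(outcome_prob(k, flips) for k in range(flips + 1)
               if outcome_prob(k, flips) <= observed * (1 + 1e-9))


def posterior_fair(heads, flips, prior_fair, p_heads_if_biased):
    """Posterior probability the coin is fair, against ONE assumed biased alternative."""
    fair = prior_fair * outcome_prob(heads, flips, 0.5)
    biased = (1 - prior_fair) * outcome_prob(heads, flips, p_heads_if_biased)
    return fair / (fair + biased)


print(f"10/10 heads, one-sided p: {one_sided_p_value(10, 10):.4f}")
print(f"10/10 heads, two-sided p: {two_sided_p_value(10, 10):.4f}")
print(f"5/10 heads, two-sided p:  {two_sided_p_value(5, 10):.4f}")
# Assumed prior (99.99% fair) and assumed alternative (two-headed coin):
print(f"P(fair | 10/10 heads):    {posterior_fair(10, 10, 0.9999, 1.0):.3f}")
```

Ten heads in ten flips has a one-sided p-value of about 0.001 (about 0.002 if all tails counts as equally unusual), yet under the assumed prior the coin still comes out roughly 91% likely to be fair. Five heads in ten gives a p-value of 1.0 without establishing fairness.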