Good work. A minor point: I think the riders used when discussing significant results, along the lines of "being wrong 5% of the time in the long run", sometimes don't make sense. Compare:
How substantial are these (likely overestimated) associations? We highlight here only the largest detected effects in our data (odds ratio close to or above 2 times greater) that would be surprising to see, if there were no associations in reality and we accepted being wrong 5% of the time in the long run.
To:
Welch t-tests of gender against these scaled cause ratings have p-values of 0.003 or lower, so we can act as if the null hypothesis of no difference between genders is false, and we would not be wrong more than 5% of the time in the long run.
Although the significance threshold is commonly equated with the "type 1 error rate", which in turn is equated with "the chance of falsely rejecting the null hypothesis", this is mistaken (1). P-values are not estimates of the likelihood of the null hypothesis, but of the observation (as or more extreme) conditioned on that hypothesis. P(Null | significant result) requires one to specify a prior. Likewise, type 1 errors are best thought of as the "risk" of the test giving the wrong indication, rather than the risk of you making the wrong judgement. (There are also some remarks to be made on family-wise versus false discovery rates, which can be neglected here.)
So the first quote is sort-of right (although assuming the null and then talking about the probability of being wrong may confuse rather than clarify), but the second one isn't: you may (following standard statistical practice) reject the null hypothesis given p < 0.05, but this doesn't tell you there is a 5% chance of the null being true when you do so.
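To make the missing-prior point concrete, here is a minimal back-of-envelope sketch (the candidate priors and the assumed power of 0.8 are mine, purely for illustration): the probability that the null is true given a significant result depends on how plausible the null was to begin with, and can sit nowhere near 5%.

```python
# Back-of-envelope Bayes' rule for P(null | p < alpha).
# The alpha, power, and candidate priors below are illustrative assumptions.
alpha, power = 0.05, 0.80

for prior_null in (0.1, 0.5, 0.9):
    p_significant = prior_null * alpha + (1 - prior_null) * power
    p_null_given_sig = prior_null * alpha / p_significant
    print(f"P(null) = {prior_null:.1f}  ->  P(null | significant) = {p_null_given_sig:.2f}")
# With P(null) = 0.9 this gives roughly 0.36; with P(null) = 0.1, roughly 0.01.
```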
Hi, thanks.

I agree that "If I have observed a p < .05, what is the probability that the null hypothesis is true?" is a different question from "If the null hypothesis is true, what is the probability of observing this (or more extreme) data?". Only the latter question is answered by a p-value (the former needs some Bayesian-style subjective prior). I haven't yet seen a clear consensus on how to report this in a way that is easy for the lay reader.
The phrases I employed (highlighted in your comment) were suggested in writing by Daniel Lakens, although I added a caveat about the null in the second quote which is perhaps incorrect. His defence of the phrase "we can act as if the null hypothesis is false, and we would not be wrong more than 5% of the time in the long run" is the specific use of the word "act", "which does not imply anything about whether this specific hypothesis is true or false, but merely states that if we act as if the null-hypothesis is false any time we observe p < alpha, we will not make an error more than alpha percent of the time". I would be very interested if you have suggestions of a similar standard phrasing which captures the probability of observing data (not of a hypothesis) and is also somewhat easy for a non-stats reader to grasp.
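For what it's worth, my reading of the "long run" claim is the frequency property in the rough simulation below (my own sketch; the sample sizes and number of repeats are arbitrary): when the null really is true, the reject-whenever-p < alpha policy errs in roughly alpha of repeated experiments.

```python
# Rough simulation (arbitrary sample size and number of repeats) of the
# long-run property behind "act as if the null is false whenever p < alpha":
# when the null is true, that policy rejects it in roughly alpha of experiments.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n_per_group, n_experiments = 0.05, 50, 20_000

rejections = 0
for _ in range(n_experiments):
    a = rng.normal(0.0, 1.0, n_per_group)   # both groups drawn identically,
    b = rng.normal(0.0, 1.0, n_per_group)   # so the null is true by construction
    _, p = stats.ttest_ind(a, b, equal_var=False)   # Welch t-test
    rejections += p < alpha

print(f"Rejection rate under a true null: {rejections / n_experiments:.3f}")  # close to 0.05
```

What this deliberately does not give is P(null | p < alpha) for any particular test, which I take to be your point.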
As an aside, what is your opinion on reporting p-values greater than the relevant alpha level? I've read Daniel Lakens suggesting that if you have p > .05 one could write something like "because, given our sample size of 50 per group and our alpha level of 0.05, only observed differences more extreme than 0.4 could be statistically significant, and our observed mean difference was 0.35, we could not reject the null hypothesis". This seems a bit wordy for any lay reader, but would it be worth even including in a footnote?
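As a footnote to that aside, I assume the 0.4 figure in that example comes from something like the calculation below (a two-sided, equal-variance two-sample test on standardised scores, i.e. both groups with SD = 1; that standardisation is my assumption, not something stated in the quote).

```python
# Where a "smallest significant difference" of ~0.4 could come from with
# 50 per group and alpha = 0.05, assuming standardised scores (SD = 1).
from math import sqrt
from scipy import stats

n_per_group, alpha, sd = 50, 0.05, 1.0
df = 2 * (n_per_group - 1)                  # 98 for an equal-variance two-sample test
t_crit = stats.t.ppf(1 - alpha / 2, df)     # two-sided critical value, about 1.98
se = sd * sqrt(2 / n_per_group)             # standard error of the mean difference, 0.2
print(f"Smallest observable difference reaching p < {alpha}: {t_crit * se:.2f}")  # ~0.40
# So an observed difference of 0.35 could not be statistically significant.
```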
It was commendable to seek advice, but I fear in this case the recommendation you got doesn't hit the mark.
I don't see the use of "act (as if)" as helping much. Firstly, it is not clear what it means to be "wrong about" "acting as if the null hypothesis is false", but however one cashes this out, I don't think it avoids the problem of the absent prior. Even if we say "We will follow the policy of rejecting the null whenever p < alpha", knowing the error rate of this policy overall still demands a "grand prior" of something like "how likely is a given (/randomly selected?) null hypothesis we are considering to be true?"
Perhaps what Lakens has in mind is that, as we expand the set of null hypotheses we are testing to some very large set, the prior becomes maximally uninformative (and so alpha converges to the significance threshold), but this is deeply uncertain to me. Besides, we want to know (and a reader might reasonably interpret the rider as being about) the likelihood of this policy getting the wrong result for the particular null hypothesis under discussion.
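Here is a crude simulation of that worry (all the particulars — the effect size when the null is false, the sample sizes, the candidate grand priors — are invented for illustration): the same reject-when-p < 0.05 policy produces very different shares of wrongly rejected nulls depending on the grand prior.

```python
# Crude simulation of how the reject-when-p < alpha policy performs when a
# fraction `prior_null` of the tested hypotheses have a true null. The effect
# size, sample sizes, and priors are invented purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_per_group, n_experiments, effect = 0.05, 50, 10_000, 0.5

for prior_null in (0.2, 0.5, 0.9):
    false_rejections, rejections = 0, 0
    for _ in range(n_experiments):
        is_null = rng.random() < prior_null
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(0.0 if is_null else effect, 1.0, n_per_group)
        _, p = stats.ttest_ind(a, b, equal_var=False)   # Welch t-test
        if p < alpha:
            rejections += 1
            false_rejections += is_null
    print(f"grand prior P(null) = {prior_null:.1f}: "
          f"share of rejections where the null was true = {false_rejections / rejections:.2f}")
```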
--
As I fear this thread demonstrates, p values are a subject which tends to get more opaque the more one tries to make them clear (only typically rivalled by "confidence interval"). They're also generally much lower yield than most other bits of statistical information (i.e. we generally care a lot more about narrowing down the universe of possible hypotheses by effect size etc. rather than simply excluding one). The write-up should be welcomed for providing higher yield bits of information (e.g. effect sizes with CIs, regression coefficients, etc.) where it can.
Most statistical work never bothers to crisply explain exactly what it means by "significantly different (P = 0.03)" or similar, and I think it is defensible to leave it at that rather than wading into the treacherous territory of trying to give a clear explanation (notwithstanding the fact that the typical reader will misunderstand what this means). My attempt would be not to provide an "in-line explanation", but to offer an explanatory footnote (maybe after the first p value), something like this:
Our data suggests a trend/association between X and Y. Yet we could also explain this as a matter of luck: even though in reality X and Y are not correlated [or whatever], it may be that we just happened to sample people where those high in X also tended to be high in Y, in the same way a fair coin might happen to give more heads than tails when we flip it a number of times.
A p-value tells us how surprising our results would be if they really were just a matter of luck: strictly, it is the probability of our study giving results as or more unusual than our data if the "null hypothesis" (in this case, that there is no correlation between X and Y) were true. So a p-value of 0.01 means our data is in the top 1% of unusual results, a p-value of 0.5 means our data is in the top half of unusual results, and so on.
A p-value doesn't say all that much by itself: crucially, it doesn't tell us the probability of the null hypothesis itself being true. For example, a p-value of 0.01 doesn't mean there's a 99% probability the null hypothesis is false. A coin being flipped 10 times and landing heads on all of them is in the top percentile (indeed, roughly the top 0.1%) of unusual results presuming the coin is fair (the "null hypothesis"), but we might have reasons, even after seeing 10 heads from 10 flips, to believe it is probably fair anyway (maybe we made it ourselves with fastidious care, maybe it's being simulated on a computer and we've audited the code, or whatever). At the other extreme, a p-value of 1.0 doesn't mean we know for sure the null hypothesis is true: although seeing 5 heads and 5 tails from 10 flips is the least unusual result given the null hypothesis (and so all possible results are "as or more unusual" than what we've seen), it could be that the coin is unfair and we just didn't see it.
What we can use a p-value for is as a rule of thumb for which apparent trends are worth considering further. If the p-value is high, the "just a matter of luck" explanation for the trend between X and Y is credible enough that we shouldn't over-interpret it; if the p-value is low, on the other hand, the apparent trend between X and Y would be an unusual result if it really were just a matter of luck, and so we might consider alternative explanations (e.g. our data wouldn't be such an unusual finding if there really was some factor that causes those high in X to also be high in Y).
"High" and "low" are matters of degree, but one usually sets a "significance threshold" to make the rule of thumb concrete: when a p-value is higher than this threshold, we dismiss an apparent trend as just a matter of luck; if it is lower, we deem it significant. The standard convention is for this threshold to be p = 0.05.
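(If it helps, the coin numbers in the draft footnote above can be checked with a few lines; the 0.999 prior and the lands-heads-90%-of-the-time alternative are just stand-ins for "we made it ourselves with fastidious care".)

```python
# Checking the coin example: 10 heads in 10 flips is in roughly the top 0.1%
# of results under a fair coin, yet a strong prior for fairness can survive it.
# The 0.999 prior and the 0.9-heads alternative are illustrative assumptions.
prior_fair = 0.999
p_heads_if_biased = 0.9

p_data_if_fair = 0.5 ** 10          # about 0.001; also the one-sided p-value,
                                    # since no result is more extreme than 10/10 heads
p_data_if_biased = p_heads_if_biased ** 10

posterior_fair = (prior_fair * p_data_if_fair) / (
    prior_fair * p_data_if_fair + (1 - prior_fair) * p_data_if_biased
)
print(f"P(10 heads | fair coin) = {p_data_if_fair:.4f}")    # 0.0010
print(f"P(fair coin | 10 heads) = {posterior_fair:.2f}")    # about 0.74: probably still fair
```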