“deciding, based on reason, that Exposure A is certain to have no effect on Outcome X, and then repeatedly running RCTs for the effect of exposure A on Outcome X to obtain a range of p values”

If the p-values have been calculated correctly and you run enough RCTs, then we already know what the outcome of this experiment will be: when there is truly no effect, p-values are uniformly distributed, so p<0.05 will occur 5% of the time, p<0.01 will occur 1% of the time, and so on for every threshold between 0 and 1.
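
To make that concrete, here is a rough simulation sketch. It assumes a deliberately simple setup that is not from the discussion above: two arms drawn from the same standard normal distribution, compared with a two-sided z-test with known variance. The function names, sample size and trial count are all just illustrative choices.

```python
import random
from statistics import NormalDist


def null_p_value(n, rng):
    """Two-sided z-test p-value for one simulated RCT with no true effect.

    Assumption: both arms are drawn from N(0, 1), so the variance is known.
    """
    treated = [rng.gauss(0.0, 1.0) for _ in range(n)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    diff = sum(treated) / n - sum(control) / n
    z = diff / (2.0 / n) ** 0.5  # standard error of a difference in means
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def fraction_below(threshold, trials=10000, n=30, seed=0):
    """Share of simulated null RCTs whose p-value falls below the threshold."""
    rng = random.Random(seed)
    hits = sum(null_p_value(n, rng) < threshold for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    for threshold in (0.01, 0.05, 0.5):
        print(threshold, fraction_below(threshold))
```

Each printed fraction should land close to its threshold, which is all the "experiment" can tell you: it checks the p-value calculation, not the choice of threshold.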

The other way round is more interesting: it will tell you what the “power” of your test was (https://en.wikipedia.org/wiki/Power_of_a_test), but that depends strongly on the size of the effect of B on X, as well as on the sample size in your study. You’ll probably miss something if you pick a single B and X pair to represent your entire field.
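
As a sketch of how strongly power moves with effect size and sample size, the snippet below uses the textbook power formula for a two-sided, two-sample z-test. The assumptions are mine, not from the thread: equal arm sizes, unit variance in both arms, and an effect measured in standard deviations. `z_test_power` is a hypothetical helper name.

```python
from statistics import NormalDist


def z_test_power(effect_size, n_per_arm, alpha=0.05):
    """Power of a two-sided two-sample z-test.

    Assumptions: equal arm sizes, known unit variance in both arms,
    and effect_size given as a difference in means in standard deviations.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    # Expected value of the z statistic under the alternative hypothesis.
    shift = effect_size * (n_per_arm / 2.0) ** 0.5
    # Probability that the z statistic lands in either rejection tail.
    return norm.cdf(-z_crit + shift) + norm.cdf(-z_crit - shift)


if __name__ == "__main__":
    for effect_size in (0.0, 0.2, 0.5, 0.8):
        for n_per_arm in (20, 64, 200):
            print(effect_size, n_per_arm, round(z_test_power(effect_size, n_per_arm), 3))
```

With an effect of zero the power collapses to the threshold itself, which is the first experiment again. For non-zero effects the same threshold gives very different detection rates depending on the effect size and the number of participants, which is why a single B and X pair is a shaky stand-in for a whole field.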

I think the point is that any p-value threshold is arbitrary. The one you should use depends on context. It should depend on how much you care about false positives vs false negatives in that particular case, and on your priors. Also maybe we should just stop using p-values and switch to using likelihood ratios instead. Both of these changes might be useful things to advocate for, but I wouldn’t have thought changing one arbitrary threshold to another arbitrary threshold is likely to be very useful.

“The one you should use depends on context. It should depend on how much you care about false positives vs false negatives in that particular case”

Yep, exactly! Assume you’re a doctor, have a bunch of patients with a disease that is definitely going to kill them tomorrow, and there is a new, very low-cost, possible cure. Even if there’s only one study of this possible cure showing a p-value of 0.2, you really should still recommend it!

There’s been a fair amount of discussion of this in the academic literature, e.g. https://www.diva-portal.org/smash/get/diva2:1194016/FULLTEXT01.pdf and https://www.frontiersin.org/articles/10.3389/fpsyg.2018.00699/full