I think this article makes its case compellingly, and I appreciate that you spell out the sometimes-subtle ways uncertainty gets handled.
Did the question “Why should justification standards be the same?” arise in a sociological / EA movement context? My interpretation (from the question wording alone) would be more epistemic, along the lines of the unity of science. In my view, standards for justification have to be standardized; otherwise they wouldn’t be standards, and one could just offer an arbitrary justification for any given question.
I think I agree with the central theses here, as I read them: indeed, ideally we would (1) measure what happens to people individually, rather than on average, due to taking psychiatric drugs, and (2) measure an outcome that reflects people’s aggregate preference for their experience of life with the drug versus the counterfactual experience of life without the drug.
However, I think these problems are harder to resolve than the post suggests. Neither can be measured directly (outside circumscribed / assumption-laden situations) due to the fundamental problem of causal inference, which is not resolved by people’s self-reported estimates of individual causal effects. There are better approaches to consider than comparing averages, but, in my opinion, comparing averages is the default for practical causal inference reasons, rather than because of a failure to take phenomenology seriously.
I agree that (2) is more tractable; however, these improvements are non-trivial to implement. Continuing your example, if we reanalyze a trial to focus on patients with high baseline akathisia, who may be most affected by either a benefit or a harm, we have far fewer patients to analyze. What was once an adequately powered trial to detect a moderate effect in the full sample is now under-powered. The same issue arises when analyzing complex interactions: precisely estimating interaction effects generally requires far larger sample sizes than estimating main effects. So a trial designed to measure a main effect of a drug is unlikely to be sufficiently powered to estimate several interaction effects.
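To make the power point concrete, here is a rough sketch using a normal approximation to a two-sided test on a standardized mean difference. All numbers are hypothetical assumptions chosen for illustration (an effect of d = 0.5, 64 patients per arm, a subgroup making up 20% of the sample, and an interaction as large as the main effect); none come from the post or from any real trial.

```python
from statistics import NormalDist

_norm = NormalDist()  # standard normal, stdlib only


def approx_power(effect, se, alpha=0.05):
    """Approximate power of a two-sided z-test given an effect and its standard error."""
    z_crit = _norm.inv_cdf(1 - alpha / 2)
    z = effect / se
    return _norm.cdf(z - z_crit) + _norm.cdf(-z - z_crit)


def power_main_effect(d, n_per_arm, alpha=0.05):
    """Two-arm comparison of means with unit variance: SE = sqrt(2 / n_per_arm)."""
    return approx_power(d, (2 / n_per_arm) ** 0.5, alpha)


def power_interaction(d, n_per_cell, alpha=0.05):
    """2x2 design (arm x subgroup) with equal cells.

    The interaction is a difference of differences, so SE = sqrt(4 / n_per_cell).
    """
    return approx_power(d, (4 / n_per_cell) ** 0.5, alpha)


# Hypothetical trial: d = 0.5, 64 patients per arm (128 total).
d = 0.5
n_per_arm = 64

power_full = power_main_effect(d, n_per_arm)

# Assumed: only 20% of patients have high baseline akathisia.
n_subgroup_per_arm = round(0.2 * n_per_arm)
power_subgroup = power_main_effect(d, n_subgroup_per_arm)

# Same 128 patients split into four equal cells, interaction assumed equal to d.
power_inter = power_interaction(d, n_per_arm // 2)

print(f"full sample:  {power_full:.2f}")
print(f"subgroup:     {power_subgroup:.2f}")
print(f"interaction:  {power_inter:.2f}")
```

Under these assumptions, power of roughly 0.8 in the full sample drops to around a quarter in the subgroup, and to under a third for the interaction. In this balanced setup, the interaction estimate has twice the variance of the main effect estimate, and the gap widens further if the interaction is smaller than the main effect, which is often a more realistic assumption.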
For either issue, the data is not already there in my view. That said, I may not be fully understanding what exactly you propose doing; are there examples of “[using] criticality and complex systems modeling tools to deal with symptom interactions” in a healthcare context that illustrate this sort of analysis?