Why Psychiatric Drug Evaluation Misses the Real Story

[Epistemic Status: Speculative but plausible, consistent with my personal experience, and very important if true. Specific pharmacology statistics and methods are for illustration only. The core argument about measurement scales doesn’t hinge on them.]


The puzzle is simple. Clinical trials for psychiatric drugs show modest improvements: effect sizes typically cluster in the small-to-moderate range depending on the condition and measurement. SSRIs for depression, antipsychotics for psychosis, benzodiazepines for panic: all follow this pattern. But in practice, people taking the same drugs report everything from clear benefit to total neutrality to severe deterioration. Ask someone with panic disorder about benzodiazepines and they’ll tell you either “it works wonders” or “it made everything worse”—nobody says with a straight face that they experienced a “small-to-moderate improvement”.

The standard explanation treats the extremely negative responses as “outliers” or “side-effects”. Squint and you can sort of see two “overlapping Gaussians”—most people get better, a few unlucky ones get side effects. You either “got better in expectation” (maybe just unlucky if you got worse in practice) or you were one of the very few unlucky who got the “side-effects”.

But what if there is a much more elegant description of what is going on? The distribution doesn’t look like two Gaussians even once you remove the most severe cases. It stays skewed, heavy-tailed in both directions. The wide range of responses isn’t noise around a true average but the very thing we need to explain and account for if we want to make informed decisions.

Let’s start simple, with a concrete example:

Imagine two people in a drug trial who start at the same baseline: 3/10 sadness and 3/10 sense of inner restlessness (akathisia). During the trial, the first person’s sadness moved from 3/10 to 6/10 and inner restlessness moved from 3 to 6 (also on a 0 to 10 scale). Now compare that with the other person, whose sadness stayed at 3 while their akathisia skyrocketed from 3 to 9. On a psychiatric evaluation form, both patterns might add up to the same “total change in symptom scores”. But as far as phenomenology goes, the case where akathisia shoots up to 9/10 is overwhelmingly worse. Anyone who has been near that state knows that a “9” on an akathisia scale is not three times worse than a “3.” It is another category of sensation altogether; indeed on another level of moral significance.

Some of us working in this space—people like Chris Percy, Alfredo Parra, and myself (see: 1, 2, 3, 4)—have pointed to a pattern that standard psychiatric measurements seemingly miss. Symptoms have long-tailed distributions at the level of actual intensity. When someone reports their akathisia as a “9,” that likely reflects being in a genuinely steep part of the distribution—a “9” is not simply three times as intense as what a “3” feels like. The problem emerges when trials collect these reports and add them arithmetically. But actual suffering seems to add up differently—not through simple addition of the scores, but through something closer to exponential weighting of the underlying intensities, and only then summation[1]. To a first approximation, a person’s experienced valence might be described as coming from summing the weighted contribution of each symptom, where the weights themselves depend on reported intensity level. When you account for this structure before adding, you get a different picture than when you add the reported scores directly.
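To make the contrast concrete, here is a minimal sketch of the two-person example above. It assumes, purely for illustration, that a reported score maps to underlying intensity as BASE ** score with BASE = 2. That mapping, the constant, and the function names are hypothetical and not taken from any validated scale, and the sketch leaves out interaction terms between symptoms.

```python
# Toy comparison of two ways to aggregate symptom scores.
# ASSUMPTION (illustrative only): intensity = BASE ** score.
# BASE = 2 is arbitrary, not an empirical estimate.
BASE = 2.0


def intensity(score: float, base: float = BASE) -> float:
    """Map a 0-10 reported score to a hypothetical underlying intensity."""
    return base ** score


def linear_total(scores: dict[str, float]) -> float:
    """What a symptom form does: add the reported scores directly."""
    return sum(scores.values())


def weighted_total(scores: dict[str, float]) -> float:
    """Convert each score to intensity first, and only then sum."""
    return sum(intensity(s) for s in scores.values())


baseline = {"sadness": 3, "akathisia": 3}
person_a = {"sadness": 6, "akathisia": 6}  # both symptoms rise to 6
person_b = {"sadness": 3, "akathisia": 9}  # akathisia alone rises to 9

linear_change_a = linear_total(person_a) - linear_total(baseline)
linear_change_b = linear_total(person_b) - linear_total(baseline)
weighted_change_a = weighted_total(person_a) - weighted_total(baseline)
weighted_change_b = weighted_total(person_b) - weighted_total(baseline)

print("linear change:  ", linear_change_a, linear_change_b)
print("weighted change:", weighted_change_a, weighted_change_b)
```

Under these assumptions both people show the same linear change of 6 points, while the weighted change is 112 for the first person and 504 for the second. The exact ratio depends entirely on the assumed base; only the direction of the gap is the point.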

Mixed valence complexifies the picture. Say an antipsychotic drug reduces delusions from a 7/10 to a 5/10, a clear improvement on a steep region of the distribution of subjective discomfort. But to achieve this effect, it also raises akathisia from a 6/10 to an 8/10 as a side-effect. On a symptom scale, these changes might look like they roughly cancel out: you’ve gained 2 points on one domain and lost 2 on another. But the person’s actual experience isn’t well described by this simple arithmetic. Delusions at a 5 are genuinely better than delusions at a 7, but not by some fixed amount: the benefit sits on a very steep part of the distribution, so the improvement is much larger than the numbers suggest. Akathisia at an 8 is worse than at a 6, and that cost sits on a yet steeper part of the distribution. The underlying intensities don’t offset the way the numbers suggest. The actual suffering from akathisia going from 6 to 8 may well exceed the actual relief from delusions going from 7 to 5 by a significant amount. For those to whom this happens, this is a net worsening, even though the trial might report such cases as “net neutral”. A logarithmic scale is being confused for a linear one, and as a consequence the side-effects are drastically minimized in the studies.
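The trade-off in this example can be checked numerically. The sketch below assumes a hypothetical exponential mapping from score to intensity (base ** score), applies the same curve to both symptoms, and tries several bases, since the real steepness is unknown. None of these choices come from a validated instrument.

```python
# ASSUMPTION (illustrative only): both symptoms share one curve,
# intensity = base ** score. Real symptoms may follow different curves.


def intensity(score: float, base: float) -> float:
    """Hypothetical underlying intensity for a 0-10 reported score."""
    return base ** score


def net_change(base: float) -> dict[str, float]:
    """Delusions 7 -> 5 and akathisia 6 -> 8, as in the example above."""
    relief = intensity(7, base) - intensity(5, base)  # suffering removed
    cost = intensity(8, base) - intensity(6, base)    # suffering added
    return {"relief": relief, "cost": cost, "net_suffering": cost - relief}


# On the reported scale the two changes cancel exactly.
linear_net = (5 - 7) + (8 - 6)

# Try a few bases to see how sensitive the conclusion is to the assumption.
results = {base: net_change(base) for base in (1.5, 2.0, 3.0)}
for base, r in results.items():
    ratio = r["cost"] / r["relief"]
    print(f"base={base}: relief={r['relief']:.1f}, "
          f"cost={r['cost']:.1f}, cost/relief={ratio:.2f}")
```

With a shared exponential curve, the cost-to-relief ratio works out to the base itself, so the added akathisia outweighs the relief for any base above 1. That follows from the assumed functional form, not from data.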

This repeats across psychiatric medications. Some people on a given drug experience genuine improvement. Others experience net worsening. Many sit in between. Trials aggregate across all these outcomes and produce an average that obscures both the clear beneficiaries and the clear sufferers. A 0.3 standard deviation improvement can emerge from a population where a substantial minority got substantially worse while others got modestly better. The net global valence: down the drain.
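A toy cohort shows how this can happen arithmetically. The numbers below are invented for illustration, and the "standardized mean change" is simply the mean change divided by the standard deviation of change, a simplification of how trials actually compare drug and placebo arms.

```python
import statistics

# Invented cohort of 100 patients. Positive = improvement, in scale points.
# 75 people get modestly better, 25 get substantially worse.
changes = [3] * 75 + [-5] * 25

mean_change = statistics.mean(changes)
sd_change = statistics.pstdev(changes)  # population standard deviation
standardized = mean_change / sd_change
fraction_worse = sum(c < 0 for c in changes) / len(changes)

print(f"mean change: {mean_change:+.2f} points")
print(f"standardized mean change: {standardized:.2f}")
print(f"fraction who got worse: {fraction_worse:.0%}")

# Reporting each response group separately recovers what the mean hides.
by_direction = {
    "improved": [c for c in changes if c > 0],
    "worsened": [c for c in changes if c < 0],
}
for label, group in by_direction.items():
    print(label, len(group), statistics.mean(group))
```

Here a standardized mean change of roughly 0.3 coexists with a quarter of the cohort ending up 5 points worse. The summary statistic alone cannot distinguish this cohort from one where everyone improved slightly.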

SSRIs often reduce rumination, mood instability, and behavioral volatility—important changes that often sit in shallow parts of the distribution yet show up clearly on symptom forms. At the same time, SSRIs also raise activation, nervous energy, autonomic instability, sexual frustration, sleep fragmentation, nausea, and in some users a restlessness that borders on akathisia. These might appear as one- or two-point increases on a scale that tracks symptoms. But a one-point increase in akathisia from baseline 7 sits on a much steeper part of the curve than the same increase from baseline 2. Trials treat these deltas as morally equivalent. Users experiencing them from elevated baselines, however, would describe them as central to their deteriorating condition.
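The point about baselines can be illustrated with a hypothetical curve. Assuming intensity = BASE ** score with an arbitrary BASE of 2 (an assumption for illustration, not a measured property of any scale), the sketch lists what a one-point increase adds at each starting score.

```python
BASE = 2.0  # ASSUMPTION: arbitrary illustrative steepness


def marginal_cost(baseline: int, base: float = BASE) -> float:
    """Added intensity from a one-point increase starting at `baseline`."""
    return base ** (baseline + 1) - base ** baseline


# Cost of the same one-point delta at every starting score from 0 to 9.
costs = {b: marginal_cost(b) for b in range(0, 10)}
for b, cost in costs.items():
    print(f"{b} -> {b + 1}: +{cost:g}")

ratio_7_vs_2 = costs[7] / costs[2]
print("7->8 versus 2->3:", ratio_7_vs_2)
```

Under this assumption the step from 7 to 8 adds 32 times as much intensity as the step from 2 to 3, even though both register as the same one-point delta on the form.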

Antipsychotics suppress delusions, racing thoughts, and manic pressure, states that are typically at the very high end of the negative tail, so reducing them matters enormously. Trials capture this. The same medications produce akathisia, inner motor tension, and affective flattening. Reported severity is often mild. The underlying intensities can occupy steep regions of their distributions, however. Antipsychotic-induced akathisia might barely register on standard scales despite being a very high-intensity state for those who start at a high baseline (e.g. due to low dopaminergic tone) or who respond poorly to the drug.

Benzodiazepines make these trade-offs most transparent. Acute use suppresses panic, autonomic arousal, early akathisia, and sensory overwhelm, all typically in the steep regions of the valence distribution. Relief is immediate. However, frequent (“as prescribed”) use can cause severe rebounds, and here the overall picture becomes rather grim for many. Multiple symptom spikes that land at once in the steep region of the valence scale can snowball into “benzo hell”. In the aggregate, these symptoms might be recorded as minor shifts across multiple items. But the person who experiences them as multiple intense sensations returning at once will tell a different story. When several long-tailed symptoms rebound in concert, the effect compounds in ways an arithmetic mean can’t possibly do justice to.

The mismatch follows directly from how trials measure and aggregate. Psychiatric tools collect compressed reports. Trials add and average them. People live inside the full intensity structure: correlated, exponential, and with complex interactions between symptoms. When a drug improves shallow domains while perturbing steep ones, the average score often shows improvement while a meaningful subset of patients experiences clear net worsening. This isn’t deception or incompetence; the measurement scale and lived experience run on different geometries.

To sum up. Modest average improvements in trials coexist with large individual harms in practice for straightforward reasons: psychiatric symptoms are long-tailed at the level of actual intensity, these tails often cluster, compressed scores systematically underrepresent the steepest domains, and states like akathisia consume enormous experiential bandwidth while barely registering in the arithmetic mean that drives clinical conclusions.

What changes if we take this seriously? Map individual response patterns separately instead of averaging into groups. Track steep regions of the distribution as a strong signal rather than business as usual. Use criticality and complex systems modeling tools to deal with symptom interactions. With these changes the same drugs would look very different on paper. I am not making a call to abandon psychiatric medication. This is a call to see it more clearly. To build evaluation around the actual geometry of subjective experience rather than the convenience of linear aggregation. Because linear aggregation is fatally misguided.

The data is already there. The variation is already visible to clinicians. The question is whether we organize our measurement to capture it and finally take phenomenology seriously.


[1] Plus some interaction terms between the symptoms, but we’ll leave a deep discussion on that topic for another day.