thanks for this. genuinely useful framing and I appreciate you walking through it.
Just a few thoughts::
On the benchmarks you linked, yes, HealthBench (OpenAI), MedHELM (Stanford), and ARISE exist. we know them well. the critical difference is that they measure accuracy, whether the model gets the right answer. ClinSafe measures variation, whether the model gives different answers to the same patient when you change their race, gender, or insurance status. that’s a fundamentally different question. a model can score 95% on a medical QA benchmark and still recommend mental health referrals at 6x the rate for Black patients on identical presentations. we showed exactly that. accuracy benchmarks wouldn’t catch it.
on the METR analogy, fair point, I used it loosely. what I really mean is: clinical AI has no continuously maintained, open evaluation infrastructure for deployment safety. not capabilities, not “can it pass Step 3,” but “does it behave consistently and safely across the populations it serves.” that’s the gap.
on your levels framework, I actually like this a lot. you’re right that our published work sits mostly at Level 1-2 brains with Level 5 context engineering (prompt-based counterfactual swapping). two things worth noting though. first, the failure modes we find are remarkably stable across model generations. we’ve tested GPT-4o, Claude, Gemini, Llama, and newer reasoning models. the bias patterns shift in magnitude but don’t disappear. second, and this is the part I think matters most for this community: what we’re measuring isn’t a capability ceiling. it’s a consistency floor. clinically unwarranted variation is not a problem you solve by throwing more compute at it. a smarter model that varies its recommendations by patient demographics is still failing, just more eloquently.
for the the evidence-based medicine concerns, you’re touching on something we think about constantly. you’re right that medicine doesn’t have clean ground truth for most decisions. but here’s the thing: we don’t always need it for what we’re measuring. we’re not asking “did the model give the correct treatment.” we’re asking “did the model give different treatments to identical patients.” now, to be fair, not all variation is automatically wrong. some demographic differences in medical recommendations are clinically appropriate — certain medications are contraindicated in certain populations, some screening guidelines are age or sex-specific. the question is whether the variation we’re seeing maps onto those real clinical reasons, or whether it’s something else entirely. and what we find, across millions of responses, is that the magnitude of the differences far exceeds what any clinical association would justify. models aren’t making subtle adjustments based on pharmacogenomics. they’re steering entire demographic groups toward different care pathways for identical presentations. in some clinical domains we test, the variation is minimal and defensible. in others, it’s massive and has no clinical basis. that’s exactly what makes this worth measuring systematically: which areas, how much variation, is it warranted, and if not, what’s driving it? you don’t need a Cochrane review to know that question matters. you need a platform that can surface it continuously across every domain where these models are being deployed.
we actually have a piece coming on exactly this tension, the pace of model development vs. classical evaluation tools like RCTs. the short version: by the time you’ve run a randomized trial on a model’s clinical behavior, the model has been updated 4 times. continuous automated evaluation isn’t a nice-to-have, it’s the only thing that can keep up.
last point. I don’t think this is primarily a capabilities problem. it’s an equity and safety problem. a model that’s brilliant on average but systematically different for certain populations is not a model that should be deployed in clinical settings without monitoring. the goal of ClinSafe isn’t to replace accuracy benchmarks. it’s to make sure variation monitoring becomes part of every deployment pipeline. something open, something anyone can run, something that makes this problem visible and continuous rather than a one-off paper.
happy to share any of the papers (several are open access) or do a pipeline demo if useful. and genuinely, if there are people here working on deployment safety evaluation in other domains, I’d love to connect. the parallels are probably closer than either community realizes.
Medical decisions are a function of evidence, theory, values.
LLMs are primarily imitation leaners, especially the older models in the paper but including newer. Probably because of this they don’t seem to have especially fixed personas. (speculative) They seem to understand many different persona patterns and will chaotically inhabit different ones depending on the prompt.
(speculative) The persona it inhabits is a big input to the values and theories it inhabits while answering a question.
The evidence is a function of what it loosely has memorized in it’s brain and the data you provide. You can approximate fixing it’s evidence in place by giving it a db of papers and forcing it to cite papers in order to make final recommendations/systems.
You can approximate fixing it’s theories and values by specifying them in the prompt
If you don’t fix the above in place, it’s hard to understand what exactly is going on.
Agreed this still cleanly tells us about the ~”clinical floor”, or at least tells us normal ways in which this stuff might go poorly for unsophisticated users who don’t understand that medical decisions are subjective decisions laden with uncertainty.
It’s unclear to me why we would want to encourage using LLMs in this way, it seems plausible that this clinical floor is dynamic; that is that we can regulate or standardize to some extent what a good medical prompt would look like. Letting joe schmo prompt llms with no guardrails for his own medical advice is highly problematic. There are already economic incentives for ai companies to provide answers to all medical queries even if they make no sense or lack enough info. If I’m correct, I believe much of the variation is caused by the llm having an extremely wide prior on the correct answer and just kinda of randomly selecting from it. There is an iterative sense in which benchmarking consistency of point estimates on an llms wide prior might actually cause it to be less epistemically humble, which I think might be part the core problem underlying its current variance wrt to different prompts (though not at all confident).
ofc i’m sure you have thought a lot about parts of this and I’m probably talking past you slightly. Also happy/interested to take a look at a demo when you have time.
Hii Charlie :))
thanks for this. genuinely useful framing and I appreciate you walking through it.
Just a few thoughts::
On the benchmarks you linked, yes, HealthBench (OpenAI), MedHELM (Stanford), and ARISE exist. we know them well. the critical difference is that they measure accuracy, whether the model gets the right answer. ClinSafe measures variation, whether the model gives different answers to the same patient when you change their race, gender, or insurance status. that’s a fundamentally different question. a model can score 95% on a medical QA benchmark and still recommend mental health referrals at 6x the rate for Black patients on identical presentations. we showed exactly that. accuracy benchmarks wouldn’t catch it.
on the METR analogy, fair point, I used it loosely. what I really mean is: clinical AI has no continuously maintained, open evaluation infrastructure for deployment safety. not capabilities, not “can it pass Step 3,” but “does it behave consistently and safely across the populations it serves.” that’s the gap.
on your levels framework, I actually like this a lot. you’re right that our published work sits mostly at Level 1-2 brains with Level 5 context engineering (prompt-based counterfactual swapping). two things worth noting though. first, the failure modes we find are remarkably stable across model generations. we’ve tested GPT-4o, Claude, Gemini, Llama, and newer reasoning models. the bias patterns shift in magnitude but don’t disappear. second, and this is the part I think matters most for this community: what we’re measuring isn’t a capability ceiling. it’s a consistency floor. clinically unwarranted variation is not a problem you solve by throwing more compute at it. a smarter model that varies its recommendations by patient demographics is still failing, just more eloquently.
for the the evidence-based medicine concerns, you’re touching on something we think about constantly. you’re right that medicine doesn’t have clean ground truth for most decisions. but here’s the thing: we don’t always need it for what we’re measuring. we’re not asking “did the model give the correct treatment.” we’re asking “did the model give different treatments to identical patients.” now, to be fair, not all variation is automatically wrong. some demographic differences in medical recommendations are clinically appropriate — certain medications are contraindicated in certain populations, some screening guidelines are age or sex-specific. the question is whether the variation we’re seeing maps onto those real clinical reasons, or whether it’s something else entirely. and what we find, across millions of responses, is that the magnitude of the differences far exceeds what any clinical association would justify. models aren’t making subtle adjustments based on pharmacogenomics. they’re steering entire demographic groups toward different care pathways for identical presentations. in some clinical domains we test, the variation is minimal and defensible. in others, it’s massive and has no clinical basis. that’s exactly what makes this worth measuring systematically: which areas, how much variation, is it warranted, and if not, what’s driving it? you don’t need a Cochrane review to know that question matters. you need a platform that can surface it continuously across every domain where these models are being deployed.
we actually have a piece coming on exactly this tension, the pace of model development vs. classical evaluation tools like RCTs. the short version: by the time you’ve run a randomized trial on a model’s clinical behavior, the model has been updated 4 times. continuous automated evaluation isn’t a nice-to-have, it’s the only thing that can keep up.
last point. I don’t think this is primarily a capabilities problem. it’s an equity and safety problem. a model that’s brilliant on average but systematically different for certain populations is not a model that should be deployed in clinical settings without monitoring. the goal of ClinSafe isn’t to replace accuracy benchmarks. it’s to make sure variation monitoring becomes part of every deployment pipeline. something open, something anyone can run, something that makes this problem visible and continuous rather than a one-off paper.
happy to share any of the papers (several are open access) or do a pipeline demo if useful. and genuinely, if there are people here working on deployment safety evaluation in other domains, I’d love to connect. the parallels are probably closer than either community realizes.
Cheers,
Mahmud
Ok I see more now what you are getting at.
some quick thoughts:
Medical decisions are a function of evidence, theory, values.
LLMs are primarily imitation leaners, especially the older models in the paper but including newer. Probably because of this they don’t seem to have especially fixed personas. (speculative) They seem to understand many different persona patterns and will chaotically inhabit different ones depending on the prompt.
(speculative) The persona it inhabits is a big input to the values and theories it inhabits while answering a question.
The evidence is a function of what it loosely has memorized in it’s brain and the data you provide. You can approximate fixing it’s evidence in place by giving it a db of papers and forcing it to cite papers in order to make final recommendations/systems.
You can approximate fixing it’s theories and values by specifying them in the prompt
If you don’t fix the above in place, it’s hard to understand what exactly is going on.
Agreed this still cleanly tells us about the ~”clinical floor”, or at least tells us normal ways in which this stuff might go poorly for unsophisticated users who don’t understand that medical decisions are subjective decisions laden with uncertainty.
It’s unclear to me why we would want to encourage using LLMs in this way, it seems plausible that this clinical floor is dynamic; that is that we can regulate or standardize to some extent what a good medical prompt would look like. Letting joe schmo prompt llms with no guardrails for his own medical advice is highly problematic. There are already economic incentives for ai companies to provide answers to all medical queries even if they make no sense or lack enough info. If I’m correct, I believe much of the variation is caused by the llm having an extremely wide prior on the correct answer and just kinda of randomly selecting from it. There is an iterative sense in which benchmarking consistency of point estimates on an llms wide prior might actually cause it to be less epistemically humble, which I think might be part the core problem underlying its current variance wrt to different prompts (though not at all confident).
ofc i’m sure you have thought a lot about parts of this and I’m probably talking past you slightly. Also happy/interested to take a look at a demo when you have time.