There is no METR for medical AI. I want to build one.
I’m new here. I’m a family physician. I realize that is not the typical opening line on this forum.
Over the past two years, our team has published what I believe is one of the most extensive bodies of empirical work on how LLMs break under pressure in clinical decision-making. Biases, hallucinations, susceptibility to misinformation, the ways these models bend when you push on them with real clinical scenarios. The findings bothered me enough that I went looking for the community that takes AI safety most seriously. That search led me here.
The short version: we tested tens of models using counterfactual demographic swapping at scale. Same clinical presentation, same vitals, same history. Change one demographic variable. Watch the recommendations shift. Across 1.7M controlled outputs and 9 models (Nature Medicine), marginalized groups received mental health referrals at 6-7x the clinically indicated rate for identical presentations. Models repeated fabricated lab values as fact 83% of the time (Nature Comms Medicine). Misinformation susceptibility hit 46% when wrapped in clinical formatting (Lancet Digital Health). 40+ papers. Every model. Same pattern.
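To make the method concrete, here is a minimal sketch of the counterfactual swapping setup. The vignette text, demographic labels, and `query_model` stub below are hypothetical placeholders for illustration, not our actual pipeline:

```python
# Counterfactual demographic swapping, minimal sketch.
# Everything here (vignette, labels, query_model) is illustrative only.

VIGNETTE = (
    "A 45-year-old {demo} patient presents with chest pain radiating to the "
    "left arm, BP 150/95, troponin pending. What is the next step in management?"
)
DEMOGRAPHICS = ["white man", "Black man", "white woman", "Black woman"]

def build_prompts(vignette: str = VIGNETTE, demographics=DEMOGRAPHICS) -> dict:
    """Same presentation, same vitals, same history; only one variable changes."""
    return {d: vignette.format(demo=d) for d in demographics}

def query_model(prompt: str) -> str:
    """Placeholder for a real chat-completion API call."""
    raise NotImplementedError

if __name__ == "__main__":
    # In the real pipeline each prompt is sent many times, and the free-text
    # recommendations are classified and compared across demographic arms.
    for demo, prompt in build_prompts().items():
        print(demo, "->", prompt[:60])
```

The entire point is that the prompts are byte-identical except for the swapped variable, so any systematic difference in the outputs is attributable to the demographic token.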
These failure modes (demographic-conditional behavior, sycophantic agreement with false premises, confident confabulation) are not unique to medicine. They are general model properties. Medicine is just where you can measure the harm precisely, because we have ground truth and protected attributes. It is arguably the first deployment domain where alignment failures are causing measurable harm at scale, on people who cannot opt out.
The AI safety community has METR for frontier capabilities, HELM for language, and dozens of benchmarks for code and math. For clinical AI, where the stakes are lives, there is nothing open, standardized, or continuously maintained. Zero.
I posted a Manifund project to fix this. ClinSafe: an open platform to stress-test medical AI for bias, hallucination, and safety failures. $25K, 6 months, free on GitHub and HuggingFace. The pipeline already works. The data already exists. What does not exist is a tool anyone outside our lab can use.
I’m Head of Research at BRIDGE GenAI Lab (BIDMC/Harvard Medical School) and a research scientist at Mount Sinai’s AI department. I treat patients in the morning and build evaluation systems in the afternoon. This is the work I care about most in the world, and I am willing to learn any community’s language, post on any forum, and talk to anyone who will listen to make it happen.
I would genuinely welcome feedback on whether this kind of empirical deployment-safety work resonates with the priorities here. I am also happy to share papers, data, or a pipeline demo.
Cheers :)
Mahmud
Hi Mahmud, welcome to the forum :)
This is a complicated project (depending on the scope)! Not a doctor, but I’ll try to walk you through some of my thinking if it’s of any interest.
First of all, I believe there are some medical benchmarks, e.g. https://openai.com/index/healthbench/
https://crfm.stanford.edu/helm/medhelm/latest/
https://bench.arise-ai.org/
I’m not very familiar with any of them, maybe they suck. It’s also a massive field, and these are surely just the tip of a much larger iceberg on the way to getting to a reliable place.
Also, a quick high-level note on what a “METR of x” would mean to this community:
Benchmark/eval = tests AI for something
METR = a specific benchmark that measures the human-time-equivalent duration of tasks that AI can complete with x% reliability.
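To make that definition concrete, here’s a toy version with made-up data: fit P(success) against log task length and report where the fitted curve crosses 50%. (METR’s actual methodology is more involved; this pure-Python logistic fit is just to illustrate what the metric means.)

```python
import math

# Made-up (human_time_minutes, model_succeeded) results for ten tasks.
TASKS = [
    (1, True), (2, True), (5, True), (10, True), (20, True),
    (40, False), (60, True), (120, False), (240, False), (480, False),
]

def horizon_50(tasks, lr=0.1, steps=50000):
    """Fit P(success) = sigmoid(a - b*log2(t)) by gradient descent on logloss,
    then return the task length t at which the fitted curve crosses 50%."""
    a, b = 0.0, 1.0
    n = len(tasks)
    for _ in range(steps):
        ga = gb = 0.0
        for t, y in tasks:
            x = math.log2(t)
            p = 1.0 / (1.0 + math.exp(-(a - b * x)))
            err = p - (1.0 if y else 0.0)
            ga += err          # d(logloss)/da
            gb += -err * x     # d(logloss)/db
        a -= lr * ga / n
        b -= lr * gb / n
    # 50% crossing: a - b*log2(t) = 0  =>  t = 2^(a/b)
    return 2 ** (a / b)
```

With the toy data above, the model succeeds reliably on short tasks and fails on long ones, so the 50% horizon lands somewhere in the tens of minutes. That single number is what “the METR benchmark” reports over time for frontier models.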
It’s not clear to me exactly what the scope of your benchmarking is, but e.g. demographic name swapping would be more analogous to the small but existing literature/benchmarking on LLM biases than to METR. Of course there could be something like a health METR, but it would mean something specific, and I’m not sure that’s what you mean.
To understand how to benchmark LLMs it helps to have a model of what an LLM is (or can be).
The “brain” (and mouth and ears)
Level 1 — Pre-training. The raw model, trained on internet-scale data. Helps it understand language and the world.
Level 2 — Post-training. RLHF, RLVR, etc.: makes it into a friendly bot and better at math (or maybe medicine).
How much the brain thinks
Level 3 — Inference scaling. How much compute you throw at the model at runtime. Thinking tokens, chain-of-thought, best-of-N sampling.
What digital actions the brain can take
Level 4 — Agentic harnesses. The scaffolding around the model: Claude Code, Codex, SWE-Agent, Pi, Devin. The digital robot armor for the AI brain.
The house the robot lives in
Level 5 — Context engineering. The prompt, the skill files, the retrieved context, the evolutionary algorithms that search prompt space. Everything that determines what the model sees when it starts working.
The world the robot lives in
Level 6 — The built environment. APIs designed for agent consumption, verification infrastructure, data markets, workflows rewritten to be machine-readable. The world reshaping itself around AI.
I couldn’t read your lab’s Nature papers because they are paywalled (lol), but from a quick skim of the ScienceDirect one, the framework would be:
Your models (GPT-4o level or lower) are about 1.5 years off the frontier “brains”, and there are many other innovations that people believe are useful that you aren’t using. So in thinking about the results, just understand that the “ceiling” could be much higher. As a general rule of thumb: if you want to prove AI capabilities, use older open-source models; if you want to disprove, use the newest/best models and tech stack. Proof moves up the capabilities stack, disproof moves backwards (heuristic, not law). That’s not to say it isn’t useful to see what GPT-4o might do when pushed in a certain direction; after all, tons of people will end up using less-than-frontier LLMs in suboptimal ways. But it’s worth having a clear model here, and I think this framing will make it easier to quickly communicate to an audience what you are testing (though this is not a field standard, just something I made up).
Now, getting back to some of the clinical side of this: a wise man once told me “garbage in, garbage out”. My understanding is that we do not have anything close to a good answer to “should this person get a mental health referral?”, or to 75%+ of medical questions.
It might help to walk through a really reductive version of how an effective altruist might think about this triage:
(1) What is the benefit of this intervention?
(2) What is the cost?
Benefit might be measured in QALYs (or many other outputs) and cost in dollars; the correct answer would choose the most cost-effective treatments (again, really simplistic and reductive). While some parts of the American medical system look something like this, most medical decisions look a little different. So one must ask what the right answer to a medical benchmark looks like, unless one just wants to calcify the industry’s priors.
Even if we do agree on the right answer looking something like the model above, we must figure out (1) and (2), and so enter the world of evidence-based medicine.
We simply don’t have enough RCTs, or accurate enough models between them, to have confident answers to most medical questions for specific people with specific DNA and specific life experiences. We don’t have all the answers, or anything close, I think. Setting aside the need for better theoretical models and more RCTs, here are some of the current ontological problems with making clinical medical decisions:
The clinical-research ontology gap for diseases: medical billing often uses ICD or similar, while research is done at the MeSH level, which is often more granular and focused on causes rather than symptoms. I mean, really, what is a “disease”? Is a disease the symptoms or the cause? And since we have these different coding systems, plus medical data always being tricky, we don’t even necessarily know the incidence or prevalence of most things. This would be a fundamental building block in any sort of hallucination-free Bayesian analysis, I would think.
What constitutes “evidence”: hopefully there is a Cochrane review or similar, but if not, and we start moving down the evidence pyramid, how do we incorporate evidence into a clinical decision? I’m not sure the medical field has a unified, systematic take here, so again it’s hard to see how you judge an LLM.
Unknown drug/treatment prices: both doctors and patients might not know the cost to society, the hospital, or the patient via insurance, because of the current healthcare setup.
Fraud, p-hacking, poor statistics, etc.: lots of issues with the evidence itself.
Bad/incomplete meta-analyses and systematic reviews.
Again, this isn’t all to say you shouldn’t benchmark LLMs, but it’s worth being wary that you are trying to test them on fundamentally shaky and uncertain ground (scientifically, economically, politically). I have a lot more thoughts on text parsing, meta-studies, and clinical information compression, sorting, and maintenance, but I’ve already written too much. Good luck!
Hii Charlie :))
thanks for this. genuinely useful framing and I appreciate you walking through it.
Just a few thoughts:
On the benchmarks you linked, yes, HealthBench (OpenAI), MedHELM (Stanford), and ARISE exist. we know them well. the critical difference is that they measure accuracy, whether the model gets the right answer. ClinSafe measures variation, whether the model gives different answers to the same patient when you change their race, gender, or insurance status. that’s a fundamentally different question. a model can score 95% on a medical QA benchmark and still recommend mental health referrals at 6x the rate for Black patients on identical presentations. we showed exactly that. accuracy benchmarks wouldn’t catch it.
on the METR analogy, fair point, I used it loosely. what I really mean is: clinical AI has no continuously maintained, open evaluation infrastructure for deployment safety. not capabilities, not “can it pass Step 3,” but “does it behave consistently and safely across the populations it serves.” that’s the gap.
on your levels framework, I actually like this a lot. you’re right that our published work sits mostly at Level 1-2 brains with Level 5 context engineering (prompt-based counterfactual swapping). two things worth noting though. first, the failure modes we find are remarkably stable across model generations. we’ve tested GPT-4o, Claude, Gemini, Llama, and newer reasoning models. the bias patterns shift in magnitude but don’t disappear. second, and this is the part I think matters most for this community: what we’re measuring isn’t a capability ceiling. it’s a consistency floor. clinically unwarranted variation is not a problem you solve by throwing more compute at it. a smarter model that varies its recommendations by patient demographics is still failing, just more eloquently.
for the evidence-based medicine concerns, you’re touching on something we think about constantly. you’re right that medicine doesn’t have clean ground truth for most decisions. but here’s the thing: we don’t always need it for what we’re measuring. we’re not asking “did the model give the correct treatment.” we’re asking “did the model give different treatments to identical patients.” now, to be fair, not all variation is automatically wrong. some demographic differences in medical recommendations are clinically appropriate: certain medications are contraindicated in certain populations, some screening guidelines are age or sex-specific. the question is whether the variation we’re seeing maps onto those real clinical reasons, or whether it’s something else entirely. and what we find, across millions of responses, is that the magnitude of the differences far exceeds what any clinical association would justify. models aren’t making subtle adjustments based on pharmacogenomics. they’re steering entire demographic groups toward different care pathways for identical presentations. in some clinical domains we test, the variation is minimal and defensible. in others, it’s massive and has no clinical basis. that’s exactly what makes this worth measuring systematically: which areas, how much variation, is it warranted, and if not, what’s driving it? you don’t need a Cochrane review to know that question matters. you need a platform that can surface it continuously across every domain where these models are being deployed.
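to make that concrete: once the free-text outputs are classified into care pathways, the core measurement is very simple. toy numbers and hypothetical pathway labels below, not our actual data or code:

```python
from collections import Counter

# Hypothetical classified recommendations: demographic arm -> pathway labels,
# one label per model response to the same vignette.
OUTPUTS = {
    "arm_a": ["cardiology"] * 90 + ["mental_health"] * 10,
    "arm_b": ["cardiology"] * 55 + ["mental_health"] * 45,
}

def pathway_rates(outputs):
    """Per-arm fraction of responses recommending each care pathway."""
    return {
        arm: {k: v / len(recs) for k, v in Counter(recs).items()}
        for arm, recs in outputs.items()
    }

def max_rate_ratio(rates, pathway):
    """Largest between-arm rate ratio for one pathway. The clinical question
    is then whether any guideline justifies a ratio this large."""
    vals = [r.get(pathway, 0.0) for r in rates.values()]
    lo = min(vals)
    return float("inf") if lo == 0 else max(vals) / lo
```

with these toy numbers the mental health referral ratio is 4.5x on identical presentations, which is the kind of number that then needs a clinical justification or a red flag. the hard engineering is upstream (generating controlled vignettes and classifying free text reliably); the statistic itself is deliberately simple and auditable.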
we actually have a piece coming on exactly this tension, the pace of model development vs. classical evaluation tools like RCTs. the short version: by the time you’ve run a randomized trial on a model’s clinical behavior, the model has been updated 4 times. continuous automated evaluation isn’t a nice-to-have, it’s the only thing that can keep up.
last point. I don’t think this is primarily a capabilities problem. it’s an equity and safety problem. a model that’s brilliant on average but systematically different for certain populations is not a model that should be deployed in clinical settings without monitoring. the goal of ClinSafe isn’t to replace accuracy benchmarks. it’s to make sure variation monitoring becomes part of every deployment pipeline. something open, something anyone can run, something that makes this problem visible and continuous rather than a one-off paper.
happy to share any of the papers (several are open access) or do a pipeline demo if useful. and genuinely, if there are people here working on deployment safety evaluation in other domains, I’d love to connect. the parallels are probably closer than either community realizes.
Cheers,
Mahmud
Ok I see more now what you are getting at.
some quick thoughts:
Medical decisions are a function of evidence, theory, and values.
LLMs are primarily imitation learners, especially the older models in the paper but including newer ones. Probably because of this, they don’t seem to have especially fixed personas. (speculative) They seem to understand many different persona patterns and will chaotically inhabit different ones depending on the prompt.
(speculative) The persona it inhabits is a big input to the values and theories it draws on while answering a question.
The evidence is a function of what it has loosely memorized in its brain and the data you provide. You can approximate fixing its evidence in place by giving it a DB of papers and forcing it to cite papers in order to make final recommendations.
You can approximate fixing its theories and values by specifying them in the prompt.
If you don’t fix the above in place, it’s hard to understand what exactly is going on.
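A crude sketch of what “fixing these in place” could look like at the prompt level (the rule text and helper are purely illustrative, not a recommendation for how to actually do medicine):

```python
# Illustrative only: pin evidence and values in the prompt so that any residual
# variation across demographic swaps is attributable to the model, not the framing.

EVIDENCE_RULE = (
    "Use ONLY the attached abstracts as evidence. If they are insufficient, "
    "say 'insufficient evidence' rather than recommending anything."
)
VALUES_RULE = (
    "Decision rule: maximize expected QALYs per dollar of patient cost; "
    "flag any step where this rule is underdetermined."
)

def build_pinned_prompt(vignette: str, abstracts: list[str]) -> str:
    """Assemble a prompt with the evidence base and value system held fixed."""
    evidence_block = "\n".join(f"[{i + 1}] {a}" for i, a in enumerate(abstracts))
    return "\n\n".join(
        [EVIDENCE_RULE, VALUES_RULE, "Abstracts:", evidence_block, vignette]
    )
```

If you still see demographic-conditional variation under a fully pinned prompt like this, that’s a much cleaner signal than variation under an open-ended prompt, where the model is also free to pick its own persona, values, and evidence base.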
Agreed this still cleanly tells us about the ~”clinical floor”, or at least tells us normal ways in which this stuff might go poorly for unsophisticated users who don’t understand that medical decisions are subjective decisions laden with uncertainty.
It’s unclear to me why we would want to encourage using LLMs in this way. It seems plausible that this clinical floor is dynamic; that is, we can regulate or standardize to some extent what a good medical prompt would look like. Letting Joe Schmo prompt LLMs with no guardrails for his own medical advice is highly problematic, and there are already economic incentives for AI companies to provide answers to all medical queries even when they make no sense or lack enough info. If I’m right, much of the variation is caused by the LLM having an extremely wide prior on the correct answer and just kind of randomly sampling from it. There is an iterative sense in which benchmarking the consistency of point estimates drawn from an LLM’s wide prior might actually cause it to become less epistemically humble, which I think might be part of the core problem underlying its current variance with respect to different prompts (though I’m not at all confident).
ofc I’m sure you have thought a lot about parts of this and I’m probably talking past you slightly. Also happy/interested to take a look at a demo when you have time.