Charlie_Guthmann comments on There is no METR for medical AI. I want to build one.

Charlie_Guthmann 11 Mar 2026 22:25 UTC
Ok, I see more clearly now what you are getting at.
Some quick thoughts:
- Medical decisions are a function of evidence, theory, and values.
- LLMs are primarily imitation learners, especially the older models in the paper but the newer ones too. Probably because of this they don’t seem to have especially fixed personas. (speculative) They seem to understand many different persona patterns and will chaotically inhabit different ones depending on the prompt.
- (speculative) The persona it inhabits is a big input to the values and theories it applies while answering a question.
- The evidence is a function of what it has loosely memorized in its brain and the data you provide. You can approximate fixing its evidence in place by giving it a db of papers and forcing it to cite papers in order to make final recommendations/systems.
- You can approximate fixing its theories and values by specifying them in the prompt.
- If you don’t fix the above in place, it’s hard to understand what exactly is going on.
- Agreed, this still cleanly tells us about the ~“clinical floor”, or at least tells us the normal ways in which this stuff might go poorly for unsophisticated users who don’t understand that medical decisions are subjective decisions laden with uncertainty.
- It’s unclear to me why we would want to encourage using LLMs in this way. It seems plausible that this clinical floor is dynamic; that is, we can regulate or standardize to some extent what a good medical prompt looks like. Letting Joe Schmo prompt LLMs with no guardrails for his own medical advice is highly problematic. There are already economic incentives for AI companies to provide answers to all medical queries, even ones that make no sense or lack enough info. If I’m correct, much of the variation is caused by the LLM having an extremely wide prior on the correct answer and more or less randomly sampling from it. There is an iterative sense in which benchmarking the consistency of point estimates drawn from an LLM’s wide prior might actually push it to be less epistemically humble, which I think might be part of the core problem underlying its current variance with respect to different prompts (though I’m not at all confident).
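To make the "fix it in place" idea a bit more concrete, here is a rough Python sketch of what I have in mind. Everything in it is hypothetical and just for illustration: `build_pinned_prompt`, `modal_agreement`, `consistency`, and the `ask` callable (a stand-in for whatever model call you use) are names I made up, and exact-match agreement on normalized answers is only one crude way to score variation.

```python
from collections import Counter
from typing import Callable, Sequence


def build_pinned_prompt(
    question: str, papers: Sequence[str], theory: str, values: str
) -> str:
    """Pin evidence, theory and values in the prompt (all hypothetical inputs)."""
    # Number the papers so the model can be asked to cite them by index.
    evidence = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(papers))
    return (
        f"Evidence you may cite (cite by number, use nothing else):\n{evidence}\n"
        f"Clinical framework to reason with: {theory}\n"
        f"Values to weigh: {values}\n"
        f"Question: {question}\n"
        "Give one recommendation and the citation numbers that support it."
    )


def modal_agreement(answers: Sequence[str]) -> float:
    """Fraction of answers equal to the most common one (1.0 = fully consistent)."""
    if not answers:
        raise ValueError("need at least one answer")
    # Normalize whitespace and case so trivially different strings count as equal.
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)


def consistency(ask: Callable[[str], str], prompts: Sequence[str]) -> float:
    """Send each prompt variant through `ask` and score how often answers agree."""
    return modal_agreement([ask(p) for p in prompts])
```

The comparison I would find informative is `consistency` over paraphrases of a bare question versus over the same paraphrases wrapped in `build_pinned_prompt`. If the pinned version is much more consistent, that would be some evidence the variation comes from unpinned evidence/theory/values rather than from the model itself.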
Ofc I’m sure you have thought a lot about parts of this and I’m probably talking past you slightly. Also happy/interested to take a look at a demo when you have time.