GiveWell’s AI red-teaming limitations aren’t a model problem — they’re an architecture problem
In January 2026, GiveWell published a detailed account of their experiment using AI to red-team their charitable intervention research. It’s worth reading. They were honest about the results: roughly 15–30% of AI critiques were useful, with persistent hallucination, lost context, and unreliable quantitative estimates. They flagged multi-agent workflows as a future possibility but haven’t pursued them.
I think their diagnosis was wrong. The limitations they experienced aren’t ChatGPT’s fault — they’re a consequence of prompt-based, single-pass, monolithic-context architecture. The model is fine. The pipeline isn’t.
So I built a different one.
What I built
A six-stage multi-agent pipeline using only GiveWell’s own public materials — their intervention reports, published AI outputs, and cost-effectiveness spreadsheets. No privileged access. The improvement, if there is one, has to come from methodology alone.
The stages: Decomposer → Investigators (one per scoped thread) → Verifier → Quantifier → Adversarial Pair → Synthesizer.
Three design decisions did most of the work:
Scoped context per agent. No agent gets the whole filing cabinet. Each Investigator gets a CONTEXT.md defining what’s in scope, what data GiveWell uses, what adjustments are already made, and what not to re-examine. This directly targets the lost-context failure mode GiveWell identified.
Verification as a first-class stage. Every citation and factual claim is independently checked by a separate Verifier agent before reaching a human. Hypothesis generation and evidence retrieval are deliberately separated — this is where hallucinations die.
Quantitative grounding via code execution. The Quantifier runs programmatically against GiveWell’s actual CEA spreadsheet. No ungrounded “could reduce cost-effectiveness by 15–25%” claims: every estimate has to name which parameter moves and by how much.
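To make "parameter-linked" concrete: the Quantifier's job is roughly this kind of perturbation analysis, run against the real spreadsheet's formulas rather than the illustrative toy model below (all names and numbers here are made up, not GiveWell's):

```python
def cost_per_daly(params: dict[str, float]) -> float:
    # Toy cost-effectiveness model: cost / (deaths averted * DALYs per death).
    deaths_averted = (params["people_covered"]
                      * params["baseline_mortality"]
                      * params["relative_risk_reduction"])
    return params["program_cost"] / (deaths_averted * params["dalys_per_death"])

def sensitivity(params: dict[str, float], name: str, factor: float) -> float:
    # Re-run the model with one named parameter scaled; report the
    # fractional change in the headline figure.
    base = cost_per_daly(params)
    perturbed = dict(params, **{name: params[name] * factor})
    return (cost_per_daly(perturbed) - base) / base

params = {
    "people_covered": 100_000,
    "baseline_mortality": 0.002,
    "relative_risk_reduction": 0.25,
    "dalys_per_death": 30.0,
    "program_cost": 150_000.0,
}

# A 20% adherence-driven drop in effect size raises cost per DALY by 25%.
delta = sensitivity(params, "relative_risk_reduction", 0.8)  # delta == 0.25
```

A critique that can't be expressed as a `sensitivity(...)` call against a real parameter doesn't get to make a quantitative claim in the final report.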
Phase 1 results: water chlorination
I chose water chlorination first because it’s where GiveWell’s AI output had hallucinated citations — a concrete baseline to beat.
| Metric | GiveWell baseline | Phase 1 result |
|---|---|---|
| Signal rate | ~15–30% | ~90% (28 of 31 critiques) |
| Hallucination rate | Multiple per run | Zero |
| Novel findings | 1–2 | 4 critical, 3 moderate |
| Quantitative specificity | Ungrounded estimates | Parameter-linked sensitivity ranges |
A note on the signal rate: 30 of 31 critiques passed the Verifier, and 28 of those 30 survived adversarial review. To be transparent, a ~90% pass rate may indicate the filters are too permissive rather than the Investigators being unusually precise; most likely it’s some of both. I’m reporting it as-is rather than claiming a clean win.
The 4 critical findings — Cryptosporidium resistance in chlorinated water, age-specific vulnerability patterns, adherence decay over time, and seasonal transmission gaps — are all connected to specific CEA parameters and survived both verification and adversarial challenge. GiveWell’s AI output identified the Cryptosporidium issue but without a verified citation or parameter linkage.
Full write-up, architecture spec, side-by-side comparison with GiveWell’s published output, and all seven agent prompts are at tsondo.com/blog/give-well-red-team.
Two versions for different audiences
If you work at GiveWell or a similar research organization: there’s a manual version — sequential prompts designed to run in a Claude Project with no engineering required.
If you’re a developer: there’s a Python pipeline with the full automated version, including the spreadsheet sensitivity analysis module.
Both are open source. The total API cost for Phase 1 was ~$30.
What I’d like
Direct engagement from anyone at GiveWell, or others who’ve worked on AI evaluation pipelines in research contexts. Phases 2 (ITNs) and 3 (SMC) are in progress.
If the methodology is wrong, I want to know. If it’s useful, I’d rather GiveWell use it than have it sit in a repo.
Reach me at todd@tsondo.com or @tsondo.com on BlueSky.
I am somewhat concerned about data contamination here: are you sure the original GiveWell write-up has at no point leaked into your model’s analysis? I.e., was any of GiveWell’s analysis online before the August 2025 knowledge cutoff for GPT, or did your agents look at the GiveWell report as part of their research?
Hi, Todd! Thank you for engaging with our work and writing up what you found.
Since that original post, we’ve also built a multi-agent system for red-teaming that performs better than the one we described in our post. We made some different decisions around architecture (most of our agents represent different red-teaming “personas,” plus a few quality-control stages), and I’d be curious to hear more about how you approached these architecture decisions.
I’ll reach out about a quick call!
Good to hear! All of my work is on GitHub; please have a look at the results. If my pipeline found something that yours didn’t, it might be worth integrating the methodology.
I’d be very happy to discuss with you at your convenience. I’m on Central European Time (Italy). I also sent you an email via research@GiveWell.org; Hannah says she will pass it on to you.
It’s really cool that you’ve done this and released the code!
Am I understanding right that the GiveWell baseline you’re trying to beat used GPT, while your approach uses Claude? How can you be sure the improvements aren’t due to the model choice rather than the architecture?