GiveWell’s AI red-teaming limitations aren’t a model problem — they’re an architecture problem

In January 2026, GiveWell published a detailed account of their experiment using AI to red-team their charitable intervention research. It’s worth reading. They were honest about the results: roughly 15–30% of AI critiques were useful, with persistent hallucination, lost context, and unreliable quantitative estimates. They flagged multi-agent workflows as a future possibility but haven’t pursued them.

I think their diagnosis was wrong. The limitations they experienced aren’t ChatGPT’s fault — they’re a consequence of prompt-based, single-pass, monolithic-context architecture. The model is fine. The pipeline isn’t.

So I built a different one.

What I built

A six-stage multi-agent pipeline using only GiveWell’s own public materials — their intervention reports, published AI outputs, and cost-effectiveness spreadsheets. No privileged access. The improvement, if there is one, has to come from methodology alone.

The stages: Decomposer → Investigators (one per scoped thread) → Verifier → Quantifier → Adversarial Pair → Synthesizer.

Three design decisions did most of the work:

Scoped context per agent. No agent gets the whole filing cabinet. Each Investigator gets a CONTEXT.md defining what’s in scope, what data GiveWell uses, what adjustments are already made, and what not to re-examine. This eliminates the lost-context failure mode GiveWell identified.
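For concreteness, a scoped CONTEXT.md might look like the fragment below. The headings and contents here are illustrative, not copied from the repo:

```markdown
# CONTEXT.md — Investigator: adherence decay

## In scope
- Chlorination adherence over multi-year program horizons

## Data GiveWell uses
- Adoption figures cited in the published intervention report

## Adjustments already made
- Internal- and external-validity discounts in the CEA

## Do not re-examine
- Baseline efficacy estimates (covered by a separate Investigator)
```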

Verification as a first-class stage. Every citation and factual claim is independently checked by a separate Verifier agent before reaching a human. Hypothesis generation and evidence retrieval are deliberately separated — this is where hallucinations die.

Quantitative grounding via code execution. The Quantifier runs programmatically against GiveWell’s actual CEA spreadsheet. No ungrounded “could reduce cost-effectiveness by 15–25%” without showing which parameter moves and by how much.
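What "parameter-linked" means in practice: perturb one named parameter of the CEA and report the resulting change in the bottom line. The sketch below uses a deliberately toy cost-effectiveness model — it is not GiveWell's spreadsheet structure, just the shape of the sensitivity computation:

```python
def cost_per_death_averted(params: dict[str, float]) -> float:
    # Toy CEA: deaths averted scale with mortality, effect size,
    # adherence, and population; cost-effectiveness is cost over that.
    deaths_averted = (params["baseline_mortality"]
                      * params["relative_risk_reduction"]
                      * params["adherence"]
                      * params["population"])
    return params["program_cost"] / deaths_averted

def sensitivity(params: dict[str, float], name: str,
                low: float, high: float) -> dict[str, float]:
    # Fractional change in cost per death averted when one parameter
    # is moved to its low/high bound, holding everything else fixed.
    base = cost_per_death_averted(params)
    out = {}
    for label, value in (("low", low), ("high", high)):
        perturbed = {**params, name: value}
        out[label] = cost_per_death_averted(perturbed) / base - 1.0
    return out
```

So instead of "could reduce cost-effectiveness by 15–25%", a finding reads "if adherence decays from 0.6 to 0.4, cost per death averted rises 50%", tied to a specific cell.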

Phase 1 results: water chlorination

I chose water chlorination first because it’s where GiveWell’s AI output had hallucinated citations — a concrete baseline to beat.

| Metric | GiveWell baseline | Phase 1 result |
| --- | --- | --- |
| Signal rate | ~15–30% | ~90% (28 of 31 critiques) |
| Hallucination rate | Multiple per run | Zero |
| Novel findings | 1–2 | 4 critical, 3 moderate |
| Quantitative specificity | Ungrounded estimates | Parameter-linked sensitivity ranges |

A note on the signal rate: 30 of 31 critiques passed the Verifier, and 28 of 30 survived adversarial review. I want to be transparent that a ~90% pass rate may indicate the filters are too permissive rather than the Investigators being unusually precise — likely some of both. I’m reporting it honestly rather than as a clean win.

The 4 critical findings — Cryptosporidium resistance in chlorinated water, age-specific vulnerability patterns, adherence decay over time, and seasonal transmission gaps — are all connected to specific CEA parameters and survived both verification and adversarial challenge. GiveWell’s AI output identified the Cryptosporidium issue but without a verified citation or parameter linkage.

Full write-up, architecture spec, side-by-side comparison with GiveWell’s published output, and all seven agent prompts are at tsondo.com/blog/give-well-red-team.

Two versions for different audiences

If you work at GiveWell or a similar research organization: there’s a manual version — sequential prompts designed to run in a Claude Project with no engineering required.

If you’re a developer: there’s a Python pipeline with the full automated version, including the spreadsheet sensitivity analysis module.

Both are open source. The total API cost for Phase 1 was ~$30.

What I’d like

Direct engagement from anyone at GiveWell, or others who’ve worked on AI evaluation pipelines in research contexts. Phases 2 (ITNs) and 3 (SMC) are in progress.

If the methodology is wrong, I want to know. If it’s useful, I’d rather GiveWell use it than have it sit in a repo.

Reach me at todd@tsondo.com or @tsondo.com on Bluesky.