GiveWell’s AI red-teaming limitations aren’t a model problem — they’re an architecture problem
In January 2026, GiveWell published a detailed account of their experiment using AI to red-team their charitable intervention research. It’s worth reading. They were honest about the results: roughly 15–30% of AI critiques were useful, with persistent hallucination, lost context, and unreliable quantitative estimates. They flagged multi-agent workflows as a future possibility but haven’t pursued them.
I think their diagnosis was wrong. The limitations they experienced aren’t ChatGPT’s fault — they’re a consequence of prompt-based, single-pass, monolithic-context architecture. The model is fine. The pipeline isn’t.
So I built a different one.
What I built
A six-stage multi-agent pipeline using only GiveWell’s own public materials — their intervention reports, published AI outputs, and cost-effectiveness spreadsheets. No privileged access. The improvement, if there is one, has to come from methodology alone.
The stages: Decomposer → Investigators (one per scoped thread) → Verifier → Quantifier → Adversarial Pair → Synthesizer.
Three design decisions did most of the work:
Scoped context per agent. No agent gets the whole filing cabinet. Each Investigator gets a CONTEXT.md defining what’s in scope, what data GiveWell uses, what adjustments are already made, and what not to re-examine. This directly targets the lost-context failure mode GiveWell identified.
Verification as a first-class stage. Every citation and factual claim is independently checked by a separate Verifier agent before reaching a human. Hypothesis generation and evidence retrieval are deliberately separated — this is where hallucinations die.
Quantitative grounding via code execution. The Quantifier runs programmatically against GiveWell’s actual CEA spreadsheet. No ungrounded “could reduce cost-effectiveness by 15–25%” claims: every estimate has to name which parameter moves and by how much.
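To make "parameter-linked" concrete: the Quantifier's job is roughly this kind of perturbation analysis, run against the real spreadsheet's formulas rather than the illustrative toy model below (all names and numbers here are made up, not GiveWell's):

```python
def cost_per_daly(params: dict[str, float]) -> float:
    # Toy cost-effectiveness model: cost / (deaths averted * DALYs per death).
    deaths_averted = (params["people_covered"]
                      * params["baseline_mortality"]
                      * params["relative_risk_reduction"])
    return params["program_cost"] / (deaths_averted * params["dalys_per_death"])

def sensitivity(params: dict[str, float], name: str, factor: float) -> float:
    # Re-run the model with one named parameter scaled; report the
    # fractional change in the headline figure.
    base = cost_per_daly(params)
    perturbed = dict(params, **{name: params[name] * factor})
    return (cost_per_daly(perturbed) - base) / base

params = {
    "people_covered": 100_000,
    "baseline_mortality": 0.002,
    "relative_risk_reduction": 0.25,
    "dalys_per_death": 30.0,
    "program_cost": 150_000.0,
}

# A 20% adherence-driven drop in effect size raises cost per DALY by 25%.
delta = sensitivity(params, "relative_risk_reduction", 0.8)  # delta == 0.25
```

A critique that can't be expressed as a `sensitivity(...)` call against a real parameter doesn't get to make a quantitative claim in the final report.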
Phase 1 results: water chlorination
I chose water chlorination first because it’s where GiveWell’s AI output had hallucinated citations — a concrete baseline to beat.
| Metric | GiveWell baseline | Phase 1 result |
|---|---|---|
| Signal rate | ~15–30% | ~90% (28 of 31 critiques) |
| Hallucination rate | Multiple per run | Zero |
| Novel findings | 1–2 | 4 critical, 3 moderate |
| Quantitative specificity | Ungrounded estimates | Parameter-linked sensitivity ranges |
A note on the signal rate: 30 of 31 critiques passed the Verifier, and 28 of those 30 survived adversarial review. To be transparent, a ~90% pass rate may indicate the filters are too permissive rather than the Investigators being unusually precise; most likely it’s some of both. I’m reporting it as-is rather than claiming a clean win.
The 4 critical findings — Cryptosporidium resistance in chlorinated water, age-specific vulnerability patterns, adherence decay over time, and seasonal transmission gaps — are all connected to specific CEA parameters and survived both verification and adversarial challenge. GiveWell’s AI output identified the Cryptosporidium issue but without a verified citation or parameter linkage.
Full write-up, architecture spec, side-by-side comparison with GiveWell’s published output, and all seven agent prompts are at tsondo.com/blog/give-well-red-team.
Two versions for different audiences
If you work at GiveWell or a similar research organization: there’s a manual version — sequential prompts designed to run in a Claude Project with no engineering required.
If you’re a developer: there’s a Python pipeline with the full automated version, including the spreadsheet sensitivity analysis module.
Both are open source. The total API cost for Phase 1 was ~$30.
What I’d like
Direct engagement from anyone at GiveWell, or others who’ve worked on AI evaluation pipelines in research contexts. Phases 2 (ITNs) and 3 (SMC) are in progress.
If the methodology is wrong, I want to know. If it’s useful, I’d rather GiveWell use it than have it sit in a repo.
Reach me at todd@tsondo.com or @tsondo.com on BlueSky.
I am somewhat concerned about data contamination here: are you sure the original GiveWell write-up has at no point leaked into your model’s analysis? I.e., was any of GiveWell’s analysis online before the August 2025 knowledge cutoff for GPT, or did your agents look at the GiveWell report as part of their research?
Hi, Todd! Thank you for engaging with our work and writing up what you found.
Since that original post, we’ve also built a multi-agent system for red-teaming that performs better than the one we described in our post. We made some different decisions around architecture (most of our agents represent different red-teaming “personas,” plus a few quality-control stages), and I’d be curious to hear more about how you approached these architecture decisions.
I’ll reach out about a quick call!
Good to hear! All of my work is on GitHub; please have a look at the results. If my pipeline found something that yours didn’t, it might be worth integrating the methodology.
I’d be very happy to discuss with you at your convenience. I’m on Central European Time (Italy). I also sent you an email via research@GiveWell.org; Hannah says she will pass it on to you.
It’s really cool that you’ve done this and released the code!
Am I understanding right that the GiveWell baseline you’re trying to beat used GPT, while your approach uses Claude? How can you be sure the improvements aren’t due to the model choice rather than the architecture?