Brendan Phillips comments on GiveWell’s AI red-teaming limitations aren’t a model problem — they’re an architecture problem

Brendan Phillips 31 Mar 2026 14:07 UTC
3 points
Hi, Todd! Thank you for engaging with our work and writing up what you found.

Since that original post, we’ve also built a multi-agent system for red teaming that performs better than the one we described there. We made some different architecture decisions (most of our agents represent different red-teaming “personas,” alongside a few quality control stages), and I’d be curious to hear more about how you approach these decisions.

I’ll reach out about a quick call!
- Tsondo 31 Mar 2026 14:29 UTC
  −1 points
  Good to hear! All of my work is on GitHub. Please have a look at the results. If my pipeline found something that yours didn’t, it might be worth integrating the methodology.
  
  I’d be very happy to discuss this with you at your convenience. I’m on Central European Time (Italy). I also sent you an email via research@GiveWell.org; Hannah says she will pass it on to you.