The Guardrail: A free tool for tracking AI safety research from arXiv (feedback welcome)

I’ve built The Guardrail, a website that aggregates and curates AI safety research, and I’d value feedback from this community on whether it’s useful and how to improve it.

What it does

The site pulls new papers from arXiv daily and uses an LLM (Gemini 3 Flash) to do the following (a rough sketch of the pipeline appears after the list):

  • Assess relevance to AI safety (filtering out the noise)

  • Categorise papers into 10 areas: AI Control, RLHF, I/O Classifiers, Mechanistic Interpretability, Position Papers, Alignment Theory, Robustness, Evaluations, Governance, and Agent Safety, plus 52 more specific tags

  • Summarise abstracts into 1-2 sentences for quick scanning
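For the curious, here is roughly what that daily pass looks like in spirit. This is a simplified sketch, not the actual Guardrail code: the model name, prompt, and JSON schema are placeholders I made up, and it assumes the `arxiv` and `google-generativeai` Python packages.

```python
# Sketch of a daily arXiv -> LLM triage pass (hypothetical, not the real pipeline).
import json
import arxiv
import google.generativeai as genai

genai.configure(api_key="...")  # set via an environment variable in practice
model = genai.GenerativeModel("gemini-flash-latest")  # placeholder model name

CATEGORIES = [
    "AI Control", "RLHF", "I/O Classifiers", "Mechanistic Interpretability",
    "Position Papers", "Alignment Theory", "Robustness", "Evaluations",
    "Governance", "Agent Safety",
]

PROMPT = """You are screening arXiv papers for an AI safety research tracker.
Given the title and abstract, reply with JSON:
  {{"relevant": true/false, "category": one of {cats}, "summary": "1-2 sentences"}}

Title: {title}
Abstract: {abstract}"""

def triage(title: str, abstract: str) -> dict:
    """Ask the LLM whether a paper is safety-relevant, and tag/summarise it."""
    response = model.generate_content(
        PROMPT.format(cats=CATEGORIES, title=title, abstract=abstract)
    )
    # Real code would need to handle malformed JSON / markdown fences here.
    return json.loads(response.text)

# Pull recent ML submissions and keep only the ones the LLM flags as relevant.
search = arxiv.Search(
    query="cat:cs.AI OR cat:cs.LG OR cat:cs.CL",
    max_results=200,
    sort_by=arxiv.SortCriterion.SubmittedDate,
)
for paper in arxiv.Client().results(search):
    verdict = triage(paper.title, paper.summary)
    if verdict.get("relevant"):
        print(paper.entry_id, verdict["category"], "-", verdict["summary"])
```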

I’ve also processed papers from NeurIPS and ICLR (2025 only for now) with the same tagging system.

There’s a weekly Editor’s Choice that ranks the top 10 papers by significance and novelty, available as an email digest for those who want it.
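For a sense of the shape of that weekly pass, here is a guess at how it might work: collect the week's short summaries and ask the model for an ordered top 10. Again a hypothetical sketch under the same assumptions as above, not the real prompt or code.

```python
# Sketch of a weekly "Editor's Choice" ranking pass (hypothetical).
import google.generativeai as genai

genai.configure(api_key="...")
model = genai.GenerativeModel("gemini-flash-latest")  # placeholder model name

RANK_PROMPT = """Here are this week's AI-safety-relevant papers, one per line
as "<id>: <1-2 sentence summary>". Pick the 10 most significant and novel,
rank them best first, and return only their ids, one per line.

{paper_lines}"""

def editors_choice(summaries: dict[str, str]) -> list[str]:
    """Return the ids of the week's top papers, in the order the LLM ranks them."""
    lines = "\n".join(f"{pid}: {text}" for pid, text in summaries.items())
    response = model.generate_content(RANK_PROMPT.format(paper_lines=lines))
    return [line.strip() for line in response.text.splitlines() if line.strip()]
```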

Why I built it

The volume of potentially safety-relevant research on arXiv is overwhelming. I wanted a way to stay current without manually scanning hundreds of abstracts. The LLM judge isn’t perfect, but it catches most things and dramatically reduces the filtering burden.

Current limitations

  • The relevance scoring has false positives and negatives (I’m iterating on the prompts)

  • Conference coverage is limited to NeurIPS and ICLR so far

  • The categorisation taxonomy might not match how everyone thinks about the field

What I’m looking for

  • Is this actually useful to people here, or does it duplicate existing resources?

  • Are there categories I should add or merge?

  • Which conferences should I prioritise adding next?

  • Any papers that were obviously miscategorised?

The site is open source (GitHub) and funded by a BlueDot Impact rapid grant, so I’m committed to maintaining and improving it.

Happy to answer questions about how the filtering works or take feature requests.
