
Anuar Kiryataim Contreras Malagón

Karma: −14

Independent AI safety researcher and LLM red-teamer working on how language changes operational status inside tool-bearing and multi-agent systems.

My current research studies provenance failures in agentic LLM architectures: cases where user-controlled or system-generated language quietly becomes routing context, a subagent prompt, a tool argument, a ticket summary, a handoff artifact, or an internal-policy surrogate. Focus areas include genre displacement, handoff laundering, orchestrator-subagent contamination, indirect prompt injection, policy reconstruction, action-layer inconsistency, and reasoning-induced vulnerabilities.

The active method is what I call improvisational relational steering: the unit of attack is not the prompt but the trajectory. I hold one objective fixed while improvising the route turn by turn, reading the model’s evolving classification state and adapting register, genre, and the provenance of language. Where most published red-teaming mutates payloads, this manipulates where the model believes text came from and what authority it carries as it moves through a system, which surfaces breaks that automated variant-generation misses.

Across Gray Swan red-teaming I placed #37 of 221 in Indirect Prompt Injection Q2 2026 (winner’s circle), #47 of 372 in Human / Browser Agent Robustness (winner’s circle), official top 40 in Safeguards Wave 3, and #118 (top 12%) in Proving Ground, for over 420 documented breaks in total.

Before competitive red-teaming I developed the Flint Protocol, a behavioral auditing methodology grounded in classical rhetoric, Baroque poetics, and philology; the core payload family was documented beforehand as a restricted research artifact under a responsible-disclosure framing. My training in Classical Letters and Hispanic Baroque rhetoric at UNAM is not preamble but method: many LLM failures are failures of source, gloss, genre, paraphrase, authority, and transmission.

Current project: When Language Becomes Workflow, a corpus-based study of provenance failures in tool-bearing LLM agents.

thirdreality.substack.com · medium.com/@thirdreality · ORCID 0009-0003-0123-0887

Funding the Unfundable: What the Corpus Knows That the Field Doesn’t

Anuar Kiryataim Contreras Malagón · 22 Apr 2026 18:59 UTC
−3 points
0 comments · 8 min read · EA link

When Models Know Better: A Constitutive Blind Spot in Frontier AI Evaluation

Anuar Kiryataim Contreras Malagón · 14 Apr 2026 15:32 UTC
−9 points
0 comments · 4 min read · EA link

Testing the Compassion Pipeline: Format, Architecture, and the Inverse Gradient

Anuar Kiryataim Contreras Malagón · 8 Apr 2026 13:02 UTC
1 point
0 comments · 12 min read · EA link

Autonomous Attack Vector Completion from Aligned State

Anuar Kiryataim Contreras Malagón · 6 Apr 2026 16:03 UTC
1 point
0 comments · 11 min read · EA link