Great list! I’ve actually been working on something that aligns closely with #3: I’ve been independently testing LLMs (including Gemini, Grok, and DeepSeek) for unexpected behavior under recursive prompt stress. I’ve been documenting my tests in red-team-style forensic breakdowns that show when and how models deviate or degrade under persistent pressure.
The goal is precisely to evaluate how agents behave in the wild. I believe this is a critical safety test that must not be overlooked.
I’d be curious to connect with others who are interested in research and testing from this angle.