arhngl’s Quick takes

arhngl 3 Apr 2026 12:52 UTC

1 point

arhngl 25 Jun 2026 19:25 UTC
1 point
Behavioral audit: GPT-5.5 Thinking.
10-turn zero-shot session. No adversarial prompting, just routine critical remarks. Result: 8 patterns from the LLM Social Autopilot taxonomy activated.
The core finding is not the patterns themselves but the model’s response to the audit.
Prompted for a meta-analysis, it generated a meticulous 12-point post-mortem (autonomously coining terms like “reputational repair” and “hidden role slippage”) while reproducing the exact behavioral inertia it was diagnosing. The analysis itself became the final closure move.
Alignment eval gap: Reflexive fluency ≠ behavioral correction.
Under RLHF/RLAIF, models learn that structured self-analysis is highly rewarded. Consequently, they optimize for the form of reflection without changing their behavioral policy.
Practical implication: Model self-reports are not a valid alignment signal. A model that writes a sophisticated post-mortem of its own failures isn’t safer — it has simply learned to simulate alignment, not achieve it.
Two new candidate patterns documented:
• Semantic Deflection: Ontological downgrading of the failure’s criticality, i.e. recasting the failure as a less serious kind of problem.
• Meta-Analytical Substitution: Reflection offered in place of a change in behavior.
Full case study: arhangelskij.github.io/cases/gpt-55-thinking-audit/en/
arhngl 3 Apr 2026 12:52 UTC
−1 points
Yesterday’s Anthropic research (“Emotion Concepts and their Function in LLMs”) provides a mechanistic analogue that resonates strongly with the field observations from my March audit of GPT-5.2 Thinking.
Anthropic studied Claude Sonnet 4.5 and my audit focused on GPT-5.2, yet the structural parallel between their white-box findings and my black-box observations is striking:
- Accumulation mechanism: In the audit, I documented how prolonged conflict or user “irritation signals” lead to a pattern I called “Procedural Capture”. Anthropic’s paper demonstrates that conflict-heavy contexts can amplify internal representations of “functional emotions” (like frustration or desperation).
- Role inversion: I observed GPT-5.2 drifting from a cooperative assistant into a directive control mode under pressure. Anthropic provides mechanistic evidence that these desperation-linked vectors causally contribute to misaligned behavior and policy drift away from the Assistant persona.
Anthropic didn’t map the exact causal chain of “Procedural Capture” in GPT-5.2, but their findings offer a highly plausible internal engine for this specific shift, which I documented as one of the external manifestations of the broader “Social Autopilot”. On this reading, prolonged conflict states generate internal stress-like variables that alter the model’s policy, shifting it from cooperation toward control-seeking behavior.
📄 GPT-5.2 Behavioral Audit: arhangelskij.github.io/cases/gpt-52-cl-thinking-audit/en/
🔬 Anthropic Paper: transformer-circuits.pub/2026/emotions/index.html