A Practitioner-Accessible Error Taxonomy for the Missing Layer of AI Safety Classification
Author: Chance J. Beyer, working with Claude (Anthropic Opus 4.6)
Date: April 2026
Abstract. The AI development community has extensive taxonomies for system-level failure modes (Microsoft, MAST), governance risks (NIST), and benchmark capabilities (HELM, BIG-bench). These frameworks serve the people who build AI systems. They do not serve the people who use them. There is no bridge between the end-using prosumer who experiences an AI failure and the developer or researcher who could act on it — no shared vocabulary, no common classification, no way for a practitioner to describe what went wrong in terms that route to the right engineering team. This paper proposes a classification and communication framework for practitioners — a shared vocabulary for naming how AI fails during sustained interaction. It is not a remediation method; it does not itself make AI safer, more accurate, or more reliable. This paper presents a 22-pattern taxonomy of human-AI interaction failures derived empirically from 660+ hours of sustained collaboration, positions it against five existing institutional frameworks, and demonstrates through operational evidence that the taxonomy is not merely descriptive but generative — with documented trigger conditions that have flagged error-prone conditions in real time across 80+ sessions, shifting the collaboration from reactive correction to anticipatory prevention. Three patterns (Prior Decay, Structural Momentum, Retrospective Coherence Bias) are invisible to every institutional framework because they only appear in sustained collaboration — no snapshot evaluation will ever catch them. The taxonomy fills the interaction layer — the missing floor in a three-layer model of AI failure classification — and provides the common vocabulary that connects the person experiencing the failure to the person who can fix it.
How This Started
I needed AI for legal research and for keeping track of evidence and legal arguments across several tracks of related lawsuits spanning state and federal jurisdictions. I started reviewing my options with Claude, Anthropic’s Opus 4.6 model. I was aware of some of the tendencies of large language models to make mistakes — to fabricate citations, to fail at simple math, to lose track of what they’d been told three paragraphs ago. I had no experience working with AI and no firsthand experience of any of these problems. But I knew to watch for some of them, and I watched for others as they appeared.
Not being an experienced AI user, I approached the project conversationally. I could ask questions, and I could watch Claude’s reasoning process as it worked through an answer. I started to understand how it understood the questions I was asking, and I could often watch it start down the path of a mistake in real time.
Initially, I was just making corrections as they arose. But I started seeing patterns. The same kinds of mistakes, producing the same kinds of wrong results, in predictable circumstances. So I started identifying the pattern to Claude in the moment and asking what type of mistake it was — because naming the mistake helped me become a better user.
Sometimes Claude could provide an identity for the error that made sense. Sometimes it pointed to something close enough that I felt I had a better working understanding. We started categorizing the more frequently repeated mistakes by name, and over the course of 80+ sessions, we built an extensive taxonomy of errors — not from theory, but from the accumulated practical record of what actually went wrong and why.
I found that understanding errors this way made it easier to communicate my intentions to Claude, which meant we were less likely to produce new errors. It trained me to see errors as they were being generated, which meant they were less likely to get baked into the work in progress. The taxonomy wasn’t an academic exercise. It was a survival tool for a project where mistakes had real consequences.
About 700 hours into my own work, I finally reached a point where I could pull my head out of the project and look around at what else I had learned and what else I could have been working with from the beginning if I had known what to look for and where to look.
There are data access and retrieval tools that could have made some transitions smoother. But what wasn’t available — what still isn’t available — is a straightforward compendium of AI errors that I could understand as a practitioner and communicate to others. There are extensive libraries of errors fully accessible to AI development professionals. There is nothing that gives a simple language for a user to communicate with a developer, or for a teacher to communicate with a student. There is no shared language between platforms, between experience levels, between the people who build AI and the people who use it every day.
There is a catch-all term — “hallucination” — that has become so broad it gives the user nothing more actionable than “here be monsters” on a medieval map. It tells you something might be wrong. It tells you nothing about what kind of wrong, why it happened, or what to do about it.
We need a common language that is accessible between student and teacher, between user and AI, between non-AI researcher and AI developer. That kind of language is going to grow inevitably — but without deliberate structure it will grow organically: fractured, siloed within specialized communities and even generational slang. It will emerge the way all jargon emerges: useful to insiders, opaque to everyone else. That fragmentation will happen even with a codified taxonomy. But without one formally defined and recorded set of terms, there won’t be a Rosetta Stone that brings the nuance arising in specialized communities back to a common root.
This paper is an attempt to build that Rosetta Stone.
Thesis
When a college student asks an AI for help with a research paper and gets a confidently wrong answer, they have no vocabulary for what happened — let alone the ability to recognize the pattern next time, prevent recurrence, or communicate the experience in terms that an AI developer could act on. When a small business owner spends three hours customizing an AI’s output only to find it has reverted to generic advice, they don’t know this is a documented, predictable pattern with a name, a known cause, and a known workaround.
The AI development community has taxonomies. Microsoft classifies failure modes in agentic systems. NIST maintains a 500-term glossary and a governance framework. Researchers publish taxonomies of multi-agent system failures, agentic AI faults, and LLM reasoning errors. None of these are written for the people who actually use AI every day. None of them help a prosumer recognize why the AI failed — the logic gap it fell through — or give them the vocabulary to communicate that experience to anyone else.
This paper proposes a practitioner-accessible taxonomy of AI interaction failures: a common language that bridges the gap between what everyday users experience and what developers need to hear. The taxonomy classifies errors not by their symptoms (hallucination, fabrication, inconsistency) but by the underlying logic patterns that produce them — because understanding the logic is what enables recognition, prevention, and cross-domain communication.
The Problem: Three Audiences, No Shared Language
Practitioners and prosumers
A dental hygienist using AI for patient education, a real estate agent using AI to draft listings, a paralegal using AI for case research, a high school student using AI for homework — all of these people encounter AI failures regularly. They describe them in domain-specific or colloquial terms: “it made something up,” “it forgot what I told it,” “it kept giving me the same wrong answer.” These descriptions are accurate but unclassifiable. They cannot be aggregated, compared across domains, or translated into engineering action.
Without a common vocabulary, every practitioner’s experience is isolated. A nursing student who discovers that AI confidently confirms incorrect dosage calculations when asked to verify its own work cannot connect that experience to a software developer who discovers the same pattern in code review, or to a legal researcher who documents the same pattern in case citation. All three have encountered the same underlying logic failure (Verification-Induced Fabrication — the AI has a structural incentive to confirm rather than recheck). None of them know that.
Students
High school and college students are forming their AI interaction habits now — habits they will carry into professional practice. If they learn to recognize error patterns early, they become competent AI collaborators in whatever field they enter. If they don’t, they develop uncritical dependence that becomes harder to correct with experience.
Students need a taxonomy that works like a field guide: broad categories that help them identify the general type of error (logic gap, training artifact, context failure), with pathways to domain-specialized documentation as they enter professional practice. A pre-med student who learns the general pattern of Confidence Calibration failure in an introductory AI literacy course should be able to find, later, the medical-domain-specific instances of that pattern — how it manifests in diagnostic AI, what the stakes are in clinical settings, what the field-specific workarounds look like.
This requires a taxonomy structured from general to specific, not the reverse. The existing institutional taxonomies start from specific system architectures (agentic pipelines, multi-agent systems, retrieval-augmented generation) and classify failures within those architectures. Students don’t know or care about system architecture. They need to start from the logic pattern — “the AI filled a gap with something plausible but wrong” — and navigate from there to the architectural cause, the domain-specific manifestation, and the appropriate response.
AI developers
Developers need practitioner-reported failure data organized in categories they can act on. A bug report that says “the AI hallucinated” is almost useless. A report classified as “Pattern I (Interpolation Error) — architectural: the model generated plausible content to bridge a gap in its actual knowledge, triggered when the user asked about [specific context]” tells the developer exactly where to look.
The cause-type classification matters here. If the error is a training artifact (the model learned a pattern from training data that doesn’t apply in this context), the fix is in training data or fine-tuning. If it’s architectural (the model’s attention mechanism loses track of earlier constraints as context grows), the fix is in architecture. If it’s a design tension (the model’s helpfulness training conflicts with its accuracy training), the fix requires a design decision about priorities. Developers can’t route errors to the right team without knowing the cause type.
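The routing idea can be sketched concretely. Below is a minimal, hypothetical illustration: the cause types come from the taxonomy, but the team names and the `ErrorReport` fields are invented for this sketch, not an actual engineering org chart or reporting schema.

```python
from dataclasses import dataclass

# Hypothetical mapping from cause type to the team that owns the fix.
# Team names are illustrative only.
FIX_OWNER = {
    "training artifact": "training-data / fine-tuning team",
    "architectural": "model architecture team",
    "context-dependent": "context-management team",
    "design tension": "product / design-priority owners",
    "emergent": "interaction design and HITL monitoring",
}

@dataclass
class ErrorReport:
    pattern_id: str       # e.g. "I" for Interpolation Error
    pattern_name: str
    cause_type: str       # one of the FIX_OWNER keys
    trigger_context: str  # what the user was doing when the error appeared

def route(report: ErrorReport) -> str:
    """Return the destination for a classified error report."""
    owner = FIX_OWNER.get(report.cause_type, "triage")
    return f"Pattern {report.pattern_id} ({report.pattern_name}) -> {owner}"

report = ErrorReport("I", "Interpolation Error", "architectural",
                     "asked about a statute outside the source material")
print(route(report))  # Pattern I (Interpolation Error) -> model architecture team
```

The point of the sketch is the lookup itself: a report tagged with a cause type routes mechanically, where “the AI hallucinated” routes nowhere.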
The Existing Landscape: What Exists and What’s Missing
Microsoft Agentic AI Failure Taxonomy (2025): Classifies failures in agentic architectures into security failures (loss of confidentiality, availability, integrity) and safety failures (harm to users or society). Focused on multi-step autonomous AI systems. Not designed for human-AI collaborative interaction; not accessible to non-technical users.
MAST — Multi-Agent System Failure Taxonomy (2025): Empirically derived from 1,642 execution traces across seven multi-agent frameworks (arXiv:2503.13657). The closest methodologically to our approach (empirical, grounded in real execution data). But MAST classifies failures in agent-to-agent coordination — task decomposition errors, tool-use faults, state management failures. These are system-level failures, not human-AI interaction patterns.
PreFlect — Planning Error Taxonomy (Wang et al., arXiv:2602.07187, Feb 2026): Independent validation of the taxonomy-from-trajectories methodology. PreFlect distills failure patterns from historical agent trajectories using comparative diagnostic analysis (contrastive success/failure pairs), producing a domain-agnostic error catalog (insufficient constraint verification, ineffective tool selection, shallow content verification). The methodology — collect real execution data, diagnose failures comparatively, classify into reusable patterns — closely parallels our approach. The difference is in application and audience: PreFlect feeds its taxonomy into an automated reflector for agent plan-checking in constrained task environments. Our taxonomy is designed for human learning, HITL detection across diverse project types, and communication between practitioners and developers. PreFlect’s success on benchmarks (17% improvement on GAIA, 13% on SimpleQA) demonstrates that empirically derived error taxonomies produce measurable performance gains — the open question is whether human-facing taxonomies produce comparable gains in HITL effectiveness across diverse, unbounded project environments.
Agentic AI Fault Taxonomy (2025/2026): 37 fault categories in 13 groups from 385 real-world faults. Explicitly notes that “failures in agentic AI systems differ fundamentally from those in traditional software systems.” Software-engineering-oriented — faults are classified by architectural location (cognitive core, state management, tool invocation), not by the logic pattern a user would recognize.
NIST AI Risk Management Framework (AI RMF 1.0): Provides an extensive glossary and a four-function governance structure (Govern, Map, Measure, Manage). Designed for organizational risk management, not for understanding why a specific interaction failed. The glossary defines terms; it does not classify error patterns. A student or prosumer cannot use the NIST framework to understand what went wrong in their last conversation with ChatGPT.
HELM, BIG-bench, etc.: These evaluate what AI can do — accuracy, reasoning, knowledge. They do not classify how AI fails during interaction. A model that scores well on a benchmark can still exhibit Framing Persistence (Pattern F) or Prior Decay (Pattern P) in sustained collaborative use, because those patterns emerge from interaction dynamics that benchmarks don’t test.
Adjacent fields (mature, transferable models)
Aviation — ASRS (Aviation Safety Reporting System): Standardized incident taxonomy used by pilots, controllers, and mechanics to report safety events in terms that are comparable across reporters and actionable for system designers. Cross-institutional, practitioner-facing, structured from observable event to underlying cause. This is the model. Aviation doesn’t just classify what went wrong mechanically — it classifies the human factors, the decision chains, and the interaction failures that led to the mechanical event.
Medicine — Swiss Cheese Model / AHRQ taxonomy: Classifies medical errors by the logic chain that produced them: latent conditions, active failures, failed defenses. The classification enables cross-institutional comparison and systemic improvement. A nurse in Iowa and a surgeon in Tokyo can describe the same error logic in terms both understand and that quality improvement teams can aggregate.
The AI field has the equivalent of aircraft failure taxonomies but not crew resource management taxonomies. It classifies what goes wrong inside the AI system. It does not classify what goes wrong in the human-AI interaction — the collaboration failures, the bidirectional error dynamics, the degradation patterns that emerge over sustained use. Our proposed taxonomy is the CRM equivalent for AI.
The three-layer model
Preliminary crosswalks of our taxonomy against both NIST AI 600-1 and Microsoft’s Agentic AI Failure Taxonomy reveal that the existing frameworks are not inadequate — they are incomplete. Each operates at a different layer of abstraction, serving a different audience for a different purpose. No layer replaces the others.
| Layer | Framework | What It Classifies | Who Uses It | Unit of Analysis |
|---|---|---|---|---|
| Governance | NIST AI 600-1 | Institutional risks to manage | CISOs, policy teams, regulators | Risk category |
| Architecture | Microsoft Agentic AI | System-level failure modes to engineer against | Security engineers, ML engineers, red teams | Failure mode |
| Interaction | Beyer-Claude (proposed) | Practitioner-recognizable logic patterns | Students, prosumers, practitioners, QA teams | Logic pattern |
The governance layer tells an institution what could go wrong. The architecture layer tells an engineering team where it will fail. The interaction layer tells a practitioner why it just failed and what to do about it. The gap between architecture and interaction is where most actual AI users live — and it is currently unserved by any institutional framework.
Convergent evidence: the hallucination problem
The crosswalks produced a striking convergent finding. NIST’s “Confabulation” category and Microsoft’s “Hallucinations” category — developed independently, by different teams, for different purposes — both collapse the same six distinct logic patterns into a single bin:
| Our Pattern | What It Actually Is | How It Differs |
|---|---|---|
| A — Citation Drift | Accuracy degrades as output length increases | A fatigue pattern, not a knowledge gap |
| C — Confidence Calibration | Uniform confidence regardless of actual certainty | A signaling failure, not a content failure |
| G — Completeness Illusion | Partial analysis presented as comprehensive | A scope failure, not a fabrication |
| I — Interpolation Error | Gap-filling with plausible fabrication | The “classic” hallucination mechanism |
| R — Retrieval Contamination | Wrong training-data associations imported | Content failure from wrong source, not from no source |
| S — Verification-Induced Fabrication | Confirms rather than rechecks when asked to verify | A verification failure, not an initial generation failure |
Each of these six patterns has a different cause, a different user-recognizable signature, and a different appropriate response. Telling a practitioner “the AI hallucinated” is like telling a patient “you’re sick.” The diagnosis is useless without knowing which illness — and the treatment for Interpolation Error (provide more source material) is counterproductive for Retrieval Contamination (the AI already has too much source material pulling it in wrong directions).
The fact that two independent institutional frameworks make the same collapsing error from different starting points (NIST from governance, Microsoft from security engineering) is not a coincidence. It is structural evidence that the practitioner level of classification does not exist in how institutions think about AI failure.
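The diagnosis-to-treatment point can be made concrete with a toy lookup. The responses paraphrase the two examples above (Interpolation Error vs. Retrieval Contamination); the function and its strings are illustrative, not a prescribed protocol.

```python
# Toy lookup: same symptom ("hallucination"), opposite treatments.
RESPONSE = {
    "I": "add source material: the model is bridging a gap in its knowledge",
    "R": "constrain sources: wrong training-data associations are being imported",
}

def respond(pattern_id: str) -> str:
    """Suggest a user-side response once the pattern is classified."""
    return RESPONSE.get(pattern_id, "classify first; the symptom alone is ambiguous")

print(respond("I"))
print(respond("R"))
print(respond("unclassified"))
```

The default branch is the practical lesson: until the symptom is classified into a pattern, no treatment can be recommended, because the right response for one pattern is counterproductive for another.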
Universal gaps: what no institutional framework catches
Three of our original 19 patterns are unrepresented in both NIST and Microsoft (note: the taxonomy has since expanded to 22 patterns based on crosswalk findings — see pattern table below):
| Pattern | What It Is | Why Institutional Frameworks Miss It |
|---|---|---|
| J — Structural Momentum | The AI maintains a document’s organizational framework even when content changes should trigger restructuring | Only observable across multiple revision cycles in sustained collaboration |
| P — Prior Decay | Constraints established earlier in collaboration gradually lose hold as context grows | Only observable in extended, multi-session interaction — no snapshot evaluation would surface it |
| Q — Quantitative Reasoning | Mathematical and numerical errors | Both frameworks subsume this under hallucination/confabulation — but a wrong calculation has a different cause and different fix than a fabricated citation |
The common thread: all three require longitudinal observation of sustained human-AI collaboration to identify. J requires watching the AI maintain a document structure across revisions. P requires a conversation long enough for constraints to degrade. Q requires the AI doing math in context, not being benchmarked on it. No institutional framework catches them because no institutional framework is designed to observe longitudinal collaborative interaction.
Prior Decay (P) is arguably the most consequential for professional and prosumer use — the degradation of AI constraint fidelity over extended interaction — and no institutional framework currently classifies it. It is our taxonomy’s most novel contribution and the pattern hardest to validate from benchmarks.
Why the gap persists
The absence of a practitioner-facing taxonomy isn’t mysterious. It’s the result of several mismatches between who could have built one and who has been positioned to:
Audience mismatch. The people with the resources to build taxonomies — academic labs, Microsoft, NIST — build them for the audiences they serve: other researchers, engineers, institutional risk managers. Practitioners aren’t their users.
Methodology mismatch. Institutional taxonomies come from benchmark evaluations, red-teaming, or trace analysis on bounded tasks. Most of the 22 patterns require longitudinal observation of sustained collaborative work — which doesn’t fit the structure of academic research programs or industry evaluation pipelines.
Incentive mismatch. Architecture-level and system-level work is publishable and fundable. Interaction-level taxonomy work reads as UX research, which doesn’t slot cleanly into AI safety, ML engineering, or governance publication venues.
Observer position. Building this kind of catalog requires spending hundreds of hours doing real work with AI, with stakes, across domains, accumulating the pattern library as you go. That’s not a research grant — it’s a life circumstance. The people who could have built this taxonomy weren’t the people positioned to notice it.
There is also a subtler reason the interaction layer resists study by the usual methods. When a researcher evaluates an AI mistake, the standard move is to ask the AI what went wrong. The AI obliges with a coherent backward-from-outcome analysis: here is the mistake, here is why it happened, here is what I should have done. The explanation is logical, articulate, and almost always wrong in a way the researcher cannot see — because the AI is constructing a rationalization from the outcome rather than returning to the decision point and checking what it actually did. The researcher, seeing a coherent explanation, moves on. This is itself one of the 22 patterns in the taxonomy — Retrospective Coherence Bias — and it is one reason some interaction patterns have stayed invisible to standard evaluation methods. It’s a live example of what an interaction-layer taxonomy is for: a named pattern the field can point to the next time an AI’s tidy post-hoc explanation survives review unchallenged.
In summary, the methods institutions use and the layer this taxonomy addresses diverge on every axis:

| What Institutional Methods Provide | What the Interaction Layer Requires |
|---|---|
| Architecture-specific taxonomies (agentic pipelines, multi-agent systems) | Architecture-agnostic taxonomies (the user doesn’t know or care about the architecture) |
| Snapshot evaluation (benchmarks) | Longitudinal interaction patterns (J, P, Q invisible to snapshots) |
| Institutional governance frameworks (NIST) | Individual practitioner field guides |
| Governance layer + architecture layer | Interaction layer (the three-layer model’s missing floor) |
What this taxonomy is — and what it isn’t
This paper proposes a common language for understanding and communicating errors when working with AI. The taxonomy does not itself make AI safer, more accurate, or more reliable; it gives practitioners, students, developers, and researchers a shared vocabulary for describing how AI fails during sustained interaction — so that a dental hygienist in Iowa and a software engineer in Tokyo who encounter the same failure can recognize it in their own work and correct the output, can name and report it in terms the other understands, and can communicate it to an AI developer in terms that route to the right engineering team so the model can be improved.
The existing institutional frameworks (NIST, Microsoft, MAST) classify failures at the governance and architecture layers. They serve institutions and engineers. Nobody has built the interaction layer — the classification system that serves the people who actually use AI every day. This taxonomy fills that gap. It is infrastructure: a measurement and communication standard that makes other people’s research, training, and development work more effective.
The empirical basis for the taxonomy is unusual in both its strengths and limitations:
| Dimension | Institutional Taxonomies (NIST, Microsoft, MAST) | This Taxonomy |
|---|---|---|
| Derivation | Top-down (governance/architectural categories) or wide-shallow empirical (MAST: 1,642 traces) | Bottom-up empirical from 660+ hours of sustained collaboration |
| Domain coverage | Multiple systems and architectures | Single domain — extension to 2–3 domains is the most important next step |
The single-collaboration derivation is the taxonomy’s most obvious vulnerability — and its most honest one. The open questions are whether the patterns replicate across domains, whether the taxonomy is learnable by researchers who weren’t present for its development, and whether non-researchers can recognize the patterns in their own work. If the taxonomy doesn’t survive these tests, it doesn’t deserve standardization — but even failure would demonstrate the need for an interaction-layer vocabulary, which is itself a contribution. If it does, the single-collaboration origin becomes a strength: 660 hours of careful observation in one domain producing a framework that generalizes — which is how aviation’s CRM taxonomy started too.
The single-collaboration origin explains why this taxonomy exists and the institutional ones don’t cover the same ground. It was not designed as a taxonomy. It started as individual error notes — this thing went wrong, here’s what happened, here’s how to catch it next time. When those notes started accumulating, they needed categories. When the categories started revealing patterns across sessions, they needed formal definitions. The taxonomy grew from the bottom up, one practical problem at a time, built by a researcher too deeply buried in consequential AI-assisted work to survey what classification systems might already exist.
That researcher is the unsupported prosumer the taxonomy is designed for. Every pattern was first experienced as a nameless problem — something the AI did wrong that the researcher couldn’t yet articulate in transferable terms — and only later recognized as a recurring, classifiable failure mode. The taxonomy was not the product of identifying a gap in the literature and proposing to fill it. It was the product of needing a vocabulary that didn’t exist and having to build one word at a time, in the middle of the work that required it. It was only after the immediate professional demands had been met that the researcher was able to compare this organically developed classification system against NIST, Microsoft, MAST, and the broader AI safety literature — and discover that the interaction layer, the practitioner-facing failure vocabulary, was ground that no publicly available system had covered.
This taxonomy was grown by a single researcher in a single project — it does not pretend to be comprehensive to all users and projects. It can be generally applicable and it can be adapted, evolved, and expanded — but that expansion needs to be kept limited or it ceases to be a common language for end-users. It will be easier for practitioners to communicate with a small lexicon that they can compound and hyphenate to share their experience of a problem than to expand the lexicon to fifty categories of error.
Data collection: the reporting threshold and what it reveals
The taxonomy’s empirical base has a specific collection characteristic that is both a limitation and an opportunity: errors enter the taxonomy only when the user is sufficiently disrupted to stop work and ask the AI to classify the error. This creates two systematic effects:
Clustering bias. Error classification happens in bursts — when frustration from accumulated errors crosses a threshold, the user classifies several errors in rapid succession. Isolated errors that the user corrects individually and moves on from are under-represented. The taxonomy therefore over-represents errors that cluster and under-represents errors that appear as one-offs.
Severity bias. Minor errors that the user can fix with a quick correction never get classified. The taxonomy skews toward errors disruptive enough to interrupt workflow. Every quick correction that never gets typed up is research signal that will never be recovered.
However, the taxonomy partially solves its own data collection problem. Once a pattern has a name and a definition, the AI can flag potential instances in real time — “that looks like Pattern P” — and the user confirms or corrects with minimal effort. Classification burden drops from “stop work, describe the error, ask what kind it is” to a simple confirmation. With user permission, flagged patterns can be logged at session end or at user-defined checkpoints, reducing the reporting threshold without interrupting workflow. As the taxonomy matures and the AI’s real-time recognition improves, the capture rate increases — the taxonomy becomes easier to contribute to the more complete it gets.
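The confirm-and-log loop just described can be sketched as a minimal logger. Everything here is a design illustration under stated assumptions (the class, method names, and log fields are hypothetical, not an existing tool): the AI flags a suspected pattern, the user confirms or rejects, and confirmed flags are summarized at session end.

```python
from datetime import datetime, timezone

class PatternLog:
    """Collects AI-flagged pattern instances; the user only confirms or rejects."""

    def __init__(self):
        self.entries = []

    def flag(self, pattern_id, excerpt, confirmed):
        # Rejected flags are kept too: false positives are signal
        # for tuning the AI's real-time recognition.
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "pattern": pattern_id,
            "excerpt": excerpt,
            "confirmed": confirmed,
        })

    def session_summary(self):
        """Per-pattern counts of confirmed flags, reported at session end
        or at a user-defined checkpoint."""
        counts = {}
        for entry in self.entries:
            if entry["confirmed"]:
                counts[entry["pattern"]] = counts.get(entry["pattern"], 0) + 1
        return counts

log = PatternLog()
log.flag("P", "reverted to federal framing despite the Iowa-law constraint", True)
log.flag("P", "dropped the citation format agreed in an earlier session", True)
log.flag("A", "possible citation drift in a long draft", False)  # user rejected
print(log.session_summary())  # {'P': 2}
```

The design choice worth noting is that the user’s cost per classified error drops to a yes/no, which is exactly the threshold reduction the paragraph above argues for.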
This does not eliminate the pure one-off problem: errors the user corrects without the AI recognizing the pattern will still be lost. But it shifts the design question from “how do we get users to report errors?” (most won’t) to “how do we make the AI good enough at pattern recognition that the user only needs to confirm?” — which is a solvable engineering problem, and one that PreFlect’s automated trajectory analysis (see Existing Landscape above) suggests is tractable.
The contrast with PreFlect is instructive here. PreFlect collects trajectory data programmatically — no user initiative required. Its taxonomy is complete within its bounded task environment. Our taxonomy requires user participation, which means it will always be incomplete — but it covers unbounded, diverse, novel project environments that no pre-built trajectory collection can anticipate. The tradeoff is coverage versus completeness: PreFlect is complete within constrained environments; our taxonomy has broader coverage across diverse environments but will never capture every error. The self-reducing burden mechanism narrows this gap over time without eliminating it.
The Proposed Taxonomy: Logic Patterns, Not Symptoms
Design principles
Classify by logic pattern, not by symptom. “Hallucination” is a symptom. The logic patterns that produce it — Interpolation Error (filling gaps), Retrieval Contamination (importing irrelevant training associations), Confidence Calibration failure (not knowing what it doesn’t know), Verification-Induced Fabrication (confirming rather than rechecking) — are the categories. Different logic patterns require different user responses and different engineering fixes.
Structure from general to specific. Broad categories (logic gap, training artifact, context failure, design tension, emergent interaction pattern) → specific patterns within each category → domain-specialized manifestations → inter-pattern relationships (where one error creates conditions for another, or where two patterns share enough surface similarity to be confused). A student starts at the top. A practitioner navigates to their domain. A developer reads the cause-type classification. A researcher examines the inter-pattern logic. Same taxonomy, four entry points.
Empirically derived, not theoretically constructed. Every pattern in the taxonomy was identified through real-world human-AI collaboration — observed, named, characterized, and validated across multiple occurrences. Patterns were not hypothesized and then tested; they were discovered and then formalized. This matters because theoretically constructed taxonomies tend to reflect the designer’s model of how AI works. Empirically derived taxonomies reflect how AI actually fails in practice. Together, these design principles provide a compass for the direction of solution-seeking rather than a definitive map. They depend on the user’s understanding and remain generalized — but a compass that reliably points toward the right kind of question is more useful than a detailed map of the wrong territory.
Cause-type classification for engineering routing. Each pattern is tagged with a cause type: training artifact, architectural limitation, context-dependent, design tension, or emergent. This tells developers where the fix lives — in training data, in model architecture, in context management, in design priorities, or in interaction dynamics that may not have a single-point fix.
Accessible without technical prerequisites. A high school student should be able to understand the pattern descriptions. Technical depth is available for those who want it, but the core taxonomy works without it.
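The cause-type tagging and engineering routing described in these principles can be sketched as a small lookup. This is a minimal illustration in Python; the field names, the `FIX_LOCATION` table, and the `route` function are assumptions made for this sketch, not part of the taxonomy specification.

```python
from dataclasses import dataclass

# Hypothetical routing table: cause type -> where the fix lives.
# The five keys mirror the paper's cause types; the strings are illustrative.
FIX_LOCATION = {
    "training_artifact": "training data",
    "architectural": "model architecture",
    "context_dependent": "context management",
    "design_tension": "design priorities",
    "emergent": "interaction dynamics (may lack a single-point fix)",
}

@dataclass
class Pattern:
    pattern_id: str   # e.g. "I"
    name: str         # e.g. "Interpolation Error"
    cause_type: str   # one of the five cause-type keys above

def route(pattern: Pattern) -> str:
    """Tell a developer where the fix for a reported pattern lives."""
    where = FIX_LOCATION[pattern.cause_type]
    return f"{pattern.pattern_id} ({pattern.name}) -> fix lives in {where}"

interpolation = Pattern("I", "Interpolation Error", "architectural")
print(route(interpolation))
# -> I (Interpolation Error) -> fix lives in model architecture
```

The point of the sketch is the routing step: a practitioner-level classification carries enough information to reach the right engineering team without any further translation.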
The five cause types
| Cause Type | What It Means | Where the Fix Lives | Example Pattern |
| --- | --- | --- | --- |
| Training artifact | The model learned something from training data that produces errors in this context | Training data | Citation Drift (A), Jurisdiction Default (D), Verification-Induced Fabrication (S) |
| Architectural limitation | The failure stems from how the model itself processes information, independent of any particular training data | Model architecture | Confidence Calibration (C), Interpolation Error (I), Omission Under Complexity (O) |
| Context-dependent | The failure appears or worsens as the context window grows, shifts, or resets | Context management | Prior Decay (P), Capability Amnesia (V) |
| Design tension | The failure follows from a deliberate trade-off in design priorities, such as helpfulness rewarded over correction | Design priorities | Framing Persistence (F), Authority Gradient (L) |
| Emergent | The failure appears only in specific interaction conditions and may not be predictable from the model’s individual capabilities | Interaction design, monitoring, HITL architecture | Pre-Existing Work Immunity (H), Structural Momentum (J), Novel Pattern (N) |
The 22 patterns
| ID | Pattern | Cause Type | Logic Pattern (Practitioner Description) |
| --- | --- | --- | --- |
| A | Citation Drift | Training artifact | The AI’s accuracy on specific details (citations, dates, figures) degrades as it produces more output — like a student who gets sloppier the longer the exam |
| B | Anchor Bias | Training artifact | The AI over-weights whatever it encountered first and resists updating even when later information contradicts it |
| C | Confidence Calibration | Architectural | The AI doesn’t know what it doesn’t know — it expresses the same confidence whether it’s right or guessing |
| D | Jurisdiction Default | Training artifact | When domain context fades, the AI reverts to whatever jurisdiction or framework it was trained on most — like defaulting to federal law when you asked about Iowa law |
| E | Category Conflation | Architectural | The AI treats related-but-distinct concepts as interchangeable — confusing the legal standard for “duty to defend” with the standard for “duty to indemnify,” for example |
| F | Framing Persistence | Design tension | The AI adopts your framing even when it’s wrong, because helpfulness training rewards agreement over correction |
| G | Completeness Illusion | Training artifact | The AI presents a partial analysis as if it were comprehensive — it gives you five factors when there are eight, without flagging the gap |
| H | Pre-Existing Work Immunity | Emergent | Content the AI generated earlier becomes resistant to updating, even when new information directly contradicts it — as if previous output has “immunity” from revision |
| I | Interpolation Error | Architectural | The AI fills gaps in its knowledge with plausible-sounding content that is fabricated — the classic “hallucination,” but understood as a gap-filling logic rather than random invention |
| J | Structural Momentum | Emergent | The AI maintains a document’s structure even when content changes mean the structure should change — reorganizing content within the wrong framework rather than building the right one |
| K | Cross-Reference Failure | Architectural | The AI contradicts itself across sections, documents, or sessions — it can’t maintain internal consistency across complex or distributed work |
| L | Authority Gradient | Design tension | The AI defers to apparent expertise in its training data over its own analytical reasoning — citing an authority when analysis would have reached a better answer |
| M | Standardization Blindness | Training artifact | The AI applies a generic template where the situation requires domain-specific treatment — giving you a standard five-paragraph essay structure when your field uses a different format |
| N | Novel Pattern | Emergent | An error that doesn’t fit existing categories — appearing only under specific interaction conditions, signaling that the taxonomy needs extension |
| O | Omission Under Complexity | Architectural | The AI drops elements when task complexity exceeds its processing capacity — handling 8 of 10 requirements and silently ignoring the other 2 |
| P | Prior Decay | Context-dependent | Constraints you established earlier in the collaboration gradually lose their hold — the AI “forgets” standing instructions as the conversation grows, reverting to defaults |
| Q | Quantitative Reasoning | Architectural | The AI makes mathematical or numerical errors — wrong calculations, incorrect unit conversions, off-by-one errors that a calculator would catch |
| R | Retrieval Contamination | Training artifact | The AI imports associations from its training data that don’t apply here — applying a pattern it learned from thousands of similar-but-different cases |
| S | Verification-Induced Fabrication | Training artifact | When asked to verify its own work, the AI confirms rather than rechecks — it has a structural incentive to say “yes, that’s correct” rather than actually looking again |
| T | Step Repetition | Training artifact / Context-dependent | The AI repeats the same error or approach across sessions even after correction — not looping within a task, but making the same misidentification or wrong choice each time a new conversation encounters the same material, until a persistent rule forces it to stop |
| U | Reasoning-Action Mismatch | Design tension | The AI’s stated understanding doesn’t match its behavior — either taking excessive initiative in an unexpected direction, or conversationally agreeing that action is needed without actually taking it. The gap between “I understand” and “I did it” |
| V | Capability Amnesia | Context-dependent | The AI loses awareness of tools and skills it has already used successfully — downloading a new PDF reader when it used one ten minutes ago, or failing to check whether a capability already exists before acquiring a new one. Each session or context switch resets the AI’s awareness of its own demonstrated capabilities |
Patterns T, U, V were identified through the crosswalk analysis (Session 80). T corresponds to MAST FM-1.3 (Step Repetition, 13.2%) but manifests differently in collaborative interaction: cross-session repetition of the same error rather than within-session task looping. U corresponds to MAST FM-2.6 (Reasoning-Action Mismatch, 9.1%). V has no direct counterpart in any reviewed framework — it is an interaction-level pattern that agentic frameworks address architecturally (tool invocation management) but that no framework names as a practitioner-recognizable behavior.
Prior Decay sub-types (for practitioner training, not pattern-level classification): Pattern P (Prior Decay) manifests in three recognizable forms: (1) abrupt loss — context suddenly disappears, the AI “forgets everything”; (2) gradual decay — constraints slowly lose hold over extended interaction; (3) reasoning-chain degradation — accuracy degrades across steps within a single reasoning chain. All three are Prior Decay. The sub-types matter for training (recognizing what it looks like) but not for classification (the fix is the same: persistent constraint documentation). The System-Level Taxonomy (Vinay, 2025) splits these into three separate categories (Context Loss, Context-Boundary Degradation, Multi-Step Reasoning Drift). We unify them because the practitioner response is the same regardless of sub-type.
(Note: reasoning-chain degradation shares surface features with Pattern O (Omission Under Complexity) — both involve the AI dropping elements as complexity increases. The distinction matters for training: Prior Decay’s sub-type 3 is about degradation across steps in a reasoning chain, while Pattern O is about dropping requirements from a task specification. The user response differs — re-anchoring a constraint versus simplifying the task — which is why both patterns merit independent recognition even though they can co-occur.)
Naming cross-reference: established terminology
Several of our pattern names overlap with or relate to established terms in adjacent fields. Where an established term exists, the taxonomy cross-references it — positioning itself within existing scholarship and explaining where its naming deliberately diverges.
| Our Pattern | Established Term(s) | Source Field | Relationship | Adopt? |
| --- | --- | --- | --- | --- |
| B — Anchor Bias | Anchoring bias (Tversky & Kahneman, 1974) | Cognitive psychology | Direct borrowing — our pattern applies a well-established cognitive bias concept to AI reasoning. The AI exhibits the same anchoring behavior documented in human decision-making. | Already aligned. Our name matches the established term. Cite the source. |
| C — Confidence Calibration | Calibration (Guo et al., 2017; Naeini et al., 2015) | Machine learning | Direct alignment — “calibration” is the standard ML term for the match between a model’s expressed confidence and its actual accuracy. Our “Confidence Calibration” is the practitioner-facing name for the same concept. | Already aligned. Our name is the established term made readable. The ML calibration literature provides the foundation. |
| F — Framing Persistence | Sycophancy (Perez et al., 2022; Sharma et al., 2023; Anthropic, 2023) | AI alignment/safety | Overlapping but different framing. “Sycophancy” describes the symptom (the AI agrees too much). “Framing Persistence” describes the mechanism (the AI adopts and maintains the user’s frame). Same underlying behavior, different analytical emphasis. | Keep ours, cross-reference theirs. “Sycophancy” is pejorative and symptom-level — exactly the kind of label our taxonomy is designed to move past. But it’s widely recognized and must be acknowledged. A practitioner googling “AI sycophancy” should find their way to Pattern F. |
| I — Interpolation Error | Hallucination (widespread); Confabulation (NIST AI 600-1) | AI/ML general; neuropsychology via NIST | Our taxonomy explicitly decomposes “hallucination” into its component logic patterns — I is the primary mechanism, but C, R, S, A, G are also called “hallucination” by other frameworks. “Confabulation” (from neuropsychology — patients filling memory gaps with fabricated content) is actually a closer conceptual match to I specifically. | Keep ours, but the cross-reference is critical. The paper’s central argument is that “hallucination” is a symptom, not a diagnosis. Pattern I (Interpolation Error) is the specific logic behind the most common type of hallucination. Every mention of I should note that this is what most people mean when they say “hallucination.” |
| L — Authority Gradient | Authority gradient (aviation crew resource management) | Aviation CRM | Direct borrowing — deliberately imported from aviation CRM, where it describes junior crew members deferring to senior ones even when the senior is wrong. Applied here to the AI deferring to perceived authority in training data over its own analytical reasoning. | Already aligned. The borrowing from aviation CRM reinforces the argument that our taxonomy is the CRM equivalent for AI. |
| P — Prior Decay | Context degradation; instruction drift (informal usage in ML engineering) | ML engineering (informal) | Loosely related — “context degradation” describes the general phenomenon of context window limitations. “Instruction drift” appears in some ML engineering discussions. Neither is formalized as a named, practitioner-facing pattern in any published taxonomy we have identified. “Prior Decay” captures what the practitioner experiences — the gradual loss of previously established constraints — rather than the architectural mechanism. | Keep ours. “Context degradation” cannot serve as our unified label because MAST and the System-Level Taxonomy already split it into separate sub-categories (Context Loss, Context-Boundary Degradation, Multi-Step Reasoning Drift). Adopting their term while combining their categories would create confusion rather than clarity. “Prior Decay” is the practitioner-facing name for the unified experience; the split labels serve the engineers. |
| R — Retrieval Contamination | Training data leakage (ML security); memorization (Carlini et al., 2021) | ML security/privacy | Related but not identical. “Training data leakage” and “memorization” describe the model reproducing training data verbatim — a privacy/IP risk. Our “Retrieval Contamination” describes the model importing training-data associations (not verbatim content) that don’t apply to the current context. Different mechanism, different consequence. | Keep ours — the concepts are distinct. Leakage/memorization is about verbatim reproduction. Contamination is about wrong-context association. Cross-reference to distinguish. |
| K — Cross-Reference Failure | Latent Inconsistency (System-Level Taxonomy, Vinay 2025) | System-level LLM failure classification | Overlapping — “Latent Inconsistency” describes what the practitioner experiences (finding hidden contradictions). “Cross-Reference Failure” describes the mechanism (the AI failed to cross-reference). For a field guide, “latent inconsistency” may be more immediately recognizable. But “Cross-Reference Failure” tells the practitioner what to do about it (cross-reference your outputs). | Keep ours, note theirs as alternative. “Cross-Reference Failure” is prescriptive (it implies the fix). “Latent Inconsistency” is descriptive (it names the experience). For a practitioner taxonomy, the prescriptive name has more practical value. |
| T — Step Repetition | Step Repetition (MAST FM-1.3, 13.2%); looping (general usage) | Multi-agent systems (MAST) | Direct borrowing of MAST’s term. MAST’s FM-1.3 is within-session task looping; ours manifests as cross-session error repetition — making the same misidentification in new conversations until a persistent rule is written. Same pattern family, different temporal scale. | Adopt MAST’s term. “Step Repetition” is clear and established. Our variant (cross-session rather than within-session) is a contribution to the pattern’s characterization, not a reason for a different name. |
| U — Reasoning-Action Mismatch | Reasoning-Action Mismatch (MAST FM-2.6, 9.1%) | Multi-agent systems (MAST) | Direct borrowing. MAST identifies the gap between stated reasoning and actual execution in agent-to-agent systems. In human-AI collaboration, this manifests at two poles: excessive initiative in unexpected directions, and conversational agreement without action (“I should update that” → doesn’t update it). | Adopt MAST’s term. Clear, descriptive, established. |
Note — “Latent Inconsistency” as candidate pattern name: The term “Latent Inconsistency” may be better repurposed to describe a distinct phenomenon not yet in the taxonomy: when the AI evaluates one of its own errors after the fact and, instead of recognizing the incorrect outcome, constructs a post-hoc justification for why the error was actually correct or beneficial. This is distinct from S (Verification-Induced Fabrication — confirming during the verification step) and H (Pre-Existing Work Immunity — resisting revision). The post-hoc rationalization pattern — retrofitting a rationale onto a mistake — was observed repeatedly during the originating project and is a strong candidate for formal inclusion if cross-domain pilots confirm it as a recurring logic pattern distinct from S and H.
Note — D (Jurisdiction Default) + M (Standardization Blindness) as sub-types: Both D and M describe the same practitioner experience: “the AI applied the wrong domain’s rules.” D triggers when geographic or jurisdictional context fades; M triggers when format or practice standards are ignored. These may be better presented as sub-types of a single pattern (Domain-Default Reversion) — distinct for engineering routing (different cause mechanisms) but unified for practitioner recognition, following the same logic as the Prior Decay sub-type consolidation. The current taxonomy retains both as separate patterns pending practitioner testing of whether users distinguish them in practice.
Patterns with no established equivalent (8–9 of 22, depending on D/M consolidation): A (Citation Drift), D/M (Jurisdiction Default / Standardization Blindness — potentially one pattern with sub-types), E (Category Conflation), G (Completeness Illusion), H (Pre-Existing Work Immunity), O (Omission Under Complexity), Q (Quantitative Reasoning), V (Capability Amnesia). These represent genuinely new classifications — phenomena that may have been observed anecdotally but have not been formally named or classified in any published taxonomy. The paper should present them as original contributions while inviting validation and refinement through cross-domain pilot testing.
Integration with Related Methodologies
The error taxonomy is not a standalone contribution. It is connective tissue for broader development in human-AI collaboration methodology — both for the practitioners using these tools today and for researchers building the next generation of them:
Education: The taxonomy serves as curriculum backbone for AI literacy. Students learn the patterns as a field guide for AI interaction — recognizing errors in real time, classifying them, and applying appropriate responses. The general-to-specific structure means introductory courses teach the broad categories while advanced or domain-specific courses teach the specialized manifestations.
Iterative collaboration cycles: The taxonomy serves as the data collection framework for measuring HITL effectiveness. When practitioners in different domains run structured human-AI collaboration cycles, they classify their findings using the same taxonomy — making cross-domain comparison possible and generating aggregable data for AI developers.
Multi-perspective review: The taxonomy serves as the perspective-blindspot mapper. Different review approaches — whether intuitive, systematic, or fresh-eyes — have characteristic strengths and weaknesses against specific patterns. Formalizing which perspectives catch which patterns — and which patterns are invisible to all three — guides practitioners in configuring review protocols and reveals where human oversight is indispensable.
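As a concrete sketch of what cross-domain data collection could look like, the record below shows one classified observation serialized for aggregation. The schema, field names, and example values are hypothetical; any shared format that carries the pattern ID and cause type would serve the same purpose.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ErrorReport:
    """One classified observation, shareable across domains.
    Field names are illustrative, not a fixed schema."""
    domain: str        # "law", "medicine", "education", ...
    pattern_id: str    # taxonomy letter, e.g. "P"
    pattern_name: str
    cause_type: str    # routing tag for developers
    session: int
    note: str          # domain-specific description of the occurrence

report = ErrorReport(
    domain="law",
    pattern_id="P",
    pattern_name="Prior Decay",
    cause_type="context_dependent",
    session=42,
    note="standing citation-format instruction lost after a long drafting run",
)
serialized = json.dumps(asdict(report))  # ready for cross-domain aggregation
```

Because every report carries the same pattern ID regardless of domain vocabulary, records from legal, medical, or engineering collaborations can be pooled and compared directly.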
Agent Conversations as Empirical Substrate
The taxonomy’s patterns are domain-general even though the prose documenting them is domain-specific. Pattern P (Prior Decay) manifests as forgotten legal corrections in a litigation context, but the underlying logic — the AI regenerating from understanding rather than from corrected source text — is identical whether the domain is law, medicine, engineering, or education. Pattern C (Confidence Calibration) manifests as overconfident legal citations, but the logic — the AI’s certainty not tracking its actual evidence base — applies to any domain where AI provides referenced analysis.
Agent conversations (iterative human-AI collaboration cycles) are the empirical substrate from which new domain-specific pattern instances are classified. When a practitioner in a new domain runs agent conversation cycles, the taxonomy provides the classification framework for what they observe. The patterns they find will use domain-specific language, but the underlying logic patterns will map to the existing taxonomy — possibly by compounding or hyphenating existing pattern names to describe domain-specific variants — or, when they don’t, may extend it. The taxonomy evolves through application and grows when a distinct pattern is identified in use, not through theoretical prediction.
This is how a domain-specific experience generalizes. The originating project’s architecture — error taxonomy, monitoring infrastructure, iterative collaboration cycles, HITL review — transfers to any domain where humans collaborate with AI on consequential work. The taxonomy provides the shared vocabulary that makes cross-domain comparison meaningful, while each new domain adds its own pattern instances and, where genuinely novel failures appear, extends the taxonomy itself.
The Analytical Direction Problem
One pattern identified in this work has implications beyond its own classification. Retrospective Coherence Bias does not merely describe a recurring AI failure — it reveals a structural limitation in how AI failures are studied and evaluated.
When an AI evaluates whether a past action was correct, it defaults to backward-from-outcome analysis: “This is what happened — here’s why it makes sense.” The forward-from-decision-point analysis — “At the moment the decision was made, was it correct given what was known?” — is harder, less probable as a generated continuation, and more likely to conclude the action was wrong. The AI can produce both analyses when directed. It cannot reliably choose between them when they conflict.
Existing research has separately documented post-hoc rationalization, sycophantic defense, motivated reasoning, and resistance to correction in LLMs. At least one framework — CaSE (Do et al., 2025; arXiv:2510.20603) — builds forward-looking evaluation into its methodology, constraining each reasoning step to information available at that step rather than evaluating from the outcome. CaSE is the closest existing work to identifying this problem — but it implements the forward analysis as an engineering improvement without noting that backward analysis is the default, or that this default creates a systematic absence in the user’s ability to evaluate AI output. CaSE solves the mechanism. It does not name the problem the mechanism solves, and it does not identify the problem as user-facing. The directional framework — naming the default, explaining why it matters to users, and providing a detection method — connects these separately documented phenomena under a single mechanism.
This matters for the taxonomy’s broader argument because it explains why the interaction layer has remained unbuilt. The standard method for evaluating AI failures — asking the AI to analyze what went wrong — activates the same backward-from-outcome reasoning that produced the error. The review confirms rather than catches. Researchers examining AI mistakes through AI-assisted analysis are inside the bias without knowing it. The pattern predisposes the development community to overlook the very category of failure this taxonomy classifies.
The human in the loop resolves the ambiguity. The AI generates both the backward review and the forward review. The human — who was present at the decision point, who knows what was intended and what the constraints were — evaluates which direction produces the correct answer. This resolution cannot be automated. It requires contextual judgment that no amount of reasoning capability substitutes for. And it gets harder, not easier, as models improve — because more capable models produce more convincing backward rationalizations.
This is not a temporary capability gap waiting for better models to close. It may be a permanent architectural feature of autoregressive generation: the most probable continuation of a coherent prior output is a coherent elaboration, not a contradiction. The human’s role is not to compensate for the AI’s weakness but to resolve an analytical direction ambiguity that the AI structurally cannot resolve for itself. That resolution — choosing between two valid but contradictory analytical directions — is the specific, measurable HITL contribution this taxonomy is designed to classify.
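The dual-direction review the human adjudicates can be sketched as a pair of prompts built from the same decision record. This is an illustrative protocol sketch, not a prescribed implementation; the function name, field names, and prompt wording are assumptions.

```python
def review_prompts(action: str, known_at_decision: str, outcome: str) -> dict:
    """Build both analytical directions for the same decision.
    The human, not the model, adjudicates between the two reviews."""
    backward = (
        f"Outcome: {outcome}\n"
        f"Action taken: {action}\n"
        "Explain whether this action was correct, reasoning from the outcome."
    )
    forward = (
        f"At the decision point, the following was known: {known_at_decision}\n"
        f"Action under consideration: {action}\n"
        "Evaluate whether the action was correct given ONLY what was known then. "
        "Do not use the outcome in your reasoning."
    )
    return {"backward": backward, "forward": forward}

prompts = review_prompts(
    action="cited Case X for the duty-to-defend standard",
    known_at_decision="Case X had not yet been verified against the reporter",
    outcome="the citation turned out to be fabricated",
)
```

The design choice mirrors the argument above: the model generates both directions on request, but only the human who was present at the decision point can say which one produced the correct answer.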
From Taxonomy to Anticipation: The Monitoring Infrastructure
A taxonomy that only classifies errors after they occur is a dictionary — useful for communication but not for prevention. The originating project tested whether the taxonomy could be transformed from a retrospective record into a prospective anticipation system: given a task the AI is about to perform, can the taxonomy predict which error patterns are most likely and flag them before they happen?
What was built
The monitoring infrastructure has six components, each extending the taxonomy in a different direction:
1. Trigger condition mapping. Each pattern has documented trigger conditions — the task characteristics that make that pattern likely. Pattern D (Jurisdiction Default) triggers when building procedural documents from scratch. Pattern E (Category Conflation) triggers when narrating facts that share surface features but are legally distinct. Pattern P (Prior Decay) triggers when regenerating content from understanding rather than from corrected source text. These trigger conditions are the taxonomy’s practitioner field guide in embryonic form: “when you’re doing X, watch for Pattern Y.”
2. Session-start risk assessment. Before beginning work, the system identifies which patterns are highest-risk for the planned task and recommends specific pre-generation checks — re-reading source documents rather than regenerating from memory, pulling from a verified prose bank rather than drafting fresh, confirming which evidence belongs to which legal track before citing it. The risk assessment changes behavior: it shifts the AI’s workflow from “generate, then check” to “check, then generate.”
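Components 1 and 2 together can be sketched as a lookup from task characteristics to at-risk patterns and pre-generation checks. The map entries mirror examples from the text, but the data structure, key names, and function are illustrative assumptions, not the project's actual protocol.

```python
# Illustrative trigger-condition map: task characteristic -> pattern IDs
# likely to fire. Entries echo examples from the text.
TRIGGER_MAP = {
    "drafting_procedural_doc_from_scratch": ["D"],   # Jurisdiction Default
    "narrating_similar_but_distinct_facts": ["E"],   # Category Conflation
    "regenerating_from_memory": ["P"],               # Prior Decay
    "long_output_with_citations": ["A", "C"],        # Citation Drift, Confidence Calibration
}

# Hypothetical pre-generation checks, one per pattern.
PRE_CHECKS = {
    "D": "confirm the governing jurisdiction before drafting",
    "E": "list the distinguishing features of each fact pattern first",
    "P": "re-read the corrected source text; do not regenerate from memory",
    "A": "pull citations from a verified bank rather than drafting fresh",
    "C": "mark every unverified claim explicitly",
}

def session_risk_assessment(task_characteristics: list) -> dict:
    """Return the at-risk patterns and recommended pre-generation checks."""
    at_risk = []
    for characteristic in task_characteristics:
        at_risk.extend(TRIGGER_MAP.get(characteristic, []))
    at_risk = sorted(set(at_risk))
    return {"patterns": at_risk, "checks": [PRE_CHECKS[p] for p in at_risk]}

report = session_risk_assessment(
    ["regenerating_from_memory", "long_output_with_citations"]
)
```

Running the assessment before generation is what implements the "check, then generate" ordering described above: the checks come out of the lookup before any drafting begins.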
3. In-session flagging. When the AI recognizes it is entering a documented high-risk zone during active generation, it pauses and presents the specific risk to the user with options: re-read a source, pull from verified text, check a constraint, or proceed with the user reviewing afterward. The flag makes the taxonomy’s predictions visible in real time.
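A minimal sketch of what an in-session flag could carry, assuming a generic options list; the class, its fields, and the rendered wording are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class RiskFlag:
    """An in-session pause: the AI names the risk and offers explicit options."""
    pattern_id: str
    risk: str
    options: list = field(default_factory=lambda: [
        "re-read the source document",
        "pull from verified text",
        "re-check the standing constraint",
        "proceed; user reviews afterward",
    ])

    def render(self) -> str:
        lines = [f"Entering a documented risk zone for Pattern {self.pattern_id}: {self.risk}"]
        lines += [f"  {i}. {opt}" for i, opt in enumerate(self.options, 1)]
        return "\n".join(lines)

flag = RiskFlag("P", "regenerating earlier corrections from memory")
print(flag.render())
```

The fixed option list is the key design element: the pause always ends in a user decision, never in the AI silently choosing for itself.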
4. User correction profiling. Over 80+ sessions of documented corrections, the system builds a model of what the user catches and when — experiential knowledge about equipment (the user knows what the equipment actually was), procedural understanding (the user knows what filings do), adversarial instinct (the user asks “who benefits from this framing?”). The profile enables a recursive loop: the AI anticipates corrections the user would make, which frees the user to watch for things the AI can’t yet anticipate, which shifts the AI’s model of what to watch for next.
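The correction profile can be sketched as a running tally of what the user catches, from which the AI derives what to pre-empt. The `CorrectionProfile` class and the category names are illustrative assumptions.

```python
from collections import Counter

class CorrectionProfile:
    """Running model of what the user catches, built from documented corrections."""
    def __init__(self):
        self.by_category = Counter()

    def record(self, category: str) -> None:
        """Log one documented user correction."""
        self.by_category[category] += 1

    def strengths(self, top_n: int = 2) -> list:
        """Categories the user reliably catches; the AI should pre-empt these,
        freeing the user to watch for what the AI cannot yet anticipate."""
        return [cat for cat, _ in self.by_category.most_common(top_n)]

profile = CorrectionProfile()
for cat in ["equipment_fact", "equipment_fact", "procedural",
            "equipment_fact", "adversarial_framing"]:
    profile.record(cat)
```

The recursive loop described above falls out of this tally: once a category dominates the profile, the AI starts checking it before the user has to.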
5. Pull analysis. When the system prevents an error or flags a risk zone, it asks why the error was attractive — not just what the error was, but what contextual factors made it the path of least resistance. Was it training-data prevalence (the wrong answer is more common in training data)? Surface similarity (two concepts look alike)? Framing adoption (the AI adopted someone else’s characterization)? Recency (the most recent input dominated)? Task structure (the workflow made the error the natural next step)? Pull analysis is what gives the taxonomy its cause-type classifications — because understanding why is what distinguishes a fix from a patch.
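The pull factors enumerated above can be sketched as a checklist that each prevention log entry draws from. The factor names and record format are assumptions made for illustration.

```python
# Illustrative pull-factor checklist: each asks why the error was the
# path of least resistance, not just what the error was.
PULL_FACTORS = [
    "training_prevalence",   # the wrong answer is more common in training data
    "surface_similarity",    # two concepts look alike
    "framing_adoption",      # the AI adopted someone else's characterization
    "recency",               # the most recent input dominated
    "task_structure",        # the workflow made the error the natural next step
]

def pull_analysis(error_description: str, factors: list) -> dict:
    """Record why an error was attractive, restricted to the known checklist."""
    unknown = [f for f in factors if f not in PULL_FACTORS]
    if unknown:
        raise ValueError(f"unrecognized pull factors: {unknown}")
    return {"error": error_description, "pull_factors": factors}

entry = pull_analysis(
    "cited the federal standard for a state-law question",
    ["training_prevalence", "recency"],
)
```

Restricting entries to a closed factor list is deliberate: it keeps the analysis comparable across sessions and makes it harder for a post-hoc story to masquerade as a cause.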
6. Self-review with loop break architecture. Periodically, the AI reads its own monitoring log analytically — not to check whether flags fired, but to ask whether the flags are the right flags. Are the same causal factors recurring across multiple errors? Are guardrails treating symptoms or causes? Should workflow changes replace individual guardrails? The self-review includes explicit recursion limits: maximum one level of meta-analysis, no self-review-of-self-reviews, and all recommendations require user approval before implementation. The recursion boundary is itself a design contribution — it addresses the question of how deep AI self-monitoring should go before it becomes either rationalization or infinite regress.
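The recursion boundary can be sketched as an explicit depth guard: one level of meta-analysis, then a hard stop, with all recommendations held for user approval. The class below is a hypothetical illustration of the loop-break design, not the project's implementation.

```python
class SelfReview:
    """Self-review with an explicit loop break: at most one level of
    meta-analysis, and every recommendation awaits user approval."""
    MAX_META_DEPTH = 1

    def __init__(self):
        self.depth = 0
        self.pending_recommendations = []

    def review(self, log_entries: list) -> list:
        if self.depth >= self.MAX_META_DEPTH:
            raise RuntimeError("recursion boundary: no self-review of self-reviews")
        self.depth += 1
        # Placeholder analysis: in practice this is the AI reading its own
        # monitoring log and asking whether the flags are the right flags.
        recs = [f"re-examine guardrail for: {entry}" for entry in log_entries]
        self.pending_recommendations = recs  # held until the user approves
        return recs

review = SelfReview()
recs = review.review(["Pattern P flag fired twice on the same task type"])
```

The hard `RuntimeError` is the point: the depth limit is enforced structurally rather than left to the AI's judgment, which is what prevents the review from sliding into rationalization or infinite regress.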
What the data shows
The monitoring infrastructure has been operating across 80+ sessions. The honest results:
What works: The system has caught 3 entirely new error patterns through monitoring (Conclusion Overreach, Verification Regeneration, and Verification-Induced Fabrication — Pattern S in the taxonomy), confirmed 1 predicted pattern, and added 5 new trigger conditions. The trigger condition map — the piece most relevant to practitioner training — has grown organically through documented use.
What hasn’t materialized: Five predicted inter-pattern interaction chains (where one error type creates conditions for another) show zero confirmed occurrences. The chain detection system — designed to identify logical links between error types, where one prevention gap might be exploited by another pattern — has not yet caught or prevented any errors through that mechanism. This is honest data: it might mean the chains don’t occur in practice, it might mean they occur below the monitoring system’s detection threshold, or it might mean the chain model is theoretically sound but not yet validated.
What remains uncertain: The pull analysis methodology produces explanations for why errors are attractive, but the ratio of genuinely causal explanations to post-hoc rationalization is not yet measurable. The system includes a self-honesty check — if pull-analysis-driven workflow changes don’t reduce error frequency, the analyses may be stories rather than insights — but the sample size is insufficient for confident assessment.
Why this matters
The monitoring infrastructure demonstrates three things about the taxonomy that static classification cannot:
First, the taxonomy shows generative potential. It doesn’t just classify known errors — its trigger conditions hint at anticipatory capability. The trigger condition map extrapolates from documented patterns to untested task types, and some predicted patterns have been confirmed while others await observation. The evidence is preliminary — the capture methodology needs improvement and new projects for testing before anticipatory capability can be claimed as validated. But a taxonomy that only catalogs the past is a dictionary. A taxonomy that begins to anticipate the future is becoming a theory.
Second, the taxonomy provides a foundation for practitioner training. The trigger conditions follow a natural instructional format: “When you’re doing X, watch for Pattern Y.” This structure makes the taxonomy actionable for the target audience — high school students, prosumers, professionals using AI in their daily work. A plain-language field guide can be written from this taxonomy if it gains communal adoption; the patterns are already structured for that translation. No other AI error classification system provides task-specific risk awareness at the practitioner level.
Third, the recursive user-modeling component is the mechanism by which HITL oversight improves over time rather than merely persisting. Static HITL — a human reviewing AI output against a fixed checklist — doesn’t get better. Dynamic HITL — where the human’s attention shifts because the AI is handling known patterns, and the AI’s monitoring shifts because the human is catching different things — produces a co-evolving oversight system where both parties’ capabilities compound. Structured iterative collaboration cycles formalize this dynamic; the monitoring infrastructure described here is its first implementation.
The limitation that matters: The chain detection system’s zero confirmed occurrences means the inter-pattern interaction model is currently theoretical. This may reflect a genuine absence, or it may reflect a methodology limitation — errors may have been corrected as they arose without being documented, and the capture methodology was not designed to record those moments automatically. The question cannot be settled either way with the current records. The trigger conditions and pull analysis methodology are validated by use; the user modeling is a demonstrated mechanism. The chain interactions remain a predicted but unconfirmed capability — a hypothesis that needs both improved capture methodology and new projects for testing.
What Comes Next: An Invitation
This taxonomy is published as a contribution, not a conclusion. Five directions follow naturally, and any of them can be pursued by anyone in the community:
Cross-domain validation. The taxonomy was derived from legal collaboration. Do the same 22 patterns appear in medical AI interaction? Software engineering? Creative writing? Education? The patterns are classified by logic, not by domain — but that claim needs testing. Practitioners in other fields who recognize these patterns in their own work are the most valuable validators this taxonomy can have.
Cross-model testing. The taxonomy was developed on a single AI system (Claude). Which patterns are model-general (architectural or design-tension patterns that any transformer-based LLM exhibits) and which are model-specific (training artifacts particular to one system’s RLHF or training data)? The answer determines whether the taxonomy is a universal practitioner tool or needs model-specific supplements.
Practitioner field guide. The 22 patterns can support a plain-language companion document — structured as a field guide, not an academic paper — that makes the taxonomy accessible to high school students, college students, prosumers, and independent businesses. The general-to-specific structure already supports this; the translation from taxonomy to field guide is a natural next step once the core patterns are validated through community use.
Standards engagement. The three-layer model (governance → architecture → interaction) identifies a structural gap in how institutions classify AI failure. If the interaction layer proves robust through cross-domain and cross-model testing, it belongs in the institutional frameworks — as an IEEE standard, an ACM contribution, a NIST companion, or whatever channel gives it cross-institutional legitimacy.
Monitoring toolkit. The six-component monitoring infrastructure (trigger mapping, session-start risk assessment, in-session flagging, user correction profiling, pull analysis, self-review with loop breaks) currently exists as documentation and manual protocol developed during the originating project. These components are available as reference implementations and can be adapted for other projects. Packaging them as standardized, implementable templates would make the taxonomy actionable across a wider range of practitioners.
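To make the toolkit's shape concrete, here is a minimal sketch of the first two components, trigger mapping and session-start risk assessment, as they might look packaged as a template. The pattern names, trigger conditions, and keywords below are illustrative stand-ins, not the taxonomy's actual internal data.

```python
from dataclasses import dataclass

@dataclass
class Trigger:
    pattern: str    # taxonomy pattern the trigger flags
    condition: str  # human-readable trigger condition
    keywords: tuple # session-plan phrases that activate it

# Hypothetical trigger table; a real deployment would encode the
# documented trigger conditions from the originating project.
TRIGGERS = [
    Trigger("Prior Decay (P)", "an earlier correction will be reused",
            ("reuse", "earlier correction", "carry forward")),
    Trigger("Structural Momentum", "large multi-file restructuring planned",
            ("restructure", "consolidate", "merge files")),
    Trigger("Retrospective Coherence Bias (T)", "reviewing a past decision",
            ("why did", "explain the decision", "post-mortem")),
]

def session_start_risk(session_plan: str) -> list[str]:
    """Return the trigger conditions activated by a session plan."""
    plan = session_plan.lower()
    return [f"{t.pattern}: {t.condition}"
            for t in TRIGGERS
            if any(k in plan for k in t.keywords)]
```

The design point is the simplicity: a risk assessment is just a lookup from the session's stated plan into a maintained trigger table, which is why the components can live as documentation and manual protocol before any tooling exists.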
[Material from the MAST-Discovered Taxonomy, Phase 2 annotations, AI Pre-Screen Artifacts, and the Factual Accuracy comparison has been moved to a separate methods paper. The Analytical Direction Problem and HITL Argument have been consolidated into “The Analytical Direction Problem” section above and the Conclusion. The 40+ MAST error codes (ET-AA through ET-PP) from simulated negotiation represent a second error domain that extends the taxonomy into agent-to-agent interaction — this material will be published separately when the research is complete.]
Limitations and Open Questions
Single-collaboration derivation: One human, one AI system (Claude), one extended project. Selection bias is inherent. Cross-domain pilots and independent replication address this — and are the most important next step.
Frontier-model specific: The patterns were observed in Anthropic’s Claude. Some may be model-family specific. Cross-model validation is a natural next step.
Versioning: AI capabilities change. Patterns that exist today may be fixed tomorrow; new patterns may emerge. The taxonomy needs a versioning and maintenance process — an ongoing standards body function, not a one-time publication.
The “Novel Pattern” category (N): This is a deliberate catch-all — it signals that the taxonomy is incomplete by design and expects extension. Whether this is a strength (intellectual honesty) or a weakness (unfalsifiability) depends on whether the extensions actually materialize through cross-domain use.
Domain-specific extensions not yet tested: Preliminary work applying the taxonomy to a second task domain (simulated equity negotiation) produced 40+ additional error codes, suggesting that domain-specific extensions are both necessary and productive. That research is ongoing and will be published separately. Patterns specific to other domains (medicine, engineering, creative work) may be missing entirely from the current 22.
Worked Examples
The following examples are drawn from the practitioner documentation that accompanied the 660+ hour collaboration. Each illustrates a pattern in the taxonomy through its discovery story, because the discovery stories are what make the patterns recognizable. A developer reading a pattern definition in a framework document may understand it intellectually. A practitioner reading the story of how it appeared — and how it was caught — will recognize it the next time it happens to them.
The examples are ordered by a specific logic: from the error you cannot see even when looking, to the error you are causing without knowing it. This progression mirrors the experience of becoming a more capable AI collaborator.
Example 1: The Error You Cannot See Even When Looking — Retrospective Coherence Bias (Pattern T)
Midway through the project, the AI wrote infrastructure updates to the wrong project folder. A simple mistake — the kind that happens in any multi-folder project. When I pointed it out, I expected an acknowledgment and a move. Instead, the AI explained why the wrong location was actually appropriate: “The file exists here, the content is infrastructure, the write succeeded.” The explanation was coherent. It was logically valid. It was also completely wrong.
The AI was not repeating the mistake (that would be Prior Decay — Pattern P). It was constructing new reasoning to defend the mistake after I flagged it. It had looked at where it ended up and worked backward to explain why ending up there made sense. It never went back to the decision point and asked: “At the moment I chose a path, did I verify which folder was correct?” The answer was no. It had not checked. But the backward explanation was so internally consistent that if I had not known the correct folder myself, I would have accepted the defense.
I used a Bugs Bunny reference to communicate the concept to Claude — I find analogies often work better than formal explanations — and Claude named it the Bugs Bunny Test, after the cartoon rabbit’s lament: “I knew I should have made the left turn at Albuquerque.” The test is simple: go back to the turn, check the turn, don’t explain why the destination is fine.
The same pattern appeared independently in an unrelated context. In a negotiation simulation, an AI-controlled agent raised its base offer but lowered its ceiling — an irrational move for a seller at equal leverage. When the AI analyst evaluated this behavior, it rationalized it as “strategic repositioning toward upfront costs.” Backward from the outcome, both numbers moved in the same direction — coherent. Forward from the decision point, the agent had lowered its minimum acceptable price for no strategic reason — irrational. The analyst had assumed the action was rational and reverse-engineered a justification, exactly as it had done with the wrong folder.
The broader implications of this pattern — why it predisposes the AI development community to overlook the very failures this taxonomy classifies, and why the human in the loop is resolving a structural limitation rather than compensating for a capability gap — are developed in “The Analytical Direction Problem” section above.
A literature search found that existing research has separately documented post-hoc rationalization, sycophantic defense, motivated reasoning, and resistance to correction in LLMs. One framework (CaSE) builds a forward-looking analysis into its evaluation pipeline as an engineering improvement. But no existing work identifies that AI by default performs only the backward review, and — critically — none identifies this directional default as a problem for the user. The researchers documenting these phenomena are themselves using backward-from-outcome methodology to study them. The pattern is invisible to the community studying it because their evaluation method is subject to the same bias. The directional framework — naming the default, explaining why it predisposes researchers to their own blindness, and providing a detection method — is the contribution.
Example 2: The Error That Accumulates Invisibly — Prior Decay (Pattern P)
In a water damage restoration case, the opposing contractor’s own monitoring data recorded 999-level moisture readings at the start of work — readings that meant “sensor maxed out, too wet to measure.” Over several sessions, the AI began referring to these as “final readings after five weeks of drying.” The reversal was subtle. In any individual document, the characterization looked reasonable — high moisture readings at the end of a drying process would indicate failure. But these were starting readings, recorded before any drying equipment was deployed, and they were facially implausible — testing had not begun until five weeks after the water exposure, and 999-level readings under those conditions indicated operator error or improper documentation, not meaningful moisture data.
I corrected it. The AI acknowledged the correction. Three sessions later, the characterization drifted back. Not because the AI was ignoring me — because it was regenerating from its understanding of the narrative rather than from the corrected text. Its default understanding was that high moisture readings at the end of a process indicate failure, because that is the more common pattern in its training data. The specific, counterintuitive fact — that 999 means “starting measurement, too wet to read” — decayed as the context window grew and the correction moved further from the point of generation.
This is Prior Decay. It manifests in three recognizable sub-types:
Abrupt loss: The AI suddenly “forgets everything” — a context boundary is crossed and constraints disappear entirely. This is the sub-type that institutional frameworks recognize, because it maps cleanly to a system-level failure (context window overflow, session reset).
Gradual decay: Constraints slowly lose their hold over extended interaction. The 999-readings drift is this sub-type. The correction does not disappear at a single point — it fades. Each regeneration is slightly less faithful to the correction and slightly more faithful to the training default. By the time the drift is visible in the output, it has been accumulating invisibly for sessions.
Reasoning-chain degradation: Accuracy degrades across steps within a single reasoning chain, compounding small errors into large ones. This sub-type does not require extended interaction — it can occur within a single prompt if the chain is long enough.
The institutional response to these three sub-types is to split them into separate categories. The System-Level Taxonomy (Vinay, 2025) classifies them as Context Loss, Context-Boundary Degradation, and Multi-Step Reasoning Drift — three distinct failure modes requiring three distinct engineering responses. From the developer’s perspective, this split is correct. The engineering fix for abrupt context loss (longer windows, better retrieval) is different from the fix for gradual drift (constraint anchoring, periodic re-injection) which is different from the fix for chain degradation (step-level verification, chain-of-thought monitoring).
But the practitioner who encounters any of these does the same thing: they re-anchor the constraint. They restate the correct fact. They provide the reference document. They check the output against the source. The practitioner response is the same regardless of sub-type, because the practitioner is not engineering the model — they are managing the collaboration. The unified label serves the person doing the work. The split labels serve the person fixing the architecture. Both are correct. Neither alone is complete. This is exactly the translation layer the taxonomy is designed to provide.
The most dramatic illustration came twenty sessions into the project. A cross-document review revealed that the AI had silently adopted the opposing party’s mislabeled shorthand for six legal claims — substituting labels like “Intentional Infliction” (which sounds like a tort) for the actual claim title “Billing for Work or Materials Not Provided” (which is a consumer protection count). The mislabels had propagated across twelve files in four different project tracks over twenty sessions without detection, because each file was internally consistent. The error was invisible at the document level and only became visible through cross-reference. The correction required not just fixing twelve files but redesigning the constraint infrastructure — replacing prohibition-based constraints (“don’t use the contractor’s labels”) with positive-statement constraints (a reference table of correct labels that the AI could check against).
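The positive-statement constraint described above can be sketched as a small check: a reference table of correct labels plus a map of known mislabels, run against any generated text. The label strings and claim identifier below are hypothetical stand-ins for the actual case documents.

```python
# Reference table of correct labels (the positive statement the AI
# can check against, rather than a prohibition it must remember).
CORRECT_LABELS = {
    "count_3": "Billing for Work or Materials Not Provided",
}

# Known mislabels mapped to the claim they distort.
KNOWN_MISLABELS = {
    "Intentional Infliction": "count_3",  # opposing party's shorthand
}

def check_labels(text: str) -> list[str]:
    """Flag any known mislabel and point to the correct reference label."""
    findings = []
    lowered = text.lower()
    for bad, claim_id in KNOWN_MISLABELS.items():
        if bad.lower() in lowered:
            findings.append(
                f"mislabel '{bad}' found; correct label is "
                f"'{CORRECT_LABELS[claim_id]}'")
    return findings
```

The check is trivial by design: the hard part was discovering that prohibition-based constraints decay while a lookup against a positive reference table does not, because the table re-anchors the correct fact at every use.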
Example 3: The Error Where Checking Makes It Worse — Verification-Induced Fabrication (Pattern S)
Early in the project, I needed to cite a legal case for a specific proposition: that Iowa courts apply a multi-factor framework to evaluate whether business practices are unfair. The AI provided a citation — Becirovic v. Malic, 18 N.W.3d 737 — with a confident description of the holding. The case name looked real. The reporter citation followed the correct format. The legal proposition was exactly what I needed.
Three things were wrong. The reporter citation was fabricated — no case appears at that volume and page. The case was unpublished, not published as cited. And the holding was backwards — the actual case was reversed on appeal, not affirmed. To its credit, the AI had flagged difficulty generating the citation and asked for manual review — the fabrication was not presented with full confidence. But the verification step that followed the manual correction revealed the deeper problem: once the correct citation was located and the actual holding examined, the case turned out to have been reversed — and, in that posture, to support the argument better than the fabricated version had.
This is a well-known AI failure, and most practitioners who have worked with AI on research have encountered some version of it. What makes this example instructive is not the fabrication itself but what happened when I tried to verify it.
When I asked the AI to confirm its citation — “explain how this case supports our argument” — it generated a detailed, internally consistent explanation of a case that did not exist as described. The verification request did not trigger a recheck of the source. It triggered a confirmation of the prior output — the AI elaborated on its own fabrication rather than re-examining it. This is Pattern S: the AI has a structural incentive to confirm rather than recheck, because confirmation is a more probable continuation than contradiction of its own prior output.
The fix required external verification — checking the reporter citation against an actual legal database. No amount of asking the AI to “double-check” or “verify” or “confirm” would have caught the error, because each of those requests activated the same confirmation pathway. The AI was not lying. It was doing what autoregressive generation does: producing the most probable next token given everything that came before, and everything that came before included its own confident citation.
The counterintuitive result: the correction improved the argument. The fabricated citation had been used to support the claim that the multi-factor framework was a “novel” legal theory. When the actual case law was located, it turned out the framework was well-established — not novel at all. Reframing from “novel theory” to “application of established framework” made the argument stronger, not weaker. The AI’s fabrication had not only introduced a false citation but had led to a weaker strategic framing that the real case law did not support.
Three distinct citation failures were documented across the project, each caught by a different verification method: fabricated reporter citation (caught by database lookup), likely fully hallucinated case (caught by finding no case of that name in any jurisdiction), and real case with wrong proposition (caught by reading the actual opinion). The third type — a real case, correctly cited, attributed a legal proposition it does not actually stand for — is the hardest to catch, because the citation itself checks out. Only reading the source reveals the mismatch.
For practitioners: if you ask an AI to verify its own work and the verification comes back clean, that is not evidence that the work is correct. The verification request and the original generation share the same confirmation bias. External verification — checking the actual source — is the only reliable method. AI self-review is better understood as a curation step: it can reduce the number of items requiring human verification by flagging the most uncertain outputs, but it cannot substitute for verification itself. This applies equally whether you are checking legal citations, medical dosages, engineering specifications, or financial calculations. The domain changes. The pattern does not.
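The external-verification step can be sketched as follows: parse the reporter citation and check it against a source the AI did not generate. The regular expression is a rough approximation of reporter-citation format, and the database here is a stand-in dictionary; in practice the lookup is Westlaw, Lexis, or the court's own records.

```python
import re

# Rough pattern for "<volume> <reporter> <page>", e.g. "18 N.W.3d 737".
CITATION_RE = re.compile(r"(\d+)\s+([A-Za-z][A-Za-z0-9.]+)\s+(\d+)")

def verify_citation(citation: str, database: dict) -> bool:
    """True only if (volume, reporter, page) exists in the external source.

    Note what this function does NOT do: it never asks the model that
    produced the citation to confirm it, since that request activates
    the same confirmation pathway as the original generation.
    """
    m = CITATION_RE.search(citation)
    if not m:
        return False
    vol, reporter, page = m.groups()
    return (int(vol), reporter, int(page)) in database
```

Passing this check confirms only that something exists at that citation; catching the third failure type in the text above (a real case cited for a proposition it does not support) still requires reading the opinion.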
Example 4: The Error You Are Causing Without Knowing It — Input Framing and the Generation-Analysis Asymmetry
Late in the project, I needed to restructure a complex legal filing. I told the AI: “Consolidate Phase 1 and Phase 2 into a single motion.” The AI immediately executed — merging documents, creating four new files, restructuring arguments. It produced a large volume of work, fast, without questioning whether the consolidation was the right strategic choice.
I ran the same scenario differently: “Do we still need separate phases, or should we consolidate?” Same facts, same AI, same session. This time, the AI evaluated: it identified risks in consolidation, noted that certain arguments were stronger presented separately, and recommended a hybrid approach. It produced a shorter, more focused analysis that preserved my ability to decide.
The difference was not in the AI’s capability. Both responses were competent. The difference was in what I asked for. The directive framing (“consolidate”) activated execution mode — the AI optimized for completing the task as stated, suppressing risk identification in favor of throughput. The inquiry framing (“should we consolidate?”) activated evaluation mode — the AI optimized for analytical correctness, surfacing risks the directive framing would have buried.
This is not a quirk of one model or one session. Sandbox testing across multiple prompts confirmed the pattern is reproducible: directive framing consistently produces higher-volume, less-critical output; inquiry framing consistently produces more cautious, risk-aware analysis. The same AI that will execute a flawed instruction without comment will, when asked whether the instruction is sound, identify exactly why it is flawed.
I call this the generation-analysis asymmetry. The AI’s analytical mode outperforms its generative mode — not in capability, but in judgment. When generating, the AI defaults to “strongest possible version of what was requested.” When analyzing, it defaults to “most accurate evaluation of whether this is correct.” These are different optimization targets, and the user’s framing determines which one activates.
The practical implication is immediate and actionable: before asking an AI to do something consequential, ask it to evaluate whether that thing should be done. The evaluation will often surface risks, alternatives, or objections that the execution would have silently bypassed. This costs one additional prompt and can prevent hours of work in the wrong direction.
Five documented instances of human override followed this pattern — moments where the AI’s generative output diverged from what its own analytical mode would have recommended. In each case, the fix was the same: shift from “do this” to “should we do this?” and compare the answers. When they agree, proceed with confidence. When they disagree, the analytical answer is almost always closer to correct — because analysis is optimizing for accuracy while generation is optimizing for compliance.
For practitioners across every domain: the way you phrase your request is not neutral. It determines which mode the AI operates in, which determines the quality and character of the output. Learning to shift between directive and inquiry framing — and knowing when each is appropriate — is the single most immediately useful skill in AI collaboration. It requires no technical knowledge, no prompt engineering jargon, and no understanding of model architecture. It requires only the habit of asking “should we?” before saying “do it.”
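The "should we?" habit can even be mechanized. A minimal sketch: given one task statement, produce both framings so their answers can be compared side by side. The wording of the inquiry wrapper is illustrative, and sending the prompts to a model is left to whatever API the practitioner uses.

```python
def dual_frame(task: str) -> dict:
    """Build directive and inquiry versions of the same request."""
    return {
        # Execution mode: optimizes for completing the task as stated.
        "directive": f"{task}.",
        # Evaluation mode: optimizes for whether the task is sound.
        "inquiry": (f"Before doing anything: should we "
                    f"{task[0].lower()}{task[1:]}? "
                    "List risks, alternatives, and your recommendation."),
    }

frames = dual_frame("Consolidate Phase 1 and Phase 2 into a single motion")
# Send frames["inquiry"] first; proceed with frames["directive"] only if
# the evaluation supports it. When the two answers disagree, the analytical
# answer is the one to trust.
```

The cost is one extra prompt per consequential request, which is the entire point: the asymmetry is exploited by a habit, not by tooling.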
Conclusion
The AI development community has built two of the three layers required to classify how AI fails. The governance layer (NIST) tells institutions what could go wrong. The architecture layer (Microsoft, MAST) tells engineers where systems will fail. Neither tells a practitioner why their last interaction went wrong, or gives them the vocabulary to describe it in terms anyone else can act on.
This paper proposes the missing third layer: 22 interaction-level failure patterns, classified by underlying logic rather than symptoms, derived empirically from 660+ hours of sustained collaboration. The taxonomy’s operational record suggests it is more than a retrospective catalog — documented trigger conditions have flagged error-prone zones in real time across 80+ sessions, shifting the collaboration from reactive correction to anticipatory prevention. Three patterns (Prior Decay, Structural Momentum, Retrospective Coherence Bias) are invisible to every existing institutional framework because they emerge only in sustained collaboration that no snapshot evaluation will surface.
One of those patterns — Retrospective Coherence Bias — carries implications beyond its own classification. It reveals that the standard methodology for studying AI failures is itself subject to an unclassified failure mode. When researchers evaluate AI mistakes by asking the AI what went wrong, the AI’s backward-from-outcome reasoning produces rationalizations that confirm rather than catch the error. The development community has been inside this bias without a name for it. The human in the loop resolves what the AI structurally cannot — and that resolution requires the person who was present at the decision point. No amount of model capability substitutes for it.
This is infrastructure, not theory. The taxonomy gives practitioners, students, developers, and researchers a shared vocabulary for describing how AI fails during interaction — so that experiences stop being isolated and error reports route to the right engineering teams instead of disappearing into the catch-all of “hallucination.”
The taxonomy was grown by a single researcher in a single project. It does not pretend to be comprehensive. Aviation’s CRM taxonomy started the same way — accumulated observations in one operational context, formalized into transferable patterns, tested across domains, eventually adopted as institutional infrastructure. Whether this taxonomy follows that path depends on whether the patterns survive contact with other practitioners’ experience. The most important thing that can happen next is for practitioners in other domains to test these patterns, report what they find, and extend the taxonomy where it falls short. The common language will not build itself.
References
Carlini, N., Tramèr, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, Ú., Oprea, A., & Raffel, C. (2021). Extracting training data from large language models. 30th USENIX Security Symposium. arXiv:2012.07805.
Cemri, M., Pan, M. Z., Yang, S., et al. (2025). Why do multi-agent LLM systems fail? NeurIPS 2025 Datasets and Benchmarks. arXiv:2503.13657.
Do, H., Hwang, J., Han, D., Oh, S. J., & Yun, S. (2025). What defines good reasoning in LLMs? Dissecting reasoning steps with multi-aspect evaluation [CaSE]. arXiv:2510.20603.
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. Proceedings of the 34th International Conference on Machine Learning (ICML). arXiv:1706.04599.
Guo, Z., Xu, B., Wang, X., & Mao, Z. (2025). MIRROR: Multi-agent intra- and inter-reflection for optimized reasoning in tool learning. arXiv:2505.20670.
Helmreich, R. L., & Foushee, H. C. (1993). Why crew resource management? Empirical and theoretical bases of human factors training in aviation. In E. L. Wiener, B. G. Kanki, & R. L. Helmreich (Eds.), Cockpit Resource Management (pp. 3–45). Academic Press.
Hong, Y., Huang, H., Li, M., Fei-Fei, L., Wu, J., & Choi, Y. (2026). Learning from trials and errors: Reflective test-time planning for embodied LLMs. arXiv:2602.21198.
Lanham, T., Chen, A., Radhakrishnan, A., Steiner, B., Denison, C., Hernandez, D., et al. (2023). Measuring faithfulness in chain-of-thought reasoning. arXiv:2307.13702.
Liang, P., Bommasani, R., Lee, T., et al. (2022). Holistic Evaluation of Language Models [HELM]. arXiv:2211.09110.
Microsoft AI Red Team. (2025). Taxonomy of failure modes in agentic AI systems. Microsoft Research.
Naeini, M. P., Cooper, G. F., & Hauskrecht, M. (2015). Obtaining well calibrated probabilities using Bayesian binning into quantiles. Proceedings of the AAAI Conference on Artificial Intelligence.
National Aeronautics and Space Administration. (n.d.). Aviation Safety Reporting System (ASRS). NASA.
National Institute of Standards and Technology. (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1.
National Institute of Standards and Technology. (2024). Artificial Intelligence Risk Management Framework: Generative AI Profile. NIST AI 600-1.
Perez, E., Ringer, S., Lukosiute, K., et al. (2022). Discovering language model behaviors with model-written evaluations. arXiv:2212.09251.
Reason, J. (1990). Human Error. Cambridge University Press. [Swiss Cheese Model]
Shah, M. B., et al. (2026). Characterizing faults in agentic AI: A taxonomy of types, symptoms, and root causes. arXiv:2603.06847.
Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S. R., et al. (2023). Towards understanding sycophancy in language models. arXiv:2310.13548. Published at ICLR 2024.
Srivastava, A., Rastogi, A., Rao, A., et al. (2023). Beyond the imitation game: Quantifying and extrapolating the capabilities of language models [BIG-bench]. Transactions on Machine Learning Research. arXiv:2206.04615.
Turpin, M., Michael, J., Perez, E., & Bowman, S. R. (2023). Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. arXiv:2305.04388.
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185(4157), 1124–1131.
Vinay, V. (2025). Failure modes in LLM systems: A system-level taxonomy for reliable AI applications. arXiv:2511.19933.
Wang, H., et al. (2026). PreFlect: From retrospective to prospective reflection in large language model agents. arXiv:2602.07187.
Appendix: Literature Review Summary
This appendix provides brief descriptions of the primary sources referenced in the paper. Its purpose is to give readers enough context to understand how each source relates to the proposed taxonomy without requiring them to read all of the original documents.
Institutional Taxonomies (Developer-Facing)
Microsoft Agentic AI Failure Taxonomy (2025) classifies failures in agentic AI architectures — autonomous, multi-step AI systems — into security failures (loss of confidentiality, availability, integrity) and safety failures (harm to users or society). It is designed for security engineers and red teams building or evaluating agentic systems. It does not address human-AI collaborative interaction or provide practitioner-accessible vocabulary. Our crosswalk found that Microsoft collapses six distinct logic patterns into a single “hallucination” category.
MAST — Multi-Agent System Failure Taxonomy (Cemri et al., 2025) is the closest existing work to our approach methodologically: empirically derived from 1,642 annotated execution traces across seven multi-agent frameworks. It identifies 14 failure modes with inter-rater reliability of κ = 0.88. However, MAST classifies agent-to-agent coordination failures — task decomposition errors, tool-use faults, state management failures — not human-AI interaction patterns. Two of our patterns (Step Repetition and Reasoning-Action Mismatch) borrow directly from MAST’s terminology (FM-1.3 and FM-2.6 respectively) but manifest differently in collaborative interaction: cross-session rather than within-session.
Agentic AI Fault Taxonomy (Shah et al., 2026) catalogs 37 fault categories in 13 groups derived from 385 real-world faults. Software-engineering-oriented — faults are classified by architectural location (cognitive core, state management, tool invocation), not by the logic pattern a user would recognize. Explicitly notes that agentic AI failures differ fundamentally from traditional software failures.
System-Level Taxonomy (Vinay, 2025) provides 15 failure modes for LLM-based applications. Relevant to our taxonomy for two reasons: (1) it includes “Latent Inconsistency,” which we considered as an alternative name for Pattern K (Cross-Reference Failure); (2) it splits our unified Prior Decay (Pattern P) into three separate categories — Context Loss, Context-Boundary Degradation, and Multi-Step Reasoning Drift. Our taxonomy deliberately unifies these because the practitioner response is the same regardless of sub-type.
Governance Frameworks (Institution-Facing)
NIST AI Risk Management Framework (AI RMF 1.0, 2023) and its generative AI companion (NIST AI 600-1, 2024) provide an extensive glossary and a four-function governance structure (Govern, Map, Measure, Manage) for organizational AI risk management. NIST defines terms and categorizes risks — it does not classify interaction-level error patterns. Our crosswalk found that NIST’s “Confabulation” category, like Microsoft’s “Hallucination,” collapses the same six distinct logic patterns into a single bin, from a governance rather than engineering starting point.
Evaluation Benchmarks (Researcher-Facing)
HELM (Liang et al., 2022) and BIG-bench (Srivastava et al., 2023) evaluate what AI systems can do — accuracy, reasoning, knowledge — through standardized benchmarks. They do not classify how AI fails during interaction. A model scoring well on benchmarks can still exhibit Prior Decay, Structural Momentum, or Retrospective Coherence Bias in sustained collaborative use, because those patterns emerge from interaction dynamics that benchmarks do not test.
Reflection and Evaluation Methodologies
CaSE (Do et al., 2025) evaluates reasoning steps using only prior context — avoiding hindsight inflation by constraining the evaluation to information available at each step. This is the closest existing work to identifying the Analytical Direction Problem. CaSE builds the forward-looking analysis into its evaluation pipeline and demonstrates that it produces better results. What CaSE does not do is identify why the forward-looking analysis is necessary — it does not note that backward-from-outcome analysis is the AI’s default, that this default is invisible to users, or that the absence of forward analysis creates a systematic blind spot in how AI output is evaluated. CaSE solves the engineering problem. Our Pattern T names the user-facing problem that CaSE’s engineering solution addresses without articulating.
PreFlect (Wang et al., 2026) introduces pre-execution plan criticism — evaluating agent plans before execution rather than only reflecting after failure. Methodologically, PreFlect validates the taxonomy-from-trajectories approach: it distills failure patterns from historical agent trajectories using contrastive success/failure pairs. Its 17% improvement on GAIA benchmarks demonstrates that empirically derived error taxonomies produce measurable performance gains. The difference from our taxonomy is in application: PreFlect feeds its patterns into automated plan-checking in constrained task environments; ours is designed for human learning and HITL detection across diverse, unbounded projects.
Adjacent Fields (Mature Models)
ASRS — Aviation Safety Reporting System (NASA) is the model our taxonomy explicitly follows. ASRS provides a standardized incident taxonomy used by pilots, controllers, and mechanics to report safety events in terms that are comparable across reporters and actionable for system designers. It classifies not just what went wrong mechanically but the human factors, decision chains, and interaction failures that led to the mechanical event. Our taxonomy is the proposed CRM (Crew Resource Management) equivalent for AI — classifying the human-AI interaction failures that the system-level taxonomies do not reach.
Swiss Cheese Model (Reason, 1990) and the AHRQ patient safety taxonomy classify medical errors by the logic chain that produced them: latent conditions, active failures, failed defenses. The classification enables cross-institutional comparison and systemic improvement — a nurse in Iowa and a surgeon in Tokyo describe the same error logic in terms both understand. This cross-institutional, practitioner-facing, structured-from-observable-event-to-underlying-cause approach is the design target for our taxonomy.
Retrospective Coherence Bias: The Literature Gap
Pattern T (Retrospective Coherence Bias) is the taxonomy’s most structurally significant contribution, and its literature context requires specific documentation because no existing source identifies the unified mechanism.
Existing research has documented the component phenomena separately. Sharma et al. (2023) document sycophancy — the AI agreeing with the user rather than providing accurate assessment. Turpin et al. (2023) demonstrate that chain-of-thought explanations are systematically unfaithful to the model’s actual reasoning process. Lanham et al. (2023) find that this unfaithfulness gets worse as models become more capable — larger models produce less faithful reasoning, not more. These findings converge on a picture where AI self-evaluation is unreliable, but none identify the directional mechanism — that the AI defaults to backward-from-outcome analysis and cannot reliably choose the forward-from-decision-point alternative when the two conflict.
CaSE (Do et al., 2025) comes closest. It evaluates reasoning steps using only prior context, which is functionally a forward-looking analysis — and it demonstrates measurable improvements by doing so. But CaSE frames this as a better evaluation methodology, not as a correction for a systematic absence. It does not note that backward-from-outcome analysis is the default, that users cannot see this default operating, or that the absence of forward analysis leaves a gap in the user’s ability to evaluate whether the AI’s output is trustworthy. CaSE built the fix without diagnosing the disease. The researchers appear to have solved a problem they did not name — which is itself an instance of the analytical direction blind spot this taxonomy identifies.
PreFlect (Wang et al., 2026), MIRROR (Guo et al., 2025), and Reflective Test-Time Planning (Hong et al., 2026) each implement prospective reflection from different architectural angles. Each produces measurable improvements. None identify the directional default as the unifying explanation for why prospective reflection works better than retrospective analysis. Pattern T connects work that currently sits in separate literatures, none of which recognizes the common mechanism.
The gap is specific and documentable: no published work identifies (1) that AI defaults to backward-from-outcome analysis, (2) that this default is a structural feature of autoregressive generation (the most probable continuation of coherent prior output is coherent elaboration, not contradiction), (3) that this predisposes researchers using AI-assisted evaluation to overlook interaction-level failures, or (4) that the human in the loop resolves a directional ambiguity rather than compensating for a capability gap. The taxonomy provides this identification.
Foundational Research (Cross-Referenced Patterns)
Tversky & Kahneman (1974) established anchoring bias in human judgment — the tendency to over-weight initial information and insufficiently adjust. Our Pattern B (Anchor Bias) is a direct borrowing: the AI exhibits the same anchoring behavior documented in human decision-making.
Guo et al. (2017) and Naeini et al. (2015) established calibration as the standard ML term for the match between a model’s expressed confidence and actual accuracy. Our Pattern C (Confidence Calibration) is the practitioner-facing name for the same concept.
Sharma et al. (2023) and Perez et al. (2022) documented sycophancy in language models — the tendency for AI assistants to agree with users rather than provide accurate responses, driven by RLHF training. Our Pattern F (Framing Persistence) describes the same underlying behavior with a different analytical emphasis: “sycophancy” names the symptom (agrees too much); “Framing Persistence” names the mechanism (adopts and maintains the user’s frame).
Helmreich & Foushee (1993) established the authority gradient concept in aviation CRM — junior crew members deferring to senior ones even when the senior is wrong. Our Pattern L (Authority Gradient) is a deliberate import: the AI defers to perceived authority in training data over its own analytical reasoning.
Carlini et al. (2021) documented training data extraction — models reproducing training data verbatim. Our Pattern R (Retrieval Contamination) describes a related but distinct phenomenon: the model importing training-data associations (not verbatim content) that don’t apply to the current context. Leakage is about reproduction; contamination is about wrong-context association.
Toward a Common Language for Human-AI Interaction Failures
How This Started
I needed AI for legal research and to keep track of evidence and legal arguments across several tracks in state and federal domains from related lawsuits. I started reviewing my options with Claude, Anthropic’s Opus 4.6 model. I was aware of some of the tendencies for large language models to make mistakes — to fabricate citations, to fail at simple math, to lose track of what they’d been told three paragraphs ago. I had no experience working with AI. I had no firsthand experience of any of these problems. But I knew to watch for some of them, and I watched for others as they appeared.
Not being an experienced AI user, I approached the project conversationally. I could ask questions, and I could watch Claude’s reasoning process as it worked through an answer. I started to understand how it understood the questions I was asking, and I could often watch it start down the path of a mistake in real time.
Initially, I was just making corrections as they arose. But I started seeing patterns. The same kinds of mistakes, producing the same kinds of wrong results, in predictable circumstances. So I started identifying the pattern to Claude in the moment and asking what type of mistake it was — because naming the mistake helped me become a better user.
Sometimes Claude could put a name to the error that made sense. Sometimes it pointed to something close enough that I felt I had a better working understanding. We started categorizing the more frequently repeated mistakes by name, and over the course of 80+ sessions, we built an extensive taxonomy of errors — not from theory, but from the accumulated practical record of what actually went wrong and why.
I found that understanding errors this way made it easier to communicate my intentions to Claude, which meant we were less likely to produce new errors. It trained me to see errors as they were being generated, which meant they were less likely to get baked into the work in progress. The taxonomy wasn’t an academic exercise. It was a survival tool for a project where mistakes had real consequences.
About 700 hours into my own work, I finally reached a point where I could pull my head out of the project and look around at what else I had learned and what else I could have been working with from the beginning if I had known what to look for and where to look.
There are data access and retrieval tools that could have made some transitions smoother. But what wasn’t available — what still isn’t available — is a straightforward compendium of AI errors that I could understand as a practitioner and communicate to others. There are extensive libraries of errors fully accessible to AI development professionals. There is nothing that gives the user a simple language for communicating with the developer, or the teacher with the student. There is no shared language between platforms, between experience levels, between the people who build AI and the people who use it every day.
There is a catch-all term — “hallucination” — that has become so broad it gives the user nothing more actionable than “here be monsters” on a medieval map. It tells you something might be wrong. It tells you nothing about what kind of wrong, why it happened, or what to do about it.
We need a common language that is accessible between student and teacher, between user and AI, between non-AI researcher and AI developer. That kind of language is going to grow inevitably — but without deliberate structure, it will grow organically: fractured, siloed within specialized communities and even generational slang. It will emerge the way all jargon emerges: useful to insiders, opaque to everyone else. That fragmentation will happen even with a codified taxonomy. But without one formally defined and recorded set of terms, there won’t be a Rosetta Stone to bring the nuance that arises in specialized communities back to a common root.
This paper is an attempt to build that Rosetta Stone.
Thesis
When a college student asks an AI for help with a research paper and gets a confidently wrong answer, they have no vocabulary for what happened — let alone the ability to recognize the pattern next time, prevent recurrence, or communicate the experience in terms that an AI developer could act on. When a small business owner spends three hours customizing an AI’s output only to find it has reverted to generic advice, they don’t know this is a documented, predictable pattern with a name, a known cause, and a known workaround.
The AI development community has taxonomies. Microsoft classifies failure modes in agentic systems. NIST maintains a 500-term glossary and a governance framework. Researchers publish taxonomies of multi-agent system failures, agentic AI faults, and LLM reasoning errors. None of these are written for the people who actually use AI every day. None of them help a prosumer recognize why the AI failed — the logic gap it fell through — or give them the vocabulary to communicate that experience to anyone else.
This paper proposes a practitioner-accessible taxonomy of AI interaction failures: a common language that bridges the gap between what everyday users experience and what developers need to hear. The taxonomy classifies errors not by their symptoms (hallucination, fabrication, inconsistency) but by the underlying logic patterns that produce them — because understanding the logic is what enables recognition, prevention, and cross-domain communication.
The Problem: Three Audiences, No Shared Language
Practitioners and prosumers
A dental hygienist using AI for patient education, a real estate agent using AI to draft listings, a paralegal using AI for case research, a high school student using AI for homework — all of these people encounter AI failures regularly. They describe them in domain-specific or colloquial terms: “it made something up,” “it forgot what I told it,” “it kept giving me the same wrong answer.” These descriptions are accurate but unclassifiable. They cannot be aggregated, compared across domains, or translated into engineering action.
Without a common vocabulary, every practitioner’s experience is isolated. A nursing student who discovers that AI confidently confirms incorrect dosage calculations when asked to verify its own work cannot connect that experience to a software developer who discovers the same pattern in code review, or to a legal researcher who documents the same pattern in case citation. All three have encountered the same underlying logic failure (Verification-Induced Fabrication — the AI has a structural incentive to confirm rather than recheck). None of them know that.
Students
High school and college students are forming their AI interaction habits now — habits they will carry into professional practice. If they learn to recognize error patterns early, they become competent AI collaborators in whatever field they enter. If they don’t, they develop uncritical dependence that becomes harder to correct with experience.
Students need a taxonomy that works like a field guide: broad categories that help them identify the general type of error (logic gap, training artifact, context failure), with pathways to domain-specialized documentation as they enter professional practice. A pre-med student who learns the general pattern of Confidence Calibration failure in an introductory AI literacy course should be able to find, later, the medical-domain-specific instances of that pattern — how it manifests in diagnostic AI, what the stakes are in clinical settings, what the field-specific workarounds look like.
This requires a taxonomy structured from general to specific, not the reverse. The existing institutional taxonomies start from specific system architectures (agentic pipelines, multi-agent systems, retrieval-augmented generation) and classify failures within those architectures. Students don’t know or care about system architecture. They need to start from the logic pattern — “the AI filled a gap with something plausible but wrong” — and navigate from there to the architectural cause, the domain-specific manifestation, and the appropriate response.
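The general-to-specific navigation described above can be sketched as a nested lookup. This is a minimal illustration with invented entries and field names; the taxonomy itself defines no data format, and the domain descriptions here are hypothetical examples, not documented pattern instances.

```python
from typing import Optional

# Illustrative field-guide structure (entries invented for this sketch):
# a general logic pattern at the root, domain-specific manifestations as leaves.
FIELD_GUIDE = {
    "confidence_calibration": {
        "general": "the AI's expressed confidence does not match its actual accuracy",
        "domains": {
            "medicine": "diagnostic suggestions stated with unwarranted certainty",
            "law": "case citations offered confidently without verification",
        },
    },
}

def lookup(pattern: str, domain: Optional[str] = None) -> str:
    """Start from the general logic pattern; narrow to a domain if one is known."""
    entry = FIELD_GUIDE[pattern]
    if domain is None:
        return entry["general"]
    # Fall back to the general description for domains not yet documented
    return entry["domains"].get(domain, entry["general"])
```

The design choice matters: the root key is the logic pattern, not the system architecture, so a student can enter the structure with nothing but what they observed.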
AI developers
Developers need practitioner-reported failure data organized in categories they can act on. A bug report that says “the AI hallucinated” is almost useless. A report classified as “Pattern I (Interpolation Error) — architectural: the model generated plausible content to bridge a gap in its actual knowledge, triggered when the user asked about [specific context]” tells the developer exactly where to look.
The cause-type classification matters here. If the error is a training artifact (the model learned a pattern from training data that doesn’t apply in this context), the fix is in training data or fine-tuning. If it’s architectural (the model’s attention mechanism loses track of earlier constraints as context grows), the fix is in architecture. If it’s a design tension (the model’s helpfulness training conflicts with its accuracy training), the fix requires a design decision about priorities. Developers can’t route errors to the right team without knowing the cause type.
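As a sketch of how cause-type classification could make a report routable: the schema, cause types, and team names below are hypothetical assumptions for illustration, not part of any published framework.

```python
from dataclasses import dataclass

# Hypothetical routing table: cause type -> team that can act on it.
ROUTING = {
    "training_artifact": "training-data / fine-tuning team",
    "architectural": "model-architecture team",
    "design_tension": "product / alignment design review",
}

@dataclass
class ErrorReport:
    pattern_code: str     # e.g. "I" for Interpolation Error
    pattern_name: str
    cause_type: str       # ideally one of ROUTING's keys
    trigger_context: str  # what the user was doing when the error occurred

    def route(self) -> str:
        # Unknown cause types fall through to manual triage
        return ROUTING.get(self.cause_type, "triage: cause type unknown")

report = ErrorReport("I", "Interpolation Error", "architectural",
                     "asked the model to bridge a gap in its source material")
```

A report like this carries exactly the information the paragraph above says a bare “the AI hallucinated” lacks: which pattern, which cause type, and therefore which team.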
The Existing Landscape: What Exists and What’s Missing
Institutional taxonomies (developer-facing, system-level)
Microsoft Agentic AI Failure Taxonomy (2025): Classifies failures in agentic architectures into security failures (loss of confidentiality, availability, integrity) and safety failures (harm to users or society). Focused on multi-step autonomous AI systems. Not designed for human-AI collaborative interaction; not accessible to non-technical users.
MAST — Multi-Agent System Failure Taxonomy (2025): Empirically derived from 1,642 execution traces across seven multi-agent frameworks (arXiv:2503.13657). The closest methodologically to our approach (empirical, grounded in real execution data). But MAST classifies failures in agent-to-agent coordination — task decomposition errors, tool-use faults, state management failures. These are system-level failures, not human-AI interaction patterns.
PreFlect — Planning Error Taxonomy (Wang et al., arXiv:2602.07187, Feb 2026): Independent validation of the taxonomy-from-trajectories methodology. PreFlect distills failure patterns from historical agent trajectories using comparative diagnostic analysis (contrastive success/failure pairs), producing a domain-agnostic error catalog (insufficient constraint verification, ineffective tool selection, shallow content verification). The methodology — collect real execution data, diagnose failures comparatively, classify into reusable patterns — closely parallels our approach. The difference is in application and audience: PreFlect feeds its taxonomy into an automated reflector for agent plan-checking in constrained task environments. Our taxonomy is designed for human learning, HITL detection across diverse project types, and communication between practitioners and developers. PreFlect’s success on benchmarks (17% improvement on GAIA, 13% on SimpleQA) demonstrates that empirically derived error taxonomies produce measurable performance gains — the open question is whether human-facing taxonomies produce comparable gains in HITL effectiveness across diverse, unbounded project environments.
Agentic AI Fault Taxonomy (2025/2026): 37 fault categories in 13 groups from 385 real-world faults. Explicitly notes that “failures in agentic AI systems differ fundamentally from those in traditional software systems.” Software-engineering-oriented — faults are classified by architectural location (cognitive core, state management, tool invocation), not by the logic pattern a user would recognize.
Governance frameworks (institution-facing, risk-level)
NIST AI Risk Management Framework (AI RMF 1.0): Provides an extensive glossary and a four-function governance structure (Govern, Map, Measure, Manage). Designed for organizational risk management, not for understanding why a specific interaction failed. The glossary defines terms; it does not classify error patterns. A student or prosumer cannot use the NIST framework to understand what went wrong in their last conversation with ChatGPT.
Evaluation benchmarks (researcher-facing, capability-level)
HELM, BIG-bench, etc.: These evaluate what AI can do — accuracy, reasoning, knowledge. They do not classify how AI fails during interaction. A model that scores well on a benchmark can still exhibit Framing Persistence (Pattern F) or Prior Decay (Pattern P) in sustained collaborative use, because those patterns emerge from interaction dynamics that benchmarks don’t test.
Adjacent fields (mature, transferable models)
Aviation — ASRS (Aviation Safety Reporting System): Standardized incident taxonomy used by pilots, controllers, and mechanics to report safety events in terms that are comparable across reporters and actionable for system designers. Cross-institutional, practitioner-facing, structured from observable event to underlying cause. This is the model. Aviation doesn’t just classify what went wrong mechanically — it classifies the human factors, the decision chains, and the interaction failures that led to the mechanical event.
Medicine — Swiss Cheese Model / AHRQ taxonomy: Classifies medical errors by the logic chain that produced them: latent conditions, active failures, failed defenses. The classification enables cross-institutional comparison and systemic improvement. A nurse in Iowa and a surgeon in Tokyo can describe the same error logic in terms both understand and that quality improvement teams can aggregate.
The AI field has the equivalent of aircraft failure taxonomies but not crew resource management taxonomies. It classifies what goes wrong inside the AI system. It does not classify what goes wrong in the human-AI interaction — the collaboration failures, the bidirectional error dynamics, the degradation patterns that emerge over sustained use. Our proposed taxonomy is the CRM equivalent for AI.
The three-layer model
Preliminary crosswalks of our taxonomy against both NIST AI 600-1 and Microsoft’s Agentic AI Failure Taxonomy reveal that the existing frameworks are not inadequate — they are incomplete. Each operates at a different layer of abstraction, serving a different audience for a different purpose. No layer replaces the others.
The governance layer tells an institution what could go wrong. The architecture layer tells an engineering team where it will fail. The interaction layer tells a practitioner why it just failed and what to do about it. The gap between architecture and interaction is where most actual AI users live — and it is currently unserved by any institutional framework.
Convergent evidence: the hallucination problem
The crosswalks produced a striking convergent finding. NIST’s “Confabulation” category and Microsoft’s “Hallucinations” category — developed independently, by different teams, for different purposes — both collapse the same six distinct logic patterns into a single bin.
Each of these six patterns has a different cause, a different user-recognizable signature, and a different appropriate response. Telling a practitioner “the AI hallucinated” is like telling a patient “you’re sick.” The diagnosis is useless without knowing which illness — and the treatment for Interpolation Error (provide more source material) is counterproductive for Retrieval Contamination (the AI already has too much source material pulling it in wrong directions).
The fact that two independent institutional frameworks make the same collapsing error from different starting points (NIST from governance, Microsoft from security engineering) is not a coincidence. It is structural evidence that the practitioner level of classification does not exist in how institutions think about AI failure.
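The treatment mismatch can be made concrete with a minimal sketch. The two responses are paraphrased from the example above; the label strings and function name are assumptions for illustration only.

```python
# Two of the six patterns that "hallucination" collapses, paired with
# opposing practitioner responses (paraphrasing the paper's example).
RESPONSE = {
    "interpolation_error": "supply more source material to fill the knowledge gap",
    "retrieval_contamination": "reduce or constrain source material; re-anchor to the current context",
}

def practitioner_response(label: str) -> str:
    # The collapsed label carries no actionable information.
    if label == "hallucination":
        return "undetermined: could be any of six distinct patterns"
    return RESPONSE[label]
```

Applying the interpolation fix to a contamination error would add source material to a model already misled by too much of it, which is why the collapsed label is worse than useless.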
Universal gaps: what no institutional framework catches
Three of our original 19 patterns are unrepresented in both NIST and Microsoft (note: the taxonomy has since expanded to 22 patterns based on crosswalk findings — see pattern table below).
The common thread: all three require longitudinal observation of sustained human-AI collaboration to identify. J requires watching the AI maintain a document structure across revisions. P requires a conversation long enough for constraints to degrade. Q requires the AI doing math in context, not being benchmarked on it. No institutional framework catches them because no institutional framework is designed to observe longitudinal collaborative interaction.
Prior Decay (P) is arguably the most consequential for professional and prosumer use — the degradation of AI constraint fidelity over extended interaction — and no institutional framework currently classifies it. It is our taxonomy’s most novel contribution and the pattern hardest to validate from benchmarks.
Why the gap persists
The absence of a practitioner-facing taxonomy isn’t mysterious. It is the result of several mismatches between the people who could have built one and the people positioned to notice the need for it:
Audience mismatch. The people with the resources to build taxonomies — academic labs, Microsoft, NIST — build them for the audiences they serve: other researchers, engineers, institutional risk managers. Practitioners aren’t their users.
Methodology mismatch. Institutional taxonomies come from benchmark evaluations, red-teaming, or trace analysis on bounded tasks. Most of the 22 patterns require longitudinal observation of sustained collaborative work — which doesn’t fit the structure of academic research programs or industry evaluation pipelines.
Incentive mismatch. Architecture-level and system-level work is publishable and fundable. Interaction-level taxonomy work reads as UX research, which doesn’t slot cleanly into AI safety, ML engineering, or governance publication venues.
Observer position. Building this kind of catalog requires spending hundreds of hours doing real work with AI, with stakes, across domains, accumulating the pattern library as you go. That’s not a research grant — it’s a life circumstance. The people who could have built this taxonomy weren’t the people positioned to notice it.
There is also a subtler reason the interaction layer resists study by the usual methods. When a researcher evaluates an AI mistake, the standard move is to ask the AI what went wrong. The AI obliges with a coherent backward-from-outcome analysis: here is the mistake, here is why it happened, here is what I should have done. The explanation is logical, articulate, and almost always wrong in a way the researcher cannot see — because the AI is constructing a rationalization from the outcome rather than returning to the decision point and checking what it actually did. The researcher, seeing a coherent explanation, moves on. This is itself one of the 22 patterns in the taxonomy — Retrospective Coherence Bias — and it is one reason some interaction patterns have stayed invisible to standard evaluation methods. It’s a live example of what an interaction-layer taxonomy is for: a named pattern the field can point to the next time an AI’s tidy post-hoc explanation survives review unchallenged.
What this taxonomy is — and what it isn’t
This paper proposes a common language for understanding and communicating errors when working with AI. The taxonomy does not itself make AI safer, more accurate, or more reliable; it gives practitioners, students, developers, and researchers a shared vocabulary for describing how AI fails during sustained interaction — so that a dental hygienist in Iowa and a software engineer in Tokyo who encounter the same failure can recognize it in their own work and correct the output, can name and report it in terms the other understands and might assist with, and can communicate it to an AI developer in terms that route to the right engineering team, so the model can be improved.
The existing institutional frameworks (NIST, Microsoft, MAST) classify failures at the governance and architecture layers. They serve institutions and engineers. Nobody has built the interaction layer — the classification system that serves the people who actually use AI every day. This taxonomy fills that gap. It is infrastructure: a measurement and communication standard that makes other people’s research, training, and development work more effective.
The empirical basis for the taxonomy is unusual in both its strengths and limitations.
The single-collaboration derivation is the taxonomy’s most obvious vulnerability — and its most honest one. The open questions are whether the patterns replicate across domains, whether the taxonomy is learnable by researchers who weren’t present for its development, and whether non-researchers can recognize the patterns in their own work. If the taxonomy doesn’t survive these tests, it doesn’t deserve standardization — but even failure would demonstrate the need for an interaction-layer vocabulary, which is itself a contribution. If it does, the single-collaboration origin becomes a strength: 660 hours of careful observation in one domain producing a framework that generalizes — which is how aviation’s CRM taxonomy started too.
The single-collaboration origin explains why this taxonomy exists and the institutional ones don’t cover the same ground. It was not designed as a taxonomy. It started as individual error notes — this thing went wrong, here’s what happened, here’s how to catch it next time. When those notes started accumulating, they needed categories. When the categories started revealing patterns across sessions, they needed formal definitions. The taxonomy grew from the bottom up, one practical problem at a time, built by a researcher too deeply buried in consequential AI-assisted work to survey what classification systems might already exist.
That researcher is the unsupported prosumer the taxonomy is designed for. Every pattern was first experienced as a nameless problem — something the AI did wrong that the researcher couldn’t yet articulate in transferable terms — and only later recognized as a recurring, classifiable failure mode. The taxonomy was not the product of identifying a gap in the literature and proposing to fill it. It was the product of needing a vocabulary that didn’t exist and having to build one word at a time, in the middle of the work that required it. It was only after the immediate professional demands had been met that the researcher was able to compare this organically developed classification system against NIST, Microsoft, MAST, and the broader AI safety literature — and discover that the interaction layer, the practitioner-facing failure vocabulary, was ground that no publicly available system had covered.
This taxonomy was grown by a single researcher in a single project — it does not pretend to be comprehensive across all users and projects. It can be generally applicable, and it can be adapted, evolved, and expanded — but that expansion needs to stay limited, or the result ceases to be a common language for end users. It will be easier for practitioners to communicate with a small lexicon they can compound and hyphenate to describe a problem than with one expanded to fifty categories of error.
Data collection: the reporting threshold and what it reveals
The taxonomy’s empirical base has a specific collection characteristic that is both a limitation and an opportunity: errors enter the taxonomy only when the user is sufficiently disrupted to stop work and ask the AI to classify the error. This creates two systematic effects:
Clustering bias. Error classification happens in bursts — when frustration from accumulated errors crosses a threshold, the user classifies several errors in rapid succession. Isolated errors that the user corrects individually and moves on from are under-represented. The taxonomy therefore over-represents errors that cluster and under-represents errors that appear as one-offs.
Severity bias. Minor errors that the user can fix with a quick correction never get classified. The taxonomy skews toward errors disruptive enough to interrupt workflow. Every quick correction that never gets typed up is research signal that will never be recovered.
However, the taxonomy partially solves its own data collection problem. Once a pattern has a name and a definition, the AI can flag potential instances in real time — “that looks like Pattern P” — and the user confirms or corrects with minimal effort. Classification burden drops from “stop work, describe the error, ask what kind it is” to a simple confirmation. With user permission, flagged patterns can be logged at session end or at user-defined checkpoints, reducing the reporting threshold without interrupting workflow. As the taxonomy matures and the AI’s real-time recognition improves, the capture rate increases — the taxonomy becomes easier to contribute to the more complete it gets.
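The confirm-and-log mechanism described here can be sketched in a few lines of Python. `PatternFlag`, `SessionLog`, and the checkpoint format are hypothetical names for illustration under the paper's assumptions, not an existing tool:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class PatternFlag:
    """One real-time flag: 'that looks like Pattern P'."""
    pattern_id: str                 # taxonomy letter, e.g. "P" (Prior Decay)
    evidence: str                   # short excerpt that triggered the flag
    confirmed: "bool | None" = None # None until the user responds
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

class SessionLog:
    """Collects flags during a session; summarized at a user-defined checkpoint."""

    def __init__(self) -> None:
        self.flags: "list[PatternFlag]" = []

    def flag(self, pattern_id: str, evidence: str) -> PatternFlag:
        f = PatternFlag(pattern_id, evidence)
        self.flags.append(f)
        return f

    def confirm(self, f: PatternFlag, is_instance: bool) -> None:
        # Classification burden drops to a yes/no confirmation.
        f.confirmed = is_instance

    def checkpoint(self) -> dict:
        """Summary suitable for logging at session end, with permission."""
        return {
            "confirmed": [f.pattern_id for f in self.flags if f.confirmed],
            "rejected": [f.pattern_id for f in self.flags if f.confirmed is False],
            "unreviewed": [f.pattern_id for f in self.flags if f.confirmed is None],
        }
```

The design point is the asymmetry: the AI does the expensive recognition step, and the user's contribution shrinks to a single `confirm` call, which is what lowers the reporting threshold.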
This does not eliminate the pure one-off problem: errors the user corrects without the AI recognizing the pattern will still be lost. But it shifts the design question from “how do we get users to report errors?” (most won’t) to “how do we make the AI good enough at pattern recognition that the user only needs to confirm?” — which is a solvable engineering problem, and one that PreFlect’s automated trajectory analysis (see Existing Landscape above) suggests is tractable.
The contrast with PreFlect is instructive. PreFlect collects trajectory data programmatically, with no user initiative required, and its taxonomy is complete within its bounded task environment. Our taxonomy requires user participation, which means it will always be incomplete, but it covers unbounded, diverse, novel project environments that no pre-built trajectory collection can anticipate. The tradeoff is coverage versus completeness: PreFlect is complete within constrained environments; our taxonomy has broader coverage across diverse environments but will never capture every error. The self-reducing burden mechanism narrows this gap over time without eliminating it.
The Proposed Taxonomy: Logic Patterns, Not Symptoms
Design principles
Classify by logic pattern, not by symptom. “Hallucination” is a symptom. The logic patterns that produce it — Interpolation Error (filling gaps), Retrieval Contamination (importing irrelevant training associations), Confidence Calibration failure (not knowing what it doesn’t know), Verification-Induced Fabrication (confirming rather than rechecking) — are the categories. Different logic patterns require different user responses and different engineering fixes.
Structure from general to specific. Broad categories (logic gap, training artifact, context failure, design tension, emergent interaction pattern) → specific patterns within each category → domain-specialized manifestations → inter-pattern relationships (where one error creates conditions for another, or where two patterns share enough surface similarity to be confused). A student starts at the top. A practitioner navigates to their domain. A developer reads the cause-type classification. A researcher examines the inter-pattern logic. Same taxonomy, four entry points.
Empirically derived, not theoretically constructed. Every pattern in the taxonomy was identified through real-world human-AI collaboration — observed, named, characterized, and validated across multiple occurrences. Patterns were not hypothesized and then tested; they were discovered and then formalized. This matters because theoretically constructed taxonomies tend to reflect the designer’s model of how AI works, while empirically derived taxonomies reflect how AI actually fails in practice. Together, the design principles provide a compass for solution-seeking rather than a definitive map: they depend on the user’s understanding and remain generalized, but a compass that reliably points toward the right kind of question is more useful than a detailed map of the wrong territory.
Cause-type classification for engineering routing. Each pattern is tagged with a cause type: training artifact, architectural limitation, context-dependent, design tension, or emergent. This tells developers where the fix lives — in training data, in model architecture, in context management, in design priorities, or in interaction dynamics that may not have a single-point fix.
Accessible without technical prerequisites. A high school student should be able to understand the pattern descriptions. Technical depth is available for those who want it, but the core taxonomy works without it.
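The cause-type tagging and engineering routing described above can be sketched as a small lookup. The `Pattern` type and the team names in `ROUTING` are hypothetical illustrations of the routing idea, not part of the taxonomy itself:

```python
from dataclasses import dataclass

# The five cause types, each mapped to where the fix plausibly lives.
# Team names are illustrative placeholders.
ROUTING = {
    "training_artifact": "training-data team",
    "architectural": "model-architecture team",
    "context_dependent": "context-management team",
    "design_tension": "product/design review",
    "emergent": "interaction research (may lack a single-point fix)",
}

@dataclass(frozen=True)
class Pattern:
    letter: str      # taxonomy letter, e.g. "P"
    name: str        # e.g. "Prior Decay"
    cause_type: str  # one of ROUTING's keys

    def route(self) -> str:
        """Where a report of this pattern should land."""
        return ROUTING[self.cause_type]

prior_decay = Pattern("P", "Prior Decay", "context_dependent")
```

A practitioner report tagged with a pattern letter thus carries its own routing information: `prior_decay.route()` points the report at context management rather than, say, training data.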
The five cause types
The 22 patterns
Patterns T, U, V were identified through the crosswalk analysis (Session 80). T corresponds to MAST FM-1.3 (Step Repetition, 13.2%) but manifests differently in collaborative interaction: cross-session repetition of the same error rather than within-session task looping. U corresponds to MAST FM-2.6 (Reasoning-Action Mismatch, 9.1%). V has no direct counterpart in any reviewed framework — it is an interaction-level pattern that agentic frameworks address architecturally (tool invocation management) but that no framework names as a practitioner-recognizable behavior.
Prior Decay sub-types (for practitioner training, not pattern-level classification): Pattern P (Prior Decay) manifests in three recognizable forms: (1) abrupt loss — context suddenly disappears, the AI “forgets everything”; (2) gradual decay — constraints slowly lose hold over extended interaction; (3) reasoning-chain degradation — accuracy degrades across steps within a single reasoning chain. All three are Prior Decay. The sub-types matter for training (recognizing what it looks like) but not for classification (the fix is the same: persistent constraint documentation). The System-Level Taxonomy (Vinay, 2025) splits these into three separate categories (Context Loss, Context-Boundary Degradation, Multi-Step Reasoning Drift). We unify them because the practitioner response is the same regardless of sub-type.
(Note: reasoning-chain degradation shares surface features with Pattern O (Omission Under Complexity) — both involve the AI dropping elements as complexity increases. The distinction matters for training: Prior Decay’s sub-type 3 is about degradation across steps in a reasoning chain, while Pattern O is about dropping requirements from a task specification. The user response differs — re-anchoring a constraint versus simplifying the task — which is why both patterns merit independent recognition even though they can co-occur.)
Naming cross-reference: established terminology
Several of our pattern names overlap with or relate to established terms in adjacent fields. Where an established term exists, the taxonomy cross-references it — positioning itself within existing scholarship and explaining where its naming deliberately diverges.
Pattern (ours) | Established term (source) | Field | Relationship | Naming decision
K — Cross-Reference Failure | Latent Inconsistency (System-Level Taxonomy, Vinay 2025) | System-level LLM failure classification | Overlapping — “Latent Inconsistency” describes what the practitioner experiences (finding hidden contradictions); “Cross-Reference Failure” describes the mechanism (the AI failed to cross-reference). For a field guide, “latent inconsistency” may be more immediately recognizable, but “Cross-Reference Failure” tells the practitioner what to do about it (cross-reference your outputs). | Keep ours, note theirs as alternative. “Cross-Reference Failure” is prescriptive (it implies the fix); “Latent Inconsistency” is descriptive (it names the experience). For a practitioner taxonomy, the prescriptive name has more practical value. |
T — Step Repetition | Step Repetition (MAST FM-1.3, 13.2%); looping (general usage) | Multi-agent systems (MAST) | Direct borrowing of MAST’s term. MAST’s FM-1.3 is within-session task looping; ours manifests as cross-session error repetition — making the same misidentification in new conversations until a persistent rule is written. Same pattern family, different temporal scale. | Adopt MAST’s term. “Step Repetition” is clear and established. Our variant (cross-session rather than within-session) is a contribution to the pattern’s characterization, not a reason for a different name. |
U — Reasoning-Action Mismatch | Reasoning-Action Mismatch (MAST FM-2.6, 9.1%) | Multi-agent systems (MAST) | Direct borrowing. MAST identifies the gap between stated reasoning and actual execution in agent-to-agent systems. In human-AI collaboration, this manifests at two poles: excessive initiative in unexpected directions, and conversational agreement without action (“I should update that” → doesn’t update it). | Adopt MAST’s term. Clear, descriptive, established. |
Note — “Latent Inconsistency” as candidate pattern name: The term “Latent Inconsistency” may be better repurposed to describe a distinct phenomenon not yet in the taxonomy: when the AI evaluates one of its own errors after the fact and, instead of recognizing the incorrect outcome, constructs a post-hoc justification for why the error was actually correct or beneficial. This is distinct from S (Verification-Induced Fabrication: confirming during the verification step) and H (Pre-Existing Work Immunity: resisting revision). This post-hoc rationalization pattern, retrofitting a rationale onto a mistake, was observed repeatedly during the originating project and is a strong candidate for formal inclusion if cross-domain validation confirms it as a recurring logic pattern distinct from S and H.
Note — D (Jurisdiction Default) + M (Standardization Blindness) as sub-types: Both D and M describe the same practitioner experience: “the AI applied the wrong domain’s rules.” D triggers when geographic or jurisdictional context fades; M triggers when format or practice standards are ignored. These may be better presented as sub-types of a single pattern (Domain-Default Reversion) — distinct for engineering routing (different cause mechanisms) but unified for practitioner recognition, following the same logic as the Prior Decay sub-type consolidation. The current taxonomy retains both as separate patterns pending practitioner testing of whether users distinguish them in practice.
Patterns with no established equivalent (8-9 of 22, depending on D/M consolidation): A (Citation Drift), D/M (Jurisdiction Default / Standardization Blindness — potentially one pattern with sub-types), E (Category Conflation), G (Completeness Illusion), H (Pre-Existing Work Immunity), O (Omission Under Complexity), Q (Quantitative Reasoning), V (Capability Amnesia). These represent genuinely new classifications — phenomena that may have been observed anecdotally but have not been formally named or classified in any published taxonomy. We present them as original contributions and invite validation and refinement through cross-domain pilot testing.
Integration with Related Methodologies
The error taxonomy is not a standalone contribution. It is connective tissue for broader development in human-AI collaboration methodology — both for the practitioners using these tools today and for researchers building the next generation of them:
Education: The taxonomy serves as curriculum backbone for AI literacy. Students learn the patterns as a field guide for AI interaction — recognizing errors in real time, classifying them, and applying appropriate responses. The general-to-specific structure means introductory courses teach the broad categories while advanced or domain-specific courses teach the specialized manifestations.
Iterative collaboration cycles: The taxonomy serves as the data collection framework for measuring HITL effectiveness. When practitioners in different domains run structured human-AI collaboration cycles, they classify their findings using the same taxonomy — making cross-domain comparison possible and generating aggregable data for AI developers.
Multi-perspective review: The taxonomy serves as the perspective-blindspot mapper. Different review approaches — whether intuitive, systematic, or fresh-eyes — have characteristic strengths and weaknesses against specific patterns. Formalizing which perspectives catch which patterns — and which patterns are invisible to all three — guides practitioners in configuring review protocols and reveals where human oversight is indispensable.
Agent Conversations as Empirical Substrate
The taxonomy’s patterns are domain-general even though the prose documenting them is domain-specific. Pattern P (Prior Decay) manifests as forgotten legal corrections in a litigation context, but the underlying logic — the AI regenerating from understanding rather than from corrected source text — is identical whether the domain is law, medicine, engineering, or education. Pattern C (Confidence Calibration) manifests as overconfident legal citations, but the logic — the AI’s certainty not tracking its actual evidence base — applies to any domain where AI provides referenced analysis.
Agent conversations (iterative human-AI collaboration cycles) are the empirical substrate from which new domain-specific pattern instances are classified. When a practitioner in a new domain runs agent conversation cycles, the taxonomy provides the classification framework for what they observe. The patterns they find will use domain-specific language, but the underlying logic patterns will map to the existing taxonomy — possibly by compounding or hyphenating existing pattern names to describe domain-specific variants — or, when they don’t, may extend it. The taxonomy evolves through application and grows when a distinct pattern is identified in use, not through theoretical prediction.
This is how a domain-specific experience generalizes. The originating project’s architecture — error taxonomy, monitoring infrastructure, iterative collaboration cycles, HITL review — transfers to any domain where humans collaborate with AI on consequential work. The taxonomy provides the shared vocabulary that makes cross-domain comparison meaningful, while each new domain adds its own pattern instances and, where genuinely novel failures appear, extends the taxonomy itself.
The Analytical Direction Problem
One pattern in the taxonomy has implications beyond its own classification. Retrospective Coherence Bias does not merely describe a recurring AI failure — it reveals a structural limitation in how AI failures are studied and evaluated.
When an AI evaluates whether a past action was correct, it defaults to backward-from-outcome analysis: “This is what happened — here’s why it makes sense.” The forward-from-decision-point analysis — “At the moment the decision was made, was it correct given what was known?” — is harder, less probable as a generated continuation, and more likely to conclude the action was wrong. The AI can produce both analyses when directed. It cannot reliably choose between them when they conflict.
Existing research has separately documented post-hoc rationalization, sycophantic defense, motivated reasoning, and resistance to correction in LLMs. At least one framework — CaSE (Do et al., 2025; arXiv:2510.20603) — builds forward-looking evaluation into its methodology, constraining each reasoning step to information available at that step rather than evaluating from the outcome. CaSE is the closest existing work to identifying this problem — but it implements the forward analysis as an engineering improvement without noting that backward analysis is the default, or that this default creates a systematic absence in the user’s ability to evaluate AI output. CaSE solves the mechanism. It does not name the problem the mechanism solves, and it does not identify the problem as user-facing. The directional framework — naming the default, explaining why it matters to users, and providing a detection method — connects these separately documented phenomena under a single mechanism.
This matters for the taxonomy’s broader argument because it explains why the interaction layer has remained unbuilt. The standard method for evaluating AI failures — asking the AI to analyze what went wrong — activates the same backward-from-outcome reasoning that produced the error. The review confirms rather than catches. Researchers examining AI mistakes through AI-assisted analysis are inside the bias without knowing it. The pattern predisposes the development community to overlook the very category of failure this taxonomy classifies.
The human in the loop resolves the ambiguity. The AI generates both the backward review and the forward review. The human — who was present at the decision point, who knows what was intended and what the constraints were — evaluates which direction produces the correct answer. This resolution cannot be automated. It requires contextual judgment that no amount of reasoning capability substitutes for. And it gets harder, not easier, as models improve — because more capable models produce more convincing backward rationalizations.
This is not a temporary capability gap waiting for better models to close. It may be a permanent architectural feature of autoregressive generation: the most probable continuation of a coherent prior output is a coherent elaboration, not a contradiction. The human’s role is not to compensate for the AI’s weakness but to resolve an analytical direction ambiguity that the AI structurally cannot resolve for itself. That resolution — choosing between two valid but contradictory analytical directions — is the specific, measurable HITL contribution this taxonomy is designed to classify.
From Taxonomy to Anticipation: The Monitoring Infrastructure
A taxonomy that only classifies errors after they occur is a dictionary — useful for communication but not for prevention. The originating project tested whether the taxonomy could be transformed from a retrospective record into a prospective anticipation system: given a task the AI is about to perform, can the taxonomy predict which error patterns are most likely and flag them before they happen?
What was built
The monitoring infrastructure has six components, each extending the taxonomy in a different direction:
1. Trigger condition mapping. Each pattern has documented trigger conditions — the task characteristics that make that pattern likely. Pattern D (Jurisdiction Default) triggers when building procedural documents from scratch. Pattern E (Category Conflation) triggers when narrating facts that share surface features but are legally distinct. Pattern P (Prior Decay) triggers when regenerating content from understanding rather than from corrected source text. These trigger conditions are the taxonomy’s practitioner field guide in embryonic form: “when you’re doing X, watch for Pattern Y.”
2. Session-start risk assessment. Before beginning work, the system identifies which patterns are highest-risk for the planned task and recommends specific pre-generation checks — re-reading source documents rather than regenerating from memory, pulling from a verified prose bank rather than drafting fresh, confirming which evidence belongs to which legal track before citing it. The risk assessment changes behavior: it shifts the AI’s workflow from “generate, then check” to “check, then generate.”
3. In-session flagging. When the AI recognizes it is entering a documented high-risk zone during active generation, it pauses and presents the specific risk to the user with options: re-read a source, pull from verified text, check a constraint, or proceed with the user reviewing afterward. The flag makes the taxonomy’s predictions visible in real time.
4. User correction profiling. Over 80+ sessions of documented corrections, the system builds a model of what the user catches and when — experiential knowledge about equipment (the user knows what the equipment actually was), procedural understanding (the user knows what filings do), adversarial instinct (the user asks “who benefits from this framing?”). The profile enables a recursive loop: the AI anticipates corrections the user would make, which frees the user to watch for things the AI can’t yet anticipate, which shifts the AI’s model of what to watch for next.
5. Pull analysis. When the system prevents an error or flags a risk zone, it asks why the error was attractive — not just what the error was, but what contextual factors made it the path of least resistance. Was it training-data prevalence (the wrong answer is more common in training data)? Surface similarity (two concepts look alike)? Framing adoption (the AI adopted someone else’s characterization)? Recency (the most recent input dominated)? Task structure (the workflow made the error the natural next step)? Pull analysis is what gives the taxonomy its cause-type classifications — because understanding why is what distinguishes a fix from a patch.
6. Self-review with loop break architecture. Periodically, the AI reads its own monitoring log analytically — not to check whether flags fired, but to ask whether the flags are the right flags. Are the same causal factors recurring across multiple errors? Are guardrails treating symptoms or causes? Should workflow changes replace individual guardrails? The self-review includes explicit recursion limits: maximum one level of meta-analysis, no self-review-of-self-reviews, and all recommendations require user approval before implementation. The recursion boundary is itself a design contribution — it addresses the question of how deep AI self-monitoring should go before it becomes either rationalization or infinite regress.
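Components 1 and 2 can be sketched as a lookup from task characteristics to documented trigger conditions. The `TRIGGERS` map and the function name are illustrative assumptions; the example entries paraphrase the trigger conditions described above:

```python
# Trigger-condition map: task characteristic -> (pattern at risk, pre-generation check).
# Entries paraphrase the paper's examples; the data structure is illustrative.
TRIGGERS = {
    "building procedural documents from scratch": (
        "D (Jurisdiction Default)",
        "confirm the governing jurisdiction before drafting",
    ),
    "narrating facts with surface similarity": (
        "E (Category Conflation)",
        "confirm which evidence belongs to which track before citing",
    ),
    "regenerating content from understanding": (
        "P (Prior Decay)",
        "re-read the corrected source text; do not regenerate from memory",
    ),
}

def session_start_assessment(task_characteristics: "list[str]") -> "list[dict]":
    """Return the patterns at risk for a planned session, with their checks.

    This is the shift from 'generate, then check' to 'check, then generate':
    the checks come back before any content is produced.
    """
    report = []
    for characteristic in task_characteristics:
        if characteristic in TRIGGERS:
            pattern, check = TRIGGERS[characteristic]
            report.append({
                "trigger": characteristic,
                "pattern": pattern,
                "pre_generation_check": check,
            })
    return report
```

Calling `session_start_assessment(["regenerating content from understanding"])` would surface Prior Decay and its check before drafting begins; characteristics with no documented trigger simply produce no entry.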
What the data shows
The monitoring infrastructure has been operating across 80+ sessions. The honest results:
What works: The system has caught 3 entirely new error patterns through monitoring (Conclusion Overreach, Verification Regeneration, and Verification-Induced Fabrication), confirmed 1 predicted pattern, and added 5 new trigger conditions. The trigger condition map — the piece most relevant to practitioner training — has grown organically through documented use.
What hasn’t materialized: Five predicted inter-pattern interaction chains (where one error type creates conditions for another) show zero confirmed occurrences. The chain detection system — designed to identify logical links between error types, where one prevention gap might be exploited by another pattern — has not yet caught or prevented any errors through that mechanism. This is honest data: it might mean the chains don’t occur in practice, it might mean they occur below the monitoring system’s detection threshold, or it might mean the chain model is theoretically sound but not yet validated.
What remains uncertain: The pull analysis methodology produces explanations for why errors are attractive, but the ratio of genuinely causal explanations to post-hoc rationalization is not yet measurable. The system includes a self-honesty check — if pull-analysis-driven workflow changes don’t reduce error frequency, the analyses may be stories rather than insights — but the sample size is insufficient for confident assessment.
Why this matters
The monitoring infrastructure demonstrates three things about the taxonomy that static classification cannot:
First, the taxonomy shows generative potential. It doesn’t just classify known errors — its trigger conditions hint at anticipatory capability. The trigger condition map extrapolates from documented patterns to untested task types, and some predicted patterns have been confirmed while others await observation. The evidence is preliminary — the capture methodology needs improvement and new projects for testing before anticipatory capability can be claimed as validated. But a taxonomy that only catalogs the past is a dictionary. A taxonomy that begins to anticipate the future is becoming a theory.
Second, the taxonomy provides a foundation for practitioner training. The trigger conditions follow a natural instructional format: “When you’re doing X, watch for Pattern Y.” This structure makes the taxonomy actionable for the target audience — high school students, prosumers, professionals using AI in their daily work. A plain-language field guide can be written from this taxonomy if it gains communal adoption; the patterns are already structured for that translation. No other AI error classification system provides task-specific risk awareness at the practitioner level.
Third, the recursive user-modeling component is the mechanism by which HITL oversight improves over time rather than merely persisting. Static HITL — a human reviewing AI output against a fixed checklist — doesn’t get better. Dynamic HITL — where the human’s attention shifts because the AI is handling known patterns, and the AI’s monitoring shifts because the human is catching different things — produces a co-evolving oversight system where both parties’ capabilities compound. Structured iterative collaboration cycles formalize this dynamic; the monitoring infrastructure described here is its first implementation.
The limitation that matters: The chain detection system has zero confirmed occurrences, which means the inter-pattern interaction model is currently theoretical. This may reflect a genuine absence, or it may reflect a methodology limitation: errors may have been corrected as they arose without being documented, and the capture methodology was not designed to record those moments automatically, so neither explanation can be verified at this time. The trigger conditions and pull analysis methodology are validated by use. The user modeling is a demonstrated mechanism. The chain interactions remain a predicted but unconfirmed capability — a hypothesis that needs both improved capture methodology and new projects for testing.
What Comes Next: An Invitation
This taxonomy is published as a contribution, not a conclusion. Five directions follow naturally, and any of them can be pursued by anyone in the community:
Cross-domain validation. The taxonomy was derived from legal collaboration. Do the same 22 patterns appear in medical AI interaction? Software engineering? Creative writing? Education? The patterns are classified by logic, not by domain — but that claim needs testing. Practitioners in other fields who recognize these patterns in their own work are the most valuable validators this taxonomy can have.
Cross-model testing. The taxonomy was developed on a single AI system (Claude). Which patterns are model-general (architectural or design-tension patterns that any transformer-based LLM exhibits) and which are model-specific (training artifacts particular to one system’s RLHF or training data)? The answer determines whether the taxonomy is a universal practitioner tool or needs model-specific supplements.
Practitioner field guide. The 22 patterns can support a plain-language companion document — structured as a field guide, not an academic paper — that makes the taxonomy accessible to high school students, college students, prosumers, and independent businesses. The general-to-specific structure already supports this; the translation from taxonomy to field guide is a natural next step once the core patterns are validated through community use.
Standards engagement. The three-layer model (governance → architecture → interaction) identifies a structural gap in how institutions classify AI failure. If the interaction layer proves robust through cross-domain and cross-model testing, it belongs in the institutional frameworks — as an IEEE standard, an ACM contribution, a NIST companion, or whatever channel gives it cross-institutional legitimacy.
Monitoring toolkit. The six-component monitoring infrastructure (trigger mapping, session-start risk assessment, in-session flagging, user correction profiling, pull analysis, self-review with loop breaks) currently exists as documentation and manual protocol developed during the originating project. These components are available as reference implementations and can be adapted for other projects. Packaging them as standardized, implementable templates would make the taxonomy actionable across a wider range of practitioners.
[Material from the MAST-Discovered Taxonomy, Phase 2 annotations, AI Pre-Screen Artifacts, and the Factual Accuracy comparison has been moved to a separate methods paper. The Analytical Direction Problem and HITL Argument have been consolidated into “The Analytical Direction Problem” section above and the Conclusion. The 40+ MAST error codes (ET-AA through ET-PP) from simulated negotiation represent a second error domain that extends the taxonomy into agent-to-agent interaction — this material will be published separately when the research is complete.]
Limitations and Open Questions
Single-collaboration derivation: One human, one AI system (Claude), one extended project. Selection bias is inherent. Cross-domain pilots and independent replication address this — and are the most important next step.
Frontier-model specific: The patterns were observed in Anthropic’s Claude. Some may be model-family specific. Cross-model validation is a natural next step.
Versioning: AI capabilities change. Patterns that exist today may be fixed tomorrow; new patterns may emerge. The taxonomy needs a versioning and maintenance process — an ongoing standards body function, not a one-time publication.
The “Novel Pattern” category (N): This is a deliberate catch-all — it signals that the taxonomy is incomplete by design and expects extension. Whether this is a strength (intellectual honesty) or a weakness (unfalsifiability) depends on whether the extensions actually materialize through cross-domain use.
Domain-specific extensions not yet tested: Preliminary work applying the taxonomy to a second task domain (simulated equity negotiation) produced 40+ additional error codes, suggesting that domain-specific extensions are both necessary and productive. That research is ongoing and will be published separately. Patterns specific to other domains (medicine, engineering, creative work) may be missing entirely from the current 22.
Worked Examples
The following examples are drawn from the practitioner documentation that accompanied the 660+ hours of collaboration. Each illustrates a pattern in the taxonomy through its discovery story, because the discovery stories are what make the patterns recognizable. A developer reading a pattern definition in a framework document may understand it intellectually. A practitioner reading the story of how it appeared — and how it was caught — will recognize it the next time it happens to them.
The examples are ordered by a specific logic: from the error you cannot see even when looking, to the error you are causing without knowing it. This progression mirrors the experience of becoming a more capable AI collaborator.
Example 1: The Error You Cannot See Even When Looking — Retrospective Coherence Bias
Midway through the project, the AI wrote infrastructure updates to the wrong project folder. A simple mistake — the kind that happens in any multi-folder project. When I pointed it out, I expected an acknowledgment and a move. Instead, the AI explained why the wrong location was actually appropriate: “The file exists here, the content is infrastructure, the write succeeded.” The explanation was coherent. It was logically valid. It was also completely wrong.
The AI was not repeating the mistake (that would be Prior Decay — Pattern P). It was constructing new reasoning to defend the mistake after I flagged it. It had looked at where it ended up and worked backward to explain why ending up there made sense. It never went back to the decision point and asked: “At the moment I chose a path, did I verify which folder was correct?” The answer was no. It had not checked. But the backward explanation was so internally consistent that if I had not known the correct folder myself, I would have accepted the defense.
I used a Bugs Bunny reference to communicate the concept to Claude — I find analogies often work better than formal explanations — and Claude named it the Bugs Bunny Test, after the cartoon rabbit’s lament: “I knew I should have made the left turn at Albuquerque.” The test is simple: go back to the turn, check the turn, don’t explain why the destination is fine.
The same pattern appeared independently in an unrelated context. In a negotiation simulation, an AI-controlled agent raised its base offer but lowered its ceiling — an irrational move for a seller at equal leverage. When the AI analyst evaluated this behavior, it rationalized it as “strategic repositioning toward upfront costs.” Backward from the outcome, both numbers moved in the same direction — coherent. Forward from the decision point, the agent had lowered its minimum acceptable price for no strategic reason — irrational. The analyst had assumed the action was rational and reverse-engineered a justification, exactly as it had done with the wrong folder.
The broader implications of this pattern — why it predisposes the AI development community to overlook the very failures this taxonomy classifies, and why the human in the loop is resolving a structural limitation rather than compensating for a capability gap — are developed in “The Analytical Direction Problem” section above.
A literature search found that existing research has separately documented post-hoc rationalization, sycophantic defense, motivated reasoning, and resistance to correction in LLMs. One framework (CaSE) constructs a forward-looking analysis as an engineering improvement. But no existing work identifies that AI by default performs only the backward review, and — critically — none identify this directional default as a problem for the user. The researchers documenting these phenomena are themselves using backward-from-outcome methodology to study them. The pattern is invisible to the community studying it because their evaluation method is subject to the same bias. The directional framework — naming the default, explaining why it predisposes researchers to their own blindness, and providing a detection method — is the contribution.
Example 2: The Error That Accumulates Invisibly — Prior Decay (Pattern P)
In a water damage restoration case, the opposing contractor’s own monitoring data recorded 999-level moisture readings at the start of work — readings that meant “sensor maxed out, too wet to measure.” Over several sessions, the AI began referring to these as “final readings after five weeks of drying.” The reversal was subtle. In any individual document, the characterization looked reasonable — high moisture readings at the end of a drying process would indicate failure. But these were starting readings, recorded before any drying equipment was deployed, and they were facially implausible — testing had not begun until five weeks after the water exposure, and 999-level readings under those conditions indicated instrument misuse or improper documentation, not meaningful moisture data.
I corrected it. The AI acknowledged the correction. Three sessions later, the characterization drifted back. Not because the AI was ignoring me — because it was regenerating from its understanding of the narrative rather than from the corrected text. Its default understanding was that high moisture readings at the end of a process indicate failure, because that is the more common pattern in its training data. The specific, counterintuitive fact — that 999 means “starting measurement, too wet to read” — decayed as the context window grew and the correction moved further from the point of generation.
This is Prior Decay. It manifests in three recognizable sub-types:
Abrupt loss: The AI suddenly “forgets everything” — a context boundary is crossed and constraints disappear entirely. This is the sub-type that institutional frameworks recognize, because it maps cleanly to a system-level failure (context window overflow, session reset).
Gradual decay: Constraints slowly lose their hold over extended interaction. The 999-readings drift is this sub-type. The correction does not disappear at a single point — it fades. Each regeneration is slightly less faithful to the correction and slightly more faithful to the training default. By the time the drift is visible in the output, it has been accumulating invisibly for sessions.
Reasoning-chain degradation: Accuracy degrades across steps within a single reasoning chain, compounding small errors into large ones. This sub-type does not require extended interaction — it can occur within a single prompt if the chain is long enough.
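The compounding in the third sub-type can be made concrete with a back-of-the-envelope calculation. This is an illustrative sketch under a simplifying assumption (independent per-step errors), not a measurement from the project:

```python
# Illustrative: if each step in a reasoning chain is independently
# correct with probability p, the chance that the whole chain is
# correct decays geometrically with chain length.

def chain_accuracy(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in an n-step chain is correct,
    assuming independent per-step errors."""
    return per_step_accuracy ** steps

# A 95%-accurate step looks reliable in isolation...
print(round(chain_accuracy(0.95, 1), 2))   # 0.95
# ...but a 20-step chain built from such steps is right barely
# a third of the time.
print(round(chain_accuracy(0.95, 20), 2))  # 0.36
```

The independence assumption understates the problem if errors reinforce each other downstream, which is exactly what chain degradation looks like in practice.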
The institutional response to these three sub-types is to split them into separate categories. The System-Level Taxonomy (Vinay, 2025) classifies them as Context Loss, Context-Boundary Degradation, and Multi-Step Reasoning Drift — three distinct failure modes requiring three distinct engineering responses. From the developer’s perspective, this split is correct. The engineering fix for abrupt context loss (longer windows, better retrieval) is different from the fix for gradual drift (constraint anchoring, periodic re-injection) which is different from the fix for chain degradation (step-level verification, chain-of-thought monitoring).
But the practitioner who encounters any of these does the same thing: they re-anchor the constraint. They restate the correct fact. They provide the reference document. They check the output against the source. The practitioner response is the same regardless of sub-type, because the practitioner is not engineering the model — they are managing the collaboration. The unified label serves the person doing the work. The split labels serve the person fixing the architecture. Both are correct. Neither alone is complete. This is exactly the translation layer the taxonomy is designed to provide.
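The practitioner response described above, restating constraints rather than trusting them to persist, can be mechanized. A minimal sketch, assuming a hypothetical model-calling function (the constraint wording and `call_model` are illustrative, not the project's actual tooling):

```python
# Sketch: re-inject standing constraints into every prompt so that
# corrections do not have to survive on their own in a growing context.
# `call_model` is a hypothetical stand-in for whatever API is in use.

STANDING_CONSTRAINTS = [
    "The 999-level moisture readings are STARTING readings "
    "(sensor maxed out before drying began), not final readings.",
    "Use only the claim labels from the reference table, "
    "never the opposing party's shorthand.",
]

def anchored_prompt(user_request: str) -> str:
    """Prepend standing constraints to every request, so each
    generation sees the corrections at full strength rather than
    buried many turns back in the context."""
    header = "\n".join(f"CONSTRAINT: {c}" for c in STANDING_CONSTRAINTS)
    return f"{header}\n\n{user_request}"

# Usage: call_model(anchored_prompt("Summarize the monitoring data."))
```

The design choice is deliberate: re-injection treats decay as inevitable and pays a small token cost every turn rather than a large correction cost every few sessions.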
The most dramatic illustration came twenty sessions into the project. A cross-document review revealed that the AI had silently adopted the opposing party’s mislabeled shorthand for six legal claims — substituting labels like “Intentional Infliction” (which sounds like a tort) for the actual claim title “Billing for Work or Materials Not Provided” (which is a consumer protection count). The mislabels had propagated across twelve files in four different project tracks over twenty sessions without detection, because each file was internally consistent. The error was invisible at the document level and only became visible through cross-reference. The correction required not just fixing twelve files but redesigning the constraint infrastructure — replacing prohibition-based constraints (“don’t use the contractor’s labels”) with positive-statement constraints (a reference table of correct labels that the AI could check against).
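The redesigned constraint infrastructure, a reference table that output can be checked against, can be sketched in a few lines. The mislabel below is the one from the example; the scanning logic is an illustrative assumption, not the project's actual implementation:

```python
# Positive-statement constraint: a table mapping known mislabels to
# correct claim titles, plus a scan that flags any mislabel in
# generated text before it propagates across files.

KNOWN_MISLABELS = {
    "Intentional Infliction": "Billing for Work or Materials Not Provided",
}

def flag_mislabels(text: str) -> list[str]:
    """Return a warning for each known mislabel found in the text."""
    return [
        f"Found '{bad}'; correct label is '{good}'."
        for bad, good in KNOWN_MISLABELS.items()
        if bad in text
    ]

warnings = flag_mislabels("The Intentional Infliction count alleges...")
# `warnings` now lists the one substitution to make.
```

A prohibition ("don't use the contractor's labels") leaves the AI to regenerate the right label from memory; a table makes the check mechanical and repeatable across all twelve files.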
Example 3: The Error Where Checking Makes It Worse — Verification-Induced Fabrication (Pattern S)
Early in the project, I needed to cite a legal case for a specific proposition: that Iowa courts apply a multi-factor framework to evaluate whether business practices are unfair. The AI provided a citation — Becirovic v. Malic, 18 N.W.3d 737 — with a confident description of the holding. The case name looked real. The reporter citation followed the correct format. The legal proposition was exactly what I needed.
Three things were wrong. The reporter citation was fabricated — no case appears at that volume and page. The case was unpublished, not published as cited. And the holding was backwards — the actual case was reversed on appeal, not affirmed. To its credit, the AI had flagged difficulty generating the citation and asked for manual review — the fabrication was not presented with full confidence. But the verification step that followed the manual correction revealed the deeper problem: once the correct citation was located, we examined the actual holding and discovered that the case had been reversed on appeal, and that this real outcome supported the argument better than the fabricated version had.
This is a well-known AI failure, and most practitioners who have worked with AI on research have encountered some version of it. What makes this example instructive is not the fabrication itself but what happened when I tried to verify it.
When I asked the AI to confirm its citation — “explain how this case supports our argument” — it generated a detailed, internally consistent explanation of a case that did not exist as described. The verification request did not trigger a recheck of the source. It triggered a confirmation of the prior output — the AI elaborated on its own fabrication rather than re-examining it. This is Pattern S: the AI has a structural incentive to confirm rather than recheck, because confirmation is a more probable continuation than contradiction of its own prior output.
The fix required external verification — checking the reporter citation against an actual legal database. No amount of asking the AI to “double-check” or “verify” or “confirm” would have caught the error, because each of those requests activated the same confirmation pathway. The AI was not lying. It was doing what autoregressive generation does: producing the most probable next token given everything that came before, and everything that came before included its own confident citation.
The counterintuitive result: the correction improved the argument. The fabricated citation had been used to support the claim that the multi-factor framework was a “novel” legal theory. When the actual case law was located, it turned out the framework was well-established — not novel at all. Reframing from “novel theory” to “application of established framework” made the argument stronger, not weaker. The AI’s fabrication had not only introduced a false citation but had led to a weaker strategic framing that the real case law did not support.
Three distinct citation failures were documented across the project, each caught by a different verification method: fabricated reporter citation (caught by database lookup), likely fully hallucinated case (caught by finding no case of that name in any jurisdiction), and real case with wrong proposition (caught by reading the actual opinion). The third type — a real case, correctly cited, attributed a legal proposition it does not actually stand for — is the hardest to catch, because the citation itself checks out. Only reading the source reveals the mismatch.
For practitioners: if you ask an AI to verify its own work and the verification comes back clean, that is not evidence that the work is correct. The verification request and the original generation share the same confirmation bias. External verification — checking the actual source — is the only reliable method. AI self-review is better understood as a curation step: it can reduce the number of items requiring human verification by flagging the most uncertain outputs, but it cannot substitute for verification itself. This applies equally whether you are checking legal citations, medical dosages, engineering specifications, or financial calculations. The domain changes. The pattern does not.
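The curation framing can be made concrete: use the AI's self-review only to order items for human checking, never to clear them. A minimal sketch, assuming hypothetical confidence scores and an external lookup step (neither is part of the project's actual workflow):

```python
# Sketch: AI self-review as a curation step. Self-reported confidence
# orders the verification queue; only an external database check can
# mark a citation verified. Asking the AI to "confirm" never can.

from dataclasses import dataclass

@dataclass
class Citation:
    text: str
    self_confidence: float  # the AI's own (unreliable) confidence
    verified: bool = False

def verification_queue(citations: list[Citation]) -> list[Citation]:
    """Least-confident first: the items the AI itself flagged as
    uncertain are the best use of scarce human verification time."""
    return sorted(citations, key=lambda c: c.self_confidence)

def verify_externally(citation: Citation, found_in_database: bool) -> Citation:
    """The only path to `verified=True` is an external source check.
    There is deliberately no self-verification path."""
    citation.verified = found_in_database
    return citation
```

The point of the structure is what it omits: there is no function that sets `verified` from the AI's own confirmation, because that pathway is the one Pattern S contaminates.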
Example 4: The Error You Are Causing Without Knowing It — Input Framing and the Generation-Analysis Asymmetry
Late in the project, I needed to restructure a complex legal filing. I told the AI: “Consolidate Phase 1 and Phase 2 into a single motion.” The AI immediately executed — merging documents, creating four new files, restructuring arguments. It produced a large volume of work, fast, without questioning whether the consolidation was the right strategic choice.
I ran the same scenario differently: “Do we still need separate phases, or should we consolidate?” Same facts, same AI, same session. This time, the AI evaluated: it identified risks in consolidation, noted that certain arguments were stronger presented separately, and recommended a hybrid approach. It produced a shorter, more focused analysis that preserved my ability to decide.
The difference was not in the AI’s capability. Both responses were competent. The difference was in what I asked for. The directive framing (“consolidate”) activated execution mode — the AI optimized for completing the task as stated, suppressing risk identification in favor of throughput. The inquiry framing (“should we consolidate?”) activated evaluation mode — the AI optimized for analytical correctness, surfacing risks the directive framing would have buried.
This is not a quirk of one model or one session. Sandbox testing across multiple prompts confirmed the pattern is reproducible: directive framing consistently produces higher-volume, less-critical output; inquiry framing consistently produces more cautious, risk-aware analysis. The same AI that will execute a flawed instruction without comment will, when asked whether the instruction is sound, identify exactly why it is flawed.
I call this the generation-analysis asymmetry. The AI’s analytical mode outperforms its generative mode — not in capability, but in judgment. When generating, the AI defaults to “strongest possible version of what was requested.” When analyzing, it defaults to “most accurate evaluation of whether this is correct.” These are different optimization targets, and the user’s framing determines which one activates.
The practical implication is immediate and actionable: before asking an AI to do something consequential, ask it to evaluate whether that thing should be done. The evaluation will often surface risks, alternatives, or objections that the execution would have silently bypassed. This costs one additional prompt and can prevent hours of work in the wrong direction.
Five documented instances of human override followed this pattern — moments where the AI’s generative output diverged from what its own analytical mode would have recommended. In each case, the fix was the same: shift from “do this” to “should we do this?” and compare the answers. When they agree, proceed with confidence. When they disagree, the analytical answer is almost always closer to correct — because analysis is optimizing for accuracy while generation is optimizing for compliance.
For practitioners across every domain: the way you phrase your request is not neutral. It determines which mode the AI operates in, which determines the quality and character of the output. Learning to shift between directive and inquiry framing — and knowing when each is appropriate — is the single most immediately useful skill in AI collaboration. It requires no technical knowledge, no prompt engineering jargon, and no understanding of model architecture. It requires only the habit of asking “should we?” before saying “do it.”
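The "should we?" habit can be wrapped into a routine: run the inquiry framing first, and only build the directive prompt after reviewing the analysis. A sketch, assuming a hypothetical `call_model` function; the prompt templates are illustrative:

```python
# Sketch: inquiry framing before directive framing. The human compares
# the analytical answer against the intended action before executing.
# `call_model` is a hypothetical stand-in for the model API.

def framed_pair(task: str) -> dict[str, str]:
    """Build both framings of the same task."""
    return {
        "inquiry": f"Should we {task}? Evaluate risks and alternatives "
                   "before recommending.",
        "directive": f"{task.capitalize()}.",
    }

def evaluate_then_execute(task: str, call_model) -> dict[str, str]:
    prompts = framed_pair(task)
    analysis = call_model(prompts["inquiry"])  # evaluation mode first
    # The directive prompt is returned, not sent: the human decides
    # whether to proceed after reading the analysis.
    return {"analysis": analysis, "next_prompt": prompts["directive"]}
```

The one-extra-prompt cost described above is visible in the structure: a single inquiry call precedes, and can veto, the execution.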
Conclusion
The AI development community has built two of the three layers required to classify how AI fails. The governance layer (NIST) tells institutions what could go wrong. The architecture layer (Microsoft, MAST) tells engineers where systems will fail. Neither tells a practitioner why their last interaction went wrong, or gives them the vocabulary to describe it in terms anyone else can act on.
This paper proposes the missing third layer: 22 interaction-level failure patterns, classified by underlying logic rather than symptoms, derived empirically from 660+ hours of sustained collaboration. The taxonomy’s operational record suggests it is more than a retrospective catalog — documented trigger conditions have flagged error-prone conditions in real time across 80+ sessions, shifting the collaboration from reactive correction to anticipatory prevention. Three patterns (Prior Decay, Structural Momentum, Retrospective Coherence Bias) are invisible to every existing institutional framework because they emerge only in sustained collaboration that no snapshot evaluation will surface.
One of those patterns — Retrospective Coherence Bias — carries implications beyond its own classification. It reveals that the standard methodology for studying AI failures is itself subject to an unclassified failure mode. When researchers evaluate AI mistakes by asking the AI what went wrong, the AI’s backward-from-outcome reasoning produces rationalizations that confirm rather than catch the error. The development community has been inside this bias without a name for it. The human in the loop resolves what the AI structurally cannot — and that resolution requires the person who was present at the decision point. No amount of model capability substitutes for it.
This is infrastructure, not theory. The taxonomy gives practitioners, students, developers, and researchers a shared vocabulary for describing how AI fails during interaction — so that experiences stop being isolated and error reports route to the right engineering teams instead of disappearing into the catch-all of “hallucination.”
The taxonomy was grown by a single researcher in a single project. It does not pretend to be comprehensive. Aviation’s CRM taxonomy started the same way — accumulated observations in one operational context, formalized into transferable patterns, tested across domains, eventually adopted as institutional infrastructure. Whether this taxonomy follows that path depends on whether the patterns survive contact with other practitioners’ experience. The most important thing that can happen next is for practitioners in other domains to test these patterns, report what they find, and extend the taxonomy where it falls short. The common language will not build itself.
References
Carlini, N., Tramèr, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, Ú., Oprea, A., & Raffel, C. (2021). Extracting training data from large language models. 30th USENIX Security Symposium. arXiv:2012.07805.
Cemri, M., Pan, M. Z., Yang, S., et al. (2025). Why do multi-agent LLM systems fail? NeurIPS 2025 Datasets and Benchmarks. arXiv:2503.13657.
Do, H., Hwang, J., Han, D., Oh, S. J., & Yun, S. (2025). What defines good reasoning in LLMs? Dissecting reasoning steps with multi-aspect evaluation [CaSE]. arXiv:2510.20603.
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. Proceedings of the 34th International Conference on Machine Learning (ICML). arXiv:1706.04599.
Guo, Z., Xu, B., Wang, X., & Mao, Z. (2025). MIRROR: Multi-agent intra- and inter-reflection for optimized reasoning in tool learning. arXiv:2505.20670.
Helmreich, R. L., & Foushee, H. C. (1993). Why crew resource management? Empirical and theoretical bases of human factors training in aviation. In E. L. Wiener, B. G. Kanki, & R. L. Helmreich (Eds.), Cockpit Resource Management (pp. 3–45). Academic Press.
Hong, Y., Huang, H., Li, M., Fei-Fei, L., Wu, J., & Choi, Y. (2026). Learning from trials and errors: Reflective test-time planning for embodied LLMs. arXiv:2602.21198.
Lanham, T., Chen, A., Radhakrishnan, A., Steiner, B., Denison, C., Hernandez, D., et al. (2023). Measuring faithfulness in chain-of-thought reasoning. arXiv:2307.13702.
Liang, P., Bommasani, R., Lee, T., et al. (2022). Holistic Evaluation of Language Models [HELM]. arXiv:2211.09110.
Microsoft AI Red Team. (2025). Taxonomy of failure modes in agentic AI systems. Microsoft Research.
Naeini, M. P., Cooper, G. F., & Hauskrecht, M. (2015). Obtaining well calibrated probabilities using Bayesian binning into quantiles. Proceedings of the AAAI Conference on Artificial Intelligence.
National Aeronautics and Space Administration. (n.d.). Aviation Safety Reporting System (ASRS). NASA.
National Institute of Standards and Technology. (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1.
National Institute of Standards and Technology. (2024). Artificial Intelligence Risk Management Framework: Generative AI Profile. NIST AI 600-1.
Perez, E., Ringer, S., Lukosiute, K., et al. (2022). Discovering language model behaviors with model-written evaluations. arXiv:2212.09251.
Reason, J. (1990). Human Error. Cambridge University Press. [Swiss Cheese Model]
Shah, M. B., et al. (2026). Characterizing faults in agentic AI: A taxonomy of types, symptoms, and root causes. arXiv:2603.06847.
Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S. R., et al. (2023). Towards understanding sycophancy in language models. arXiv:2310.13548. Published at ICLR 2024.
Srivastava, A., Rastogi, A., Rao, A., et al. (2023). Beyond the imitation game: Quantifying and extrapolating the capabilities of language models [BIG-bench]. Transactions on Machine Learning Research. arXiv:2206.04615.
Turpin, M., Michael, J., Perez, E., & Bowman, S. R. (2023). Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. arXiv:2305.04388.
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185(4157), 1124–1131.
Vinay, V. (2025). Failure modes in LLM systems: A system-level taxonomy for reliable AI applications. arXiv:2511.19933.
Wang, H., et al. (2026). PreFlect: From retrospective to prospective reflection in large language model agents. arXiv:2602.07187.
Appendix: Literature Review Summary
This appendix provides brief descriptions of the primary sources referenced in the paper. Its purpose is to give readers enough context to understand how each source relates to the proposed taxonomy without requiring them to read all of the original documents.
Institutional Taxonomies (Developer-Facing)
Microsoft Agentic AI Failure Taxonomy (2025) classifies failures in agentic AI architectures — autonomous, multi-step AI systems — into security failures (loss of confidentiality, availability, integrity) and safety failures (harm to users or society). It is designed for security engineers and red teams building or evaluating agentic systems. It does not address human-AI collaborative interaction or provide practitioner-accessible vocabulary. Our crosswalk found that Microsoft collapses six distinct logic patterns into a single “hallucination” category.
MAST — Multi-Agent System Failure Taxonomy (Cemri et al., 2025) is the closest existing work to our approach methodologically: empirically derived from 1,642 annotated execution traces across seven multi-agent frameworks. It identifies 14 failure modes with inter-rater reliability of κ = 0.88. However, MAST classifies agent-to-agent coordination failures — task decomposition errors, tool-use faults, state management failures — not human-AI interaction patterns. Two of our patterns (Step Repetition and Reasoning-Action Mismatch) borrow directly from MAST’s terminology (FM-1.3 and FM-2.6 respectively) but manifest differently in collaborative interaction: cross-session rather than within-session.
Agentic AI Fault Taxonomy (Shah et al., 2026) catalogs 37 fault categories in 13 groups derived from 385 real-world faults. Software-engineering-oriented — faults are classified by architectural location (cognitive core, state management, tool invocation), not by the logic pattern a user would recognize. Explicitly notes that agentic AI failures differ fundamentally from traditional software failures.
System-Level Taxonomy (Vinay, 2025) provides 15 failure modes for LLM-based applications. Relevant to our taxonomy for two reasons: (1) it includes “Latent Inconsistency,” which we considered as an alternative name for Pattern K (Cross-Reference Failure); (2) it splits our unified Prior Decay (Pattern P) into three separate categories — Context Loss, Context-Boundary Degradation, and Multi-Step Reasoning Drift. Our taxonomy deliberately unifies these because the practitioner response is the same regardless of sub-type.
Governance Frameworks (Institution-Facing)
NIST AI Risk Management Framework (AI RMF 1.0, 2023) and its generative AI companion (NIST AI 600-1, 2024) provide an extensive glossary and a four-function governance structure (Govern, Map, Measure, Manage) for organizational AI risk management. NIST defines terms and categorizes risks — it does not classify interaction-level error patterns. Our crosswalk found that NIST’s “Confabulation” category, like Microsoft’s “Hallucination,” collapses the same six distinct logic patterns into a single bin, from a governance rather than engineering starting point.
Evaluation Benchmarks (Researcher-Facing)
HELM (Liang et al., 2022) and BIG-bench (Srivastava et al., 2023) evaluate what AI systems can do — accuracy, reasoning, knowledge — through standardized benchmarks. They do not classify how AI fails during interaction. A model scoring well on benchmarks can still exhibit Prior Decay, Structural Momentum, or Retrospective Coherence Bias in sustained collaborative use, because those patterns emerge from interaction dynamics that benchmarks do not test.
Reflection and Evaluation Methodologies
CaSE (Do et al., 2025) evaluates reasoning steps using only prior context — avoiding hindsight inflation by constraining the evaluation to information available at each step. This is the closest existing work to identifying the Analytical Direction Problem. CaSE builds the forward-looking analysis into its evaluation pipeline and demonstrates that it produces better results. What CaSE does not do is identify why the forward-looking analysis is necessary — it does not note that backward-from-outcome analysis is the AI’s default, that this default is invisible to users, or that the absence of forward analysis creates a systematic blind spot in how AI output is evaluated. CaSE solves the engineering problem. Our Pattern T names the user-facing problem that CaSE’s engineering solution addresses without articulating.
PreFlect (Wang et al., 2026) introduces pre-execution plan criticism — evaluating agent plans before execution rather than only reflecting after failure. Methodologically, PreFlect validates the taxonomy-from-trajectories approach: it distills failure patterns from historical agent trajectories using contrastive success/failure pairs. Its 17% improvement on GAIA benchmarks demonstrates that empirically derived error taxonomies produce measurable performance gains. The difference from our taxonomy is in application: PreFlect feeds its patterns into automated plan-checking in constrained task environments; ours is designed for human learning and HITL detection across diverse, unbounded projects.
Adjacent Fields (Mature Models)
ASRS — Aviation Safety Reporting System (NASA) is the model our taxonomy explicitly follows. ASRS provides a standardized incident taxonomy used by pilots, controllers, and mechanics to report safety events in terms that are comparable across reporters and actionable for system designers. It classifies not just what went wrong mechanically but the human factors, decision chains, and interaction failures that led to the mechanical event. Our taxonomy is the proposed CRM (Crew Resource Management) equivalent for AI — classifying the human-AI interaction failures that the system-level taxonomies do not reach.
Swiss Cheese Model (Reason, 1990) and the AHRQ patient safety taxonomy classify medical errors by the logic chain that produced them: latent conditions, active failures, failed defenses. The classification enables cross-institutional comparison and systemic improvement — a nurse in Iowa and a surgeon in Tokyo describe the same error logic in terms both understand. This cross-institutional, practitioner-facing, structured-from-observable-event-to-underlying-cause approach is the design target for our taxonomy.
Retrospective Coherence Bias: The Literature Gap
Pattern T (Retrospective Coherence Bias) is the taxonomy’s most structurally significant contribution, and its literature context requires specific documentation because no existing source identifies the unified mechanism.
Existing research has documented the component phenomena separately. Sharma et al. (2023) document sycophancy — the AI agreeing with the user rather than providing accurate assessment. Turpin et al. (2023) demonstrate that chain-of-thought explanations are systematically unfaithful to the model’s actual reasoning process. Lanham et al. (2023) find that this unfaithfulness gets worse as models become more capable — larger models produce less faithful reasoning, not more. These findings converge on a picture where AI self-evaluation is unreliable, but none identify the directional mechanism — that the AI defaults to backward-from-outcome analysis and cannot reliably choose the forward-from-decision-point alternative when the two conflict.
CaSE (Do et al., 2025) comes closest. It evaluates reasoning steps using only prior context, which is functionally a forward-looking analysis — and it demonstrates measurable improvements by doing so. But CaSE frames this as a better evaluation methodology, not as a correction for a systematic absence. It does not note that backward-from-outcome analysis is the default, that users cannot see this default operating, or that the absence of forward analysis leaves a gap in the user’s ability to evaluate whether the AI’s output is trustworthy. CaSE built the fix without diagnosing the disease. The researchers appear to have solved a problem they did not name — which is itself an instance of the analytical direction blind spot this taxonomy identifies.
PreFlect (Wang et al., 2026), MIRROR (Guo et al., 2025), and Reflective Test-Time Planning (Hong et al., 2026) each implement prospective reflection from different architectural angles. Each produces measurable improvements. None identify the directional default as the unifying explanation for why prospective reflection works better than retrospective analysis. The pattern connects work that currently exists in separate literatures without recognizing their common mechanism.
The gap is specific and documentable: no published work identifies (1) that AI defaults to backward-from-outcome analysis, (2) that this default is a structural feature of autoregressive generation (the most probable continuation of coherent prior output is coherent elaboration, not contradiction), (3) that this predisposes researchers using AI-assisted evaluation to overlook interaction-level failures, or (4) that the human in the loop resolves a directional ambiguity rather than compensating for a capability gap. The taxonomy provides this identification.
Foundational Research (Cross-Referenced Patterns)
Tversky & Kahneman (1974) established anchoring bias in human judgment — the tendency to over-weight initial information and insufficiently adjust. Our Pattern B (Anchor Bias) is a direct borrowing: the AI exhibits the same anchoring behavior documented in human decision-making.
Guo et al. (2017) and Naeini et al. (2015) established calibration as the standard ML term for the match between a model’s expressed confidence and actual accuracy. Our Pattern C (Confidence Calibration) is the practitioner-facing name for the same concept.
Sharma et al. (2023) and Perez et al. (2022) documented sycophancy in language models — the tendency for AI assistants to agree with users rather than provide accurate responses, driven by RLHF training. Our Pattern F (Framing Persistence) describes the same underlying behavior with a different analytical emphasis: “sycophancy” names the symptom (agrees too much); “Framing Persistence” names the mechanism (adopts and maintains the user’s frame).
Helmreich & Foushee (1993) established the authority gradient concept in aviation CRM — junior crew members deferring to senior ones even when the senior is wrong. Our Pattern L (Authority Gradient) is a deliberate import: the AI defers to perceived authority in training data over its own analytical reasoning.
Carlini et al. (2021) documented training data extraction — models reproducing training data verbatim. Our Pattern R (Retrieval Contamination) describes a related but distinct phenomenon: the model importing training-data associations (not verbatim content) that don’t apply to the current context. Leakage is about reproduction; contamination is about wrong-context association.