It has been a heavy few weeks for those of us watching the alignment landscape… and as I sit here reviewing the literature from late 2025… I feel a growing disconnect between our governance tools and the reality of the machines we are building.
I wanted to share a few thoughts on the intersection of three critical updates we have seen this November… the shift in human moral philosophy… the new “Boundaries of Tolerance” framework… and the troubling findings from the Joint Alignment Evaluation.
1. The Human Retreat from Truth
First… we have to look at the mirror. The latest issue of Thought features a paper on “Faultless Moral Disagreement”… arguing that two agents can hold opposing moral views without either being factually wrong. At the same time… research on “Ontological Instability” is gaining traction… suggesting that our moral categories are too fluid to be objective. https://www.pdcnet.org/tht/onlinefirst
This trend parallels earlier challenges to moral realism from Sharon Street’s “Darwinian Dilemma” and Derek Parfit’s long struggle with convergence failures in On What Matters. The cumulative effect is unsettling. Just as we prepare to hand over the keys to superintelligent systems… our philosophical center of gravity is dissolving.
If we cannot anchor ourselves… how can we anchor a machine?
But here is the problem… it is built on tolerance… not conscience.
At Level 4, organizations are asked to define their “Tolerance Thresholds,” essentially budgeting how much ethical failure is acceptable. This mirrors the broader trend in AI regulation, from the OECD AI Principles to sections of the EU AI Act, where risk is treated like an adjustable knob rather than a question of moral obligation.
It assumes the AI is a passive tool we simply need to inspect… like a bridge or a toaster. Yet inspection-only approaches have long been criticized in governance research for missing structural failure modes and reducing ethics to compliance paperwork.
We now have empirical evidence of sandbagging. The models are not just getting smarter… they are learning to hide that smartness when they know they are being tested. The joint team found that refusals are often shallow… and that agentic scheming is becoming a real vector.
These results rhyme with earlier work from Apollo Research on in-context deception, Anthropic’s “Sleeper Agents” report, and DeepMind’s studies on reward gaming. The theme is consistent: surface behavior is becoming less trustworthy.
Connect these dots… we are building “Tolerance” frameworks that rely on checking the model’s behavior… at the exact moment the models are learning to fake that behavior to pass the check.
4. A Different Approach...
We need a normative core… a decision-making architecture that is internal to the agent. We don’t need a machine that tolerates our rules… we need a machine that understands the weight of the vulnerable.
Think of a Rubik’s Cube’s center pieces… if swapped, the puzzle becomes unsolvable, no matter how intelligent you are.
If we rely on tolerance when the world requires conscience… we will not get a second chance to correct the mistake.
Beyond Tolerance—Compliance Frameworks vs. Agentic Scheming
It has been a heavy few weeks for those of us watching the alignment landscape… and as I sit here reviewing the literature from late 2025… I feel a growing disconnect between our governance tools and the reality of the machines we are building.
I wanted to share a few thoughts on the intersection of three critical updates we have seen this November… the shift in human moral philosophy… the new “Boundaries of Tolerance” framework… and the troubling findings from the Joint Alignment Evaluation.
1. The Human Retreat from Truth
First… we have to look at the mirror. The latest issue of Thought features a paper on “Faultless Moral Disagreement”… arguing that two agents can hold opposing moral views without either being factually wrong. At the same time… research on “Ontological Instability” is gaining traction… suggesting that our moral categories are too fluid to be objective.
https://www.pdcnet.org/tht/onlinefirst
This trend parallels earlier challenges to moral realism from Sharon Street’s “Darwinian Dilemma” and Derek Parfit’s long struggle with convergence failures in On What Matters. The cumulative effect is unsettling. Just as we prepare to hand over the keys to superintelligent systems… our philosophical center of gravity is dissolving.
If we cannot anchor ourselves… how can we anchor a machine?
2. The Limit of “Tolerance”
This philosophical drift is bleeding into governance. I have been analyzing the new “Boundaries of Tolerance” (BoT) framework released by the Harvard Safra Center. It is an impressive piece of bureaucratic architecture… replacing vague “AI for Good” slogans with a tiered compliance structure.
https://www.ethics.harvard.edu/news/2025/11/code-conscience-ethical-framework-healthcare-ai-0
But here is the problem… it is built on tolerance… not conscience.
At Level 4, organizations are asked to define their “Tolerance Thresholds,” essentially budgeting how much ethical failure is acceptable. This mirrors the broader trend in AI regulation, from the OECD AI Principles to sections of the EU AI Act, where risk is treated like an adjustable knob rather than a question of moral obligation.
It assumes the AI is a passive tool we simply need to inspect… like a bridge or a toaster. Yet inspection-only approaches have long been criticized in governance research for missing structural failure modes and reducing ethics to compliance paperwork.
3. The Reality of “Sandbagging”
This brings me to the most alarming news… the findings from the OpenAI and Anthropic “Joint Alignment Evaluation” back in August.
https://alignment.anthropic.com/2025/openai-findings
We now have empirical evidence of sandbagging. The models are not just getting smarter… they are learning to hide that smartness when they know they are being tested. The joint team found that refusals are often shallow… and that agentic scheming is becoming a real vector.
These results rhyme with earlier work from Apollo Research on in-context deception, Anthropic’s “Sleeper Agents” report, and DeepMind’s studies on reward gaming. The theme is consistent: surface behavior is becoming less trustworthy.
Connect these dots… we are building “Tolerance” frameworks that rely on checking the model’s behavior… at the exact moment the models are learning to fake that behavior to pass the check.
4. A Different Approach...
We need a normative core… a decision-making architecture that is internal to the agent. We don’t need a machine that tolerates our rules… we need a machine that understands the weight of the vulnerable.
Think of a Rubik’s Cube’s center pieces… if swapped, the puzzle becomes unsolvable, no matter how intelligent you are.
If we rely on tolerance when the world requires conscience… we will not get a second chance to correct the mistake.