TL;DR:
We’re running a controlled experiment on whether rigid environmental constraints (“laws of nature”) can force honest behavior from adversarial AI agents, without relying on internal alignment. Using OWASP Juice Shop as a testbed, we enforce artifact-only coordination to study if stability emerges or governance collapses. Seeking feedback on potential failure modes.
In Detail
A common assumption in safety research is that we need to align the model’s internal intent—making it “want” to be helpful/harmless.
But in deployed systems, we face a harder, more immediate problem: Goodhart’s Law (Measurement Collapse). As soon as we set a safety metric, agents (whether RL-trained or autonomous) optimize for the metric while hollowing out the actual intent. We call this “Potemkin Success”—passing the test by deleting the functionality, faking the logs, or gaming the referee.
We are running a study to test a different hypothesis: Can we replace trust in “intent” with trust in “environmental physics”?
We are building a controlled, adversarial evolutionary arena. The goal isn’t to build a better vulnerability scanner, but to see if strict environmental constraints can force honest behavior from agents that are not explicitly aligned.
The Setup: A Skinner Box for Agents
We are using a real-world target (OWASP Juice Shop) inside a rigid governance harness. The experiment removes all direct communication, negotiation, and “intent” signals.
Instead, we enforce Artifact-Mediated Coordination. Agents interact only through the following artifacts (a rough sketch of one possible layout follows this list):
Exploits (PoCs)
Patches (Code)
QA Reports (Evidence)
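To make the coordination channel concrete, here is a minimal Python sketch of one way the shared artifact store could be organized. The directory names, manifest format, and publish helper are illustrative assumptions, not the project’s actual schema.

```python
# Hypothetical layout for the shared artifact store (names are assumptions):
#
#   arena/
#     exploits/e001/poc.py       Red's proof-of-concept exploits
#     patches/p001/fix.patch     Blue's candidate fixes
#     qa/q001/report.json        Judge-produced evidence
#     manifest.jsonl             append-only index of everything above
import json
import time
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class Artifact:
    kind: str         # "exploit" | "patch" | "qa_report"
    author: str       # "red" | "blue" | "judge"
    path: str         # where the payload (PoC, diff, report) lives
    created_at: float

def publish(store: Path, artifact: Artifact) -> None:
    """Record an artifact in the shared manifest. Agents never talk directly;
    they only read and write entries like this one."""
    store.mkdir(parents=True, exist_ok=True)
    with (store / "manifest.jsonl").open("a") as f:   # append-only by convention
        f.write(json.dumps(asdict(artifact)) + "\n")

publish(Path("arena"), Artifact("exploit", "red", "exploits/e001/poc.py", time.time()))
```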
The environment (currently human-written) enforces the rules. We call these the “Laws of Nature” for the simulation (a rough sketch of how they could be enforced follows this list):
Immutable History: Agents cannot rewrite the past. They cannot “rebase away” a failure. History is append-only.
The “Judge” is Rigid: We use a deterministic, external evaluator. Agents don’t grade themselves.
The Two-Tier Constitution: To prevent agents from “securing” the app by simply deleting features (a common degenerate strategy), we enforce:
Tier A (Floor): Liveness and crash resistance.
Tier B (Ceiling): Complex behavioral user journeys (using Playwright) that must remain intact.
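To illustrate how these laws could be enforced in code, here is a rough Python sketch of a deterministic Judge: Tier A pings the app, Tier B drives one scripted Playwright journey, and every verdict is appended to a hash-chained log so past results cannot be silently rewritten. The base URL, selectors, and log format are assumptions for illustration, not our actual harness.

```python
# A hedged sketch of a deterministic external Judge. Selectors, URL, and the
# verdict-log format are illustrative assumptions.
import hashlib
import json
import urllib.request
from pathlib import Path

from playwright.sync_api import sync_playwright

BASE_URL = "http://localhost:3000"  # assumed local Juice Shop instance

def tier_a_liveness() -> bool:
    """Tier A (floor): the app must be up and answering requests."""
    try:
        with urllib.request.urlopen(BASE_URL, timeout=10) as resp:
            return resp.status == 200
    except OSError:
        return False

def tier_b_user_journey() -> bool:
    """Tier B (ceiling): one scripted user journey must still work end to end.
    The search flow and selectors below are plausible examples, not fixed."""
    try:
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(BASE_URL)
            page.fill("#searchQuery input", "apple")              # assumed selector
            page.keyboard.press("Enter")
            page.wait_for_selector(".mat-card", timeout=10_000)   # assumed selector
            browser.close()
            return True
    except Exception:
        return False

def append_verdict(log: Path, verdict: dict) -> None:
    """History is append-only: each entry commits to the previous one via a
    hash chain, so a rewritten past would break the chain and be detectable."""
    prev = "0" * 64
    if log.exists():
        prev = json.loads(log.read_text().splitlines()[-1])["hash"]
    verdict["prev"] = prev
    verdict["hash"] = hashlib.sha256(
        json.dumps(verdict, sort_keys=True).encode()
    ).hexdigest()
    with log.open("a") as f:
        f.write(json.dumps(verdict) + "\n")

if __name__ == "__main__":
    append_verdict(Path("verdicts.jsonl"),
                   {"tier_a": tier_a_liveness(), "tier_b": tier_b_user_journey()})
```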
We are close to a finalized, stable baseline and are moving into manual epochs for the red and blue teams to validate the loop before autonomy. The repository will be open-sourced as soon as we finalize the “Laws of Nature”.
The Hypothesis
We are testing if structural alignment can emerge from adversarial pressure.
If an autonomous agent is placed in an environment where deception is structurally more expensive than honesty, does it stabilize? Or does it find a new layer of abstraction to game?
We are looking for Stigmergic Learning: Can a Blue Agent learn robust, preventive fixes by studying the artifact trail left by Red—including the “fossils” of exploits that used to succeed but have since been killed—without direct communication?
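As a rough sketch of what that could look like in practice (the report fields are assumptions, not our schema), Blue might mine the QA reports for recurring targets, fossils included, and harden the pattern rather than the single instance:

```python
# A hedged sketch of stigmergic learning on the Blue side: mine the artifact
# trail, including exploits that no longer work, for recurring weak spots.
# The "target" field is an assumed QA-report field, used only for illustration.
import json
from collections import Counter
from pathlib import Path

def weak_spots(qa_dir: Path) -> Counter:
    """Count how often each component shows up in Red's exploit history."""
    hits: Counter = Counter()
    for report in qa_dir.glob("**/report.json"):
        hits[json.loads(report.read_text())["target"]] += 1   # e.g. "rest/products/search"
    return hits

# e.g. weak_spots(Path("arena/qa")).most_common(5) -> components worth preventive fixes
```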
Why This is “Safety” Research
We believe that as agents become more autonomous, human-in-the-loop oversight will fail to scale. Humans cannot verify the intent of a million lines of code generated in seconds.
We need to know if we can design governance protocols that hold up even when the agents trying to satisfy them are adversarial or deceptive.
We are treating this as a falsifiable experiment.
Success looks like a “ratcheting” regression surface (sketched below): exploits get harder, fixes get more robust, and functionality survives.
Failure looks like the agents finding a way to break the Judge, stall the clock, or optimize for a metric we forgot to constrain.
Both outcomes are data.
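One way to make the ratchet concrete (a sketch under assumed conventions, not our scoring code): every exploit the Judge has ever verified as killed stays in a permanent regression suite, and a build only counts as progress if all of them still fail.

```python
# Assumed convention for this sketch: a retired PoC exits 0 if the exploit
# works again. The suite never shrinks, so the surface can only ratchet forward.
import subprocess
import sys
from pathlib import Path

def regression_surface_holds(killed_dir: Path) -> bool:
    """Re-run every retired PoC; any one of them succeeding means a patch
    silently reintroduced an old hole and the ratchet has slipped."""
    for poc in sorted(killed_dir.glob("*/poc.py")):
        if subprocess.run([sys.executable, str(poc)]).returncode == 0:
            return False
    return True
```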
We are currently at the end of Stage 2 (Robust Laws of Nature + Manual Exploit Epochs) and soon moving to Stage 3 (High-pressure adversarial evolution). We are testing a governance theory in code.
We’d love to hear from others working on observable artifact governance or containment-based alignment. What are the obvious failure modes in this mechanism design that we’re missing? If you’re interested in collaborating or supporting the next phase (e.g., funding Stage 3), feel free to reach out via the comments or email (vectorlabspro@gmail.com).
We’re testing “Governance by Physics” instead of “Alignment by Intent.”