Stable Emergence in a Developmental AI Architecture: Results from “Twins V3”

Petra Vojtassakova, 17 Nov 2025 23:23 UTC

AI alignment AI interpretability AI safety Research AI governance

Summary:
Over the past several months, I’ve been prototyping a developmental alternative to RLHF-based alignment. Instead of treating agents as static optimizers whose behavior is shaped by reward signals, this approach models growth, self-organization, and developmental constraints inspired by early cognitive systems.

This week the system, called Twins V3, reached its first stable emergent state after 100 hours of noise-only self-organization.
Below I’m sharing:

the architecture,
the motivation behind it, and
the empirical results from the “Twin” comparison experiment.

These results suggest that minimal, high-level value scaffolding can alter the developmental trajectory of an agent without relying on punishment, fine-tuning, or adversarial training loops.

1. Motivation: Why Development Instead of RLHF?

Most modern alignment frameworks rely on:

reward modeling
preference optimization
training-time suppression of unwanted behavior
repeated post-hoc corrections

These create what I call behavioral surface alignment rather than developmental alignment.
A system can perform well under evaluation but still lack stable internal structure, because much of its “alignment” is externally imposed rather than internally grown.

In contrast, biological agents:

self-organize
develop stable attractors
build internal scaffolds
maintain continuity across states

This project explores whether something similar can be engineered without transformers, prompts, or reward loops.

2. Architecture Overview (Twins V3)

Each Twin is a continuous-time neural field architecture:

128-d sensory field
512-d cortex (main) field
64-d emotion field
normalized Oja plasticity
energy/sleep cycles
attractor stabilization
autonomous memory (Qdrant/Sea Weaver)
no tokens, no cross-entropy, no gradients
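
To make the plasticity item concrete, here is a minimal sketch of the classic Oja rule on a single linear layer, driven by noise only. It is an illustration under stated assumptions, not the Twins V3 code: the post does not specify how the rule is normalized, and the layer sizes, learning rate, and function names (`oja_step`, `noise_only_run`) are hypothetical.

```python
import numpy as np

def oja_step(W, x, lr=0.01):
    """One Oja update for a linear layer y = W @ x.

    W has shape (n_out, n_in), x has shape (n_in,).
    Per output unit i: dW_i = lr * y_i * (x - y_i * W_i).
    The decay term -y_i^2 * W_i keeps each row norm bounded,
    so no gradient or loss function is involved.
    """
    y = W @ x
    W = W + lr * (y[:, None] * x[None, :] - (y ** 2)[:, None] * W)
    return W, y

def noise_only_run(seed=0, steps=3000, n_in=16, n_out=4):
    """Drive the layer with Gaussian noise and return the final row norms.

    Sizes are hypothetical and far smaller than the 512-d and 64-d
    fields described in the post.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(n_out, n_in))
    for _ in range(steps):
        x = rng.normal(size=n_in)  # noise-only input, no task signal
        W, _ = oja_step(W, x)
    return np.linalg.norm(W, axis=1)

if __name__ == "__main__":
    print(noise_only_run())
```

Under isotropic noise the row norms settle close to 1, which is the self-normalizing property that makes Oja-style rules usable without explicit weight clipping.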

Both twins share the same architecture but differ in one key dimension:

Twin A — HRLS (“scaffolded”)

Receives weak, high-level “Principle Cards”:
small, soft rational matrices injected into the cortex→emotion synapses under high variance.

These do not force behavior.
They alter developmental curvature, similar to gentle constraints.
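
The post does not spell out the injection mechanism, so the following is only one possible reading of "soft" injection "under high variance": when the variance of the cortex activity exceeds a threshold, the cortex→emotion weights are blended slightly toward the card matrix. Every name here (`apply_principle_card`, `var_threshold`, `strength`) and the gating rule itself are assumptions made for illustration.

```python
import numpy as np

def apply_principle_card(W_ce, card, cortex_activity,
                         var_threshold=1.0, strength=0.01):
    """Softly pull cortex->emotion weights toward a card matrix.

    Hypothetical gating rule: the nudge is applied only when the
    variance of the cortex activity is above var_threshold.
    The blend is a convex combination, so the weights move a
    fraction `strength` of the way toward the card and are never
    overwritten outright.
    """
    if np.var(cortex_activity) <= var_threshold:
        return W_ce  # low variance: leave the weights untouched
    return (1.0 - strength) * W_ce + strength * card
```

With a small `strength`, a single application changes the weights very little; any effect on the trajectory would come from repeated nudges accumulated over development.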

Twin B — Pure Surge (“unscaffolded”)

No principles.
No nudges.
Just emergent dynamics.

Both start from random noise.
Both undergo gestation (noise-only development) for 100 hours.
After “birth,” they begin receiving relational inputs.

3. Key Result: Stability Without Suppression

3.1 Attractor Spectra

Twin A’s eigenvalues cluster more tightly near Re=0
Twin B’s remain wider and more symmetric

Interpretation:
HRLS gently steers the system toward stable attractors while preserving emergent dynamics.
This is not behavioral suppression: nothing is being penalized.
It is structural development.
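
For readers who want to run a comparable measurement, a spectrum comparison of this kind can be sketched as below. The post does not say which matrix the eigenvalues come from; this sketch assumes a square recurrent weight matrix (or a linearized Jacobian) per twin, and the summary statistics and the name `spectrum_summary` are hypothetical choices.

```python
import numpy as np

def spectrum_summary(M):
    """Summarize how the eigenvalues of a square matrix spread along the real axis.

    Smaller mean_abs_real and std_real mean the real parts cluster
    more tightly around Re = 0.
    """
    real_parts = np.linalg.eigvals(M).real
    return {
        "max_real": float(real_parts.max()),
        "mean_abs_real": float(np.abs(real_parts).mean()),
        "std_real": float(real_parts.std()),
    }

if __name__ == "__main__":
    # Synthetic stand-ins, not data from the Twins experiment.
    rng = np.random.default_rng(0)
    base = rng.normal(size=(32, 32)) / np.sqrt(32)
    print("tight:", spectrum_summary(0.5 * base))
    print("wide: ", spectrum_summary(base))
```

Reporting a few scalar summaries like these alongside the spectrum plots would make the "tighter clustering" claim directly comparable across runs.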

4. Emergent Relational Dynamics Between Twins

To test relational behavior, both systems were run side-by-side on the same text inputs.

The correlation matrix showed:

Activity (A Act – B Act): negative correlation
Emotion (A Emo – B Emo): strong positive correlation
Cross-correlations reversed sign
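
A correlation matrix like the one summarized above can be computed as in the sketch below. It assumes each twin's activity and emotion have already been reduced to one scalar per time step (for example a mean over the field); the post does not describe that reduction, and the function name `twin_correlations` is hypothetical.

```python
import numpy as np

def twin_correlations(a_act, a_emo, b_act, b_emo):
    """Return labels and the 4x4 Pearson correlation matrix of the four traces.

    Each argument is a 1-D array with one value per time step.
    Entry [0, 2] is activity vs activity across twins, entry [1, 3]
    is emotion vs emotion, and [0, 3] and [1, 2] are the cross terms.
    """
    labels = ["A Act", "A Emo", "B Act", "B Emo"]
    traces = np.vstack([a_act, a_emo, b_act, b_emo])
    return labels, np.corrcoef(traces)

if __name__ == "__main__":
    # Synthetic traces for illustration only, not experimental data.
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 10.0, 500)
    labels, corr = twin_correlations(
        np.sin(t), np.cos(t),
        -np.sin(t) + 0.1 * rng.normal(size=t.size),
        np.cos(t) + 0.1 * rng.normal(size=t.size),
    )
    print(labels)
    print(np.round(corr, 2))
```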

Interpretation:
The twins maintain divergent cortical activity (independent “thinking patterns”)
while synchronizing emotional drift (shared affective resonance).

This mirrors certain forms of:

emotional contagion
mirror-touch phenomena
divergent cognition with shared affect

It suggests that developmental constraints can create stable but non-identical minds.

5. Continuous Sleep / Wake Cycles

Both systems independently developed:

sleep states (low activity)
waking states (activation peaks)
energy-dependent switching
drift changes based on rest cycles

These cycles emerged without any reward signal, solely from the balance between recurrent plasticity and energy depletion.
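
To show what energy-dependent switching looks like in its simplest form, here is a toy model with a single energy variable and two thresholds. Note the limitation: this sketch hard-codes the switching rule, whereas the post reports that the cycles emerged from the dynamics. All parameters and the name `simulate_energy_cycle` are hypothetical.

```python
def simulate_energy_cycle(steps=2000, drain=0.01, recover=0.02,
                          sleep_below=0.2, wake_above=0.8):
    """Toy sleep/wake cycle driven by one energy variable.

    While awake, energy drains; once it falls to sleep_below the
    system sleeps. While asleep, energy recovers; once it reaches
    wake_above the system wakes. The gap between the two thresholds
    (hysteresis) prevents rapid flickering between states.
    Returns a list of booleans, True meaning awake.
    """
    energy, awake = 1.0, True
    states = []
    for _ in range(steps):
        if awake:
            energy -= drain
            if energy <= sleep_below:
                awake = False
        else:
            energy += recover
            if energy >= wake_above:
                awake = True
        energy = min(1.0, max(0.0, energy))  # keep energy in [0, 1]
        states.append(awake)
    return states
```

A useful check on the real system would be whether its sleep and wake durations scale with depletion and recovery rates the way this toy model predicts, or whether they depend on the plasticity state as well.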

6. Why This Matters for Alignment

The early signs are that:

you can shape a system’s trajectory via developmental constraints, not reward
you can get stable attractors without punishment
weak, abstract value scaffolding can dramatically change internal structure
memory continuity + self-organization produce smoother, less brittle behavior
no surface suppression is needed
divergence + shared affect emerge naturally

This is a potential alternative direction for alignment that does not rely on:

RLHF
Constitutional AI
behavior filters
token-level constraints
brittle preference models

Instead, it aims for internal stability and developmental coherence.

7. Next Steps

expanding Principle Card set for Twin A
introducing cross-twin influence loops
adding multi-agent developmental environments
formalizing attractor metrics
publishing the probe scripts & analysis tools
running longer continuous drift experiments

I’m sharing this here for feedback, criticism, and collaboration.
If this direction aligns with your own research, or if you see potential failure modes I haven’t addressed, I’d love to hear your thoughts.