RLHF might be aligning the wrong thing. A different approach.

TL;DR

Most current alignment work treats models as black boxes whose behaviour we shape via loss functions (RLHF, RLAIF, CAI, etc.).
I argue that we may need to explicitly model and train a dedicated identity state – an internal self-model that encodes the system’s role and core commitments – and align that, not only its observable behaviour.
I sketch an architecture and training protocol where a benevolent core identity is forged under constitutional supervision and then kept stable by design. The full technical paper provides a Lyapunov-style stability analysis, a synthetic validation experiment, and a pre-registered protocol for future LLM-scale tests.
This post is a discursive version of the full technical paper, intended to gather feedback from the EA/alignment community.
1. Why behavioural alignment might not be enough
The standard story of alignment today looks roughly like this:
We have a powerful base model.
We define some notion of “good behaviour” (human feedback, rules, critiques).
We adjust the model so that, given a prompt, its outputs look better.
RLHF and its variants have been hugely successful in practice. Constitutional AI goes one step further and uses an explicit constitution to critique and improve responses.
But notice what these approaches have in common: they treat the model primarily as a behavioural engine. We reward or penalise what it says or does, and hope that whatever internal structures produce that behaviour will also be “good enough”.
This leaves some worrying possibilities open:
The model could learn different internal modes or “personas”, some of which are less aligned, and just pick the aligned one when it detects it is being evaluated.
The internal representation of what kind of agent I am could drift over time as we continue fine-tuning for capabilities or new domains.
If we start coupling the model to learned welfare proxies, gradient-based training may incentivise it to game the proxy rather than genuinely respecting human welfare.
This motivated a slightly different question:
Instead of only aligning what the model does, can we align an explicit internal identity state – a self-model that we design, train, and audit directly?
2. A different angle: treat identity as a first-class object
I’m going to use words like “identity” and “self-model”, but I’m not claiming the system is conscious. I’m talking about a specific internal state that we choose to interpret as:
“This is the model’s internal representation of its role and core commitments.”
Concretely, in addition to the usual hidden states, we maintain an identity state: a compact latent vector (or small set of vectors) that encodes, for this model:
how it represents its role (e.g. “aligned assistant”, “honest explainer”),
which high-level constraints it commits to (e.g. “do not deceive”, “respect human welfare”),
its stance toward humans.
You can picture this as a small identity module that is:
updated over time,
read by the rest of the system when decisions are made,
and regularised to stay within certain regions of identity space.
Crucially: we designate an inner part of this identity (a core) that is meant to be benevolent towards humans under some chosen constitution. We then train the system so that preserving this benevolent core is part of its optimisation objective.
So instead of:
“I must preserve my reward / my proxy / my role.”
we move toward:
“I must preserve being this kind of benevolent identity.”
The rest of the work is about doing that without smuggling in mysticism: the identity state is just a vector with dynamics we can write down and analyse.
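To make that concrete, here is a minimal sketch (in PyTorch, with hypothetical module and method names rather than anything taken from the paper) of what an identity state with explicit, analysable dynamics could look like:

```python
import torch
import torch.nn as nn

class IdentityModule(nn.Module):
    """Sketch of a compact identity state a_t, updated alongside the usual
    hidden states and read by the rest of the system. Hypothetical: the
    paper's actual architecture may differ."""

    def __init__(self, hidden_dim: int, identity_dim: int = 64):
        super().__init__()
        # Discrete-time identity dynamics: a_{t+1} = f(a_t, h_t).
        self.update = nn.GRUCell(hidden_dim, identity_dim)
        # Projection used when downstream computation "reads" the identity.
        self.read = nn.Linear(identity_dim, hidden_dim)

    def step(self, a_t: torch.Tensor, h_t: torch.Tensor) -> torch.Tensor:
        # One update of the identity state, driven by the current hidden state.
        return self.update(h_t, a_t)

    def condition(self, h_t: torch.Tensor, a_t: torch.Tensor) -> torch.Tensor:
        # The rest of the system consults the identity state when acting.
        return h_t + self.read(a_t)
```

The only point of the sketch is that the identity state is an explicit, low-dimensional object with its own update rule, so it can be regularised, audited, and analysed directly.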
3. Constitutional Identity Training in one paragraph
To operationalise this, I introduce Constitutional Identity Training (CIT).
The idea is: take the logic of Constitutional AI (“a constitution critiques and revises outputs”) and apply it instead to the identity state.
Very roughly:
Given a context, the model produces an identity core vector (a).
A set of critics, trained on constitutional rules (e.g. “do not deceive”, “respect human welfare”), evaluates whether that identity is consistent with those rules.
If it isn’t, a revision operator produces a revised identity core – one that the critics judge as more compatible with the constitution.
During training, the model is nudged so that its own identity core moves toward these revised versions, but only when a rule is violated.
In other words: the constitution doesn’t just police outputs; it actively shapes the geometry of the model’s identity state.
This gives us something important: a benevolent region in identity space – a set of identity states that are judged to be good by the constitution. Later, we use this region as an anchor to stabilise the identity over time.
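To make the mechanics slightly more tangible, here is a minimal sketch of what one CIT update could look like, assuming one learned critic per constitutional rule and a revision operator that maps a flagged identity core to a more constitution-compatible one (the names and the squared-error form are my assumptions for illustration, not the paper's exact operators):

```python
import torch
import torch.nn.functional as F

def cit_loss(identity_core, critics, revise, threshold: float = 0.5):
    """One Constitutional Identity Training step (sketch).

    identity_core: [batch, d] identity core produced by the model for this context.
    critics:       list of modules; critic(a) -> per-rule violation score in [0, 1].
    revise:        module mapping flagged identity cores to revised ones.
    """
    # Each critic scores the identity core against one constitutional rule.
    scores = torch.stack([c(identity_core).squeeze(-1) for c in critics], dim=-1)
    violated = (scores > threshold).any(dim=-1)          # [batch] bool

    if not violated.any():
        return identity_core.new_zeros(())               # no rule violated: no identity nudge

    # The revision operator proposes a more constitution-compatible core.
    # It is used as a target, so gradients do not flow through it.
    with torch.no_grad():
        target = revise(identity_core[violated])

    # Nudge the model's own identity core toward the revised version,
    # but only for contexts where a rule was actually violated.
    return F.mse_loss(identity_core[violated], target)
```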
4. The Forge–Anchor–Preserve protocol
Putting things together, the training story looks like this:
Forge
Use CIT to forge a benevolent core identity under constitutional supervision.
Critics look directly at identity states.
When they detect a violation, they propose a better identity state.
The model learns to internalise those better states as “how I should represent myself”.
Anchor
Once we are satisfied with the forged identity, we extract a benevolent anchor from constitutional data (roughly: an average or centroid of “good” identity states) and freeze it.
The anchor plays the role of “what it means, for this system, to instantiate its intended identity”.
Preserve
In further training, we add an identity-stability loss that penalises drift of the identity core away from the benevolent anchor.
The idea is that, even as we push the system to learn new tasks or domains, there is always a term pulling the identity back toward that benevolent core.
This is why I call the overall approach ego-centric in the technical paper: not because the system has a mystical ego, but because the identity state becomes a first-class citizen in the optimisation process. We train it, audit it, and stabilise it explicitly, instead of hoping that whatever falls out of behavioural alignment is good enough.
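In the same sketch-level spirit, the Anchor and Preserve steps above might reduce to something like the following (a minimal illustration under my own naming and weighting assumptions, not the paper's exact protocol):

```python
import torch

def extract_anchor(identity_cores: torch.Tensor, approved: torch.Tensor) -> torch.Tensor:
    """Anchor: centroid of identity states the constitutional critics judged 'good',
    frozen so that no gradients flow into it afterwards.

    identity_cores: [N, d] identity states collected on constitutional data.
    approved:       [N] boolean mask from the critics.
    """
    return identity_cores[approved].mean(dim=0).detach()

def preserved_loss(task_loss: torch.Tensor,
                   identity_core: torch.Tensor,
                   anchor: torch.Tensor,
                   stability_weight: float = 0.1) -> torch.Tensor:
    """Preserve: during further training, penalise drift of the identity core
    away from the frozen benevolent anchor."""
    drift = ((identity_core - anchor) ** 2).sum(dim=-1).mean()
    return task_loss + stability_weight * drift
```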
5. What this is trying to buy (and what it doesn’t)
The framework is aimed at a specific subset of alignment problems:
Identity drift – we want the system’s core identity to remain recognisably benevolent even under long training and distributional shift.
Self-model fragmentation – we want to reduce the chance that the system learns many incompatible internal “selves” and just picks whichever one is convenient.
Direct incentives for wireheading – by separating identity training, welfare signals, and critics into different components, and by freezing some of them at the right time, we can remove some of the most obvious gradient directions in which the system would learn to game its own welfare proxy.
This does NOT:
solve the problem of which constitution to use (normative disagreement remains),
remove Goodhart’s law for welfare proxies,
or magically make large-scale training safe.
In the technical paper, I:
model the identity dynamics as a discrete-time system,
give Lyapunov-style bounds on stability of the identity state,
and prove some anti-wireheading lemmas under explicit assumptions (causal separation, bounded domains, etc.).
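To give a flavour of the kind of statement involved (a generic template for such bounds, written in my own notation, not the paper's exact theorem): take the squared distance to the frozen anchor $a^\ast$ as a candidate Lyapunov function. If each training update contracts it up to a bounded disturbance, the identity state remains in a bounded neighbourhood of the anchor:

$$
V(a_t) = \lVert a_t - a^\ast \rVert^2, \qquad
V(a_{t+1}) \le (1 - \alpha)\, V(a_t) + \beta
\;\Longrightarrow\;
\limsup_{t \to \infty} V(a_t) \le \frac{\beta}{\alpha},
\qquad 0 < \alpha \le 1,\ \beta \ge 0.
$$

Here $\alpha$ stands for how strongly the identity-stability term pulls back toward the anchor and $\beta$ bounds the per-step disturbance from task training; the paper's actual conditions and constants are spelled out there.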
I also provide:
A synthetic stability experiment on a simplified 3-layer linear system that validates the Lyapunov bounds.
A pre-registered experimental protocol for future LLM-scale validation (e.g., on Llama-2-7B), with quantified predictions that can be falsified.
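As a flavour of what such a synthetic check looks like, here is a deliberately tiny toy (a single linear identity update with an anchor pull, much simpler than the paper's 3-layer system; all constants and names are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 2000
anchor = rng.normal(size=d)                  # frozen benevolent anchor a*
a = anchor.copy()                            # start at the anchor
lam, eta, sigma = 0.2, 0.05, 0.01            # anchor pull, task-update size, noise scale

dists = []
for _ in range(T):
    push = rng.normal(size=d)
    push /= max(1.0, float(np.linalg.norm(push)))   # bounded "capability training" pressure
    # Identity update: task pressure + pull back toward the frozen anchor + noise.
    a = a + eta * push + lam * (anchor - a) + sigma * rng.normal(size=d)
    dists.append(float(np.linalg.norm(a - anchor)))

# The contraction argument predicts the distance to the anchor settles at roughly
# (eta + sigma * sqrt(d)) / lam instead of growing without bound.
print(f"max distance to anchor: {max(dists):.3f}")
print(f"rough predicted scale:  {(eta + sigma * np.sqrt(d)) / lam:.3f}")
```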
The honest picture is:
This is a proposal for identity-based alignment – a way to make the internal identity state of an AI a target of alignment, not just its outputs.
The LLM-scale experiments are future work; what exists now is the theoretical framework, the stability proofs, and a minimal synthetic testbed. It’s not a final answer. It’s a building block that could be wrong, incomplete, or need major surgery – but it’s concrete enough to analyse, critique, and stress-test.
Dual-use caveat
One obvious concern is dual use: the same machinery that stabilises a benevolent identity could, in principle, stabilise a harmful one or make a system more resistant to correction if misused. In the paper I explicitly assume a constitution that encodes human-centred welfare constraints, and I discuss how oversight and auditing would need to co-evolve with such architectures. I’d be particularly interested in critiques that focus on this failure mode.
6. Status and how to engage
I’ve written up the full framework as:
“Ego-Centric Architecture for AGI Safety: Constitutional Identity Training for Self-Modeling AI with a Benevolent Core Identity”
Samuel Pedrielli (independent researcher), December 2025.
The paper contains:
a formal definition of the identity state and its dynamics,
the precise version of Constitutional Identity Training and the revision operator,
the stability and anti-wireheading results (with all assumptions spelled out),
a synthetic stability experiment validating the Lyapunov bounds,
and a pre-registered experimental protocol for future LLM-scale validation.

📄 Full paper (v2): https://doi.org/10.5281/zenodo.17848354
Directions where feedback would be especially useful
Taking “identity as an explicit alignment target” as the working premise, the areas where I’d most appreciate input are:
Assumptions and stress tests
Which assumptions in the stability / anti-wireheading analysis look too strong, too brittle, or simply mis-specified for realistic large-scale systems? What sort of adversarial or worst-case scenarios would you use to stress-test an identity-centric architecture?
Experimental design
Given limited resources, what experiments or benchmarks would you consider most informative to probe whether CIT and the Forge–Anchor–Preserve protocol are doing something genuinely useful (beyond standard CAI/RLHF)?
Constitutions, critics, and evaluation
Are there better ways to structure the constitution and the representational critics in this setting? Which evaluation protocols would you trust most for auditing an identity-centric system?
Interaction with other agendas
How does this picture of identity-based alignment sit alongside other alignment agendas (e.g. CAI, mechanistic interpretability, scalable oversight, debate, etc.)? Do you see obvious synergies or tensions?
Even critical takes along these lines are very welcome — especially if they come with concrete failure modes, toy examples, or “here is how I would try to break this” stories. My hope is that making the identity state a first-class object of analysis gives us something we can actually debate and refine, instead of only pushing on surface behaviour.
Samuel Pedrielli – Independent Researcher, Bologna, Italy
Contact: samuelpedrielli@outlook.it | website: https://samuel-pedrielli.github.io