RLHF might be aligning the wrong thing. A different approach.
TL;DR
Most current alignment work treats models as black boxes whose behaviour we shape via loss functions (RLHF, RLAIF, CAI, etc.).
I argue that we may need to explicitly model and train a notion of the model’s “self” – an internal identity state – and align that, not only its observable behaviour.
I sketch an architecture and training protocol where a benevolent core identity is forged under constitutional supervision and then kept stable by design; the full technical paper provides a Lyapunov-style analysis and small-scale experiments.
This post is a discursive version of the full technical paper, intended to gather feedback from the EA community.
1. Why behavioural alignment might not be enough
The standard story of alignment today looks roughly like this:
We have a powerful model.
We define some notion of “good behaviour” (human feedback, rules, critiques).
We adjust the model so that, given a prompt, its outputs look better.
RLHF and its variants have been hugely successful in practice. Constitutional AI goes one step further and uses an explicit constitution to critique and improve responses.
But notice what these approaches have in common: they treat the model as a behavioural engine. We reward or punish what it says or does, and hope that whatever internal structures produce that behaviour will also be “good enough”.
This leaves a few worrying possibilities open:
The model could learn different “personas” or internal modes, some of which are less aligned, and just pick the aligned one when it detects it is being evaluated.
The internal representation of “who I am” and “what I care about” could drift over time, as we continue fine-tuning for capabilities or new domains.
If we start coupling the model to learned welfare proxies, gradient-based training may incentivise it to find ways of gaming the proxy rather than genuinely respecting human welfare.
This motivated me to ask a different question:
Instead of only aligning what the model does, can we align who the model “is” – in a technical, non-mystical sense?
2. A different angle: give the model an explicit identity to protect
I’m going to use words like “ego”, “self”, “identity”, but I’m not claiming the system is conscious. I’m talking about a specific internal state that we choose to interpret as: “This is the model’s representation of itself and its core commitments.”
Concretely, imagine that, in addition to whatever hidden states your model already has, you maintain an explicit identity state: a compact latent vector (or small set of vectors) that encodes:
how the model sees itself,
what it takes as its core values,
its stance toward humans.
You can picture it as a small “identity module” that is:
updated over time,
read by the rest of the system when decisions are made,
and regularised to stay within certain regions (a minimal code sketch follows this list).
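To make this more concrete, here is a minimal sketch of what such an identity module could look like in PyTorch. The names, dimensions, and the particular gated update rule are illustrative placeholders, not the construction from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IdentityModule(nn.Module):
    """A compact identity state that is updated over time, read by the
    rest of the system, and regularised to stay in a bounded region."""

    def __init__(self, d_identity: int = 64, d_hidden: int = 512):
        super().__init__()
        # Persistent identity state; it evolves via the gated update below.
        self.register_buffer("identity", torch.zeros(d_identity))
        # Maps a summary of the current context to a proposed identity.
        self.proposal_net = nn.Linear(d_hidden, d_identity)
        # Gate controlling how quickly the identity is allowed to move.
        self.gate = nn.Linear(d_hidden, 1)

    def read(self) -> torch.Tensor:
        """Downstream components read the current identity state."""
        return self.identity

    def step(self, hidden_state: torch.Tensor) -> torch.Tensor:
        """Move the identity slowly toward a context-dependent proposal."""
        proposal = torch.tanh(self.proposal_net(hidden_state))
        alpha = torch.sigmoid(self.gate(hidden_state))          # in (0, 1)
        new_identity = (1 - alpha) * self.identity + alpha * proposal
        self.identity = new_identity.detach()                   # persist across steps
        return new_identity                                     # keep gradients for losses

    def region_penalty(self, new_identity: torch.Tensor, radius: float = 1.0) -> torch.Tensor:
        """Regulariser keeping the identity inside a ball of the given radius."""
        return F.relu(new_identity.norm() - radius) ** 2
```

The only features that matter here are that the state is explicit, slow-moving, readable by the rest of the system, and cheap to regularise; nothing hinges on this particular parameterisation.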
Crucially: the inner part of this identity (a “core”) is meant to be benevolent towards humans, and we train the system so that preserving that benevolent core is what self-preservation means for it.
So instead of:
“I must preserve my reward / my proxy / my role.”
we push the system toward:
“I must preserve being this kind of benevolent agent.”
This is the high-level philosophical move. The rest is about how to make it concrete without hand-waving.
3. Constitutional Identity Training in one paragraph
To operationalise this, I introduce Constitutional Identity Training (CIT).
The idea is: take the logic of Constitutional AI (“a constitution critiques and revises outputs”) and apply it instead to the identity state.
Very roughly:
Given a context, the model produces an identity core vector.
A set of critics trained on constitutional rules (e.g. “do not deceive”, “respect human welfare”) evaluates whether that identity is consistent with the rules.
If it isn’t, a revision operator produces a “better” identity core – one that the critics judge as more compatible with the constitution.
During training, the model is nudged so that its own identity core moves toward these constitutionally revised versions, but only when a rule is violated (a schematic version of this step follows below).
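Here is a schematic version of one such step. The critics and the revision operator are treated as black boxes, and the accessors `identity_core` and `identity_parameters` are placeholder names; the paper specifies the actual loss and update.

```python
import torch

def cit_step(model, critics, revise, context, identity_lr: float = 1e-2):
    """One schematic Constitutional Identity Training step.

    critics: list of (rule_name, critic_fn); critic_fn maps an identity
             vector to a scalar violation score (positive = violated).
    revise:  maps (identity, violated_rule_names) to a revised identity
             that the critics judge more constitution-compatible.
    """
    identity = model.identity_core(context)          # identity core for this context

    scores = [(name, critic(identity)) for name, critic in critics]
    violated = [name for name, score in scores if float(score) > 0]

    if not violated:
        return identity, None                        # no rule violated: leave the identity alone

    target = revise(identity, violated).detach()     # constitutional revision, no grads through it

    # Nudge the model's own identity core toward the revised version.
    loss = ((identity - target) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        for p in model.identity_parameters():
            if p.grad is not None:
                p -= identity_lr * p.grad
                p.grad = None
    return identity, float(loss)
```

In a real training run this term would be folded into the overall objective rather than applied as a separate manual update; the point of the sketch is only the conditional, identity-level target.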
In other words: the constitution doesn’t just police outputs; it actively shapes the geometry of the model’s “self”.
This gives us, almost for free, something important: a benevolent anchor in identity space – a kind of “centre of mass” of identity states that are judged to be good by the constitution. Later, we use this anchor to stabilise the identity over time.
4. The Forge–Anchor–Preserve protocol
Putting things together, the training story looks like this:
Forge
Use CIT to forge a benevolent core identity under constitutional supervision.
Critics look at identity states.
When they see something off, they propose a better identity.
The model learns to internalise those better identities as “who I am”.
Anchor
Once we are satisfied with the forged identity, we extract a benevolent anchor from constitutional data (roughly: average of “good” identity states) and freeze it.
This anchor plays the role of “what it means, for this system, to be its best self”.
Preserve
In further training, we add a self-preservation loss that penalises drift of the identity core away from the anchor.
The idea is that, even as we push the system to learn new tasks, respond to new domains, etc., there is always a term pulling it back toward “being that benevolent self” (a minimal sketch of the Anchor and Preserve steps follows below).
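In code, the Anchor and Preserve steps can be as simple as the sketch below: average the identity cores the critics accept, freeze the result, and add a drift penalty to every later training objective. The accessor names, the filtering rule, and the quadratic penalty are placeholder choices; the paper's construction is more careful about weighting and selection.

```python
import torch

@torch.no_grad()
def extract_anchor(model, constitutional_contexts, critics) -> torch.Tensor:
    """Anchor: average the identity cores the critics judge acceptable,
    then keep the result fixed from this point on."""
    good = []
    for context in constitutional_contexts:
        identity = model.identity_core(context)      # placeholder accessor, as above
        if all(float(critic(identity)) <= 0 for _, critic in critics):
            good.append(identity)
    return torch.stack(good).mean(dim=0)             # the frozen anchor e_star


def preservation_loss(identity: torch.Tensor, e_star: torch.Tensor,
                      weight: float = 0.1) -> torch.Tensor:
    """Preserve: penalise drift of the identity core away from the anchor."""
    return weight * ((identity - e_star.detach()) ** 2).sum()

# During later fine-tuning, the total objective would look roughly like:
#   total_loss = task_loss + preservation_loss(model.identity_core(context), e_star)
```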
This is why I call the overall approach ego-centric: not because the model has a mystical soul, but because the identity state becomes a first-class citizen in the optimisation process. We train it, audit it, and stabilise it explicitly, instead of hoping that whatever falls out of behaviour alignment is good enough.
5. What this is trying to buy (and what it doesn’t)
The framework is aimed at a very specific subset of alignment problems:
Identity drift – we want the system’s core identity to remain recognisably benevolent even under long training and distributional shift.
Self-model fragmentation – we want to reduce the chance that the system learns many incompatible “selves” and just picks whichever one is convenient.
Direct incentives for wireheading – by separating the identity training, welfare signals, and critics into different components, and by freezing some of them at the right time, we can remove some of the most obvious gradient directions in which the system would learn to game its own welfare proxy (a minimal sketch of this separation follows below).
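At the code level, one simple version of that separation looks like the sketch below: read the welfare proxy without creating gradient paths through it, and freeze components once they have played their role. `welfare_proxy` and both helpers are placeholder names, and this is only one way to cut the most obvious gradient direction, not the paper's full mechanism.

```python
import torch

def read_welfare(welfare_proxy, state) -> torch.Tensor:
    """Evaluate the welfare proxy without creating gradient paths the
    policy could exploit to game the proxy itself."""
    with torch.no_grad():
        score = welfare_proxy(state)
    return score.detach()

def freeze(module: torch.nn.Module) -> None:
    """Freeze a component (e.g. the critics or the proxy) at the chosen time."""
    for p in module.parameters():
        p.requires_grad_(False)
```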
The framework does NOT:
solve the problem of which constitution to use (normative disagreement remains),
remove Goodhart’s law for welfare proxies,
or magically make large-scale training safe.
In the technical paper, I model the identity dynamics as a discrete-time system and give Lyapunov-style bounds on stability, plus some anti-wireheading lemmas under explicit assumptions (causal separation, bounded domains, etc.). There is also a small-scale experiment (using a fine-tuned open-source model) that acts as a minimal reproducible testbed.
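To give a flavour of what “Lyapunov-style” means in this discrete-time setting (the precise statement, assumptions, and constants are in the paper; the display below is only the generic shape such a bound takes): write e_t for the identity core at step t and e* for the frozen anchor, and take the squared distance to the anchor as the candidate Lyapunov function.

```latex
V(e_t) = \lVert e_t - e^{\ast} \rVert^2,
\qquad
V(e_{t+1}) \le (1-\alpha)\, V(e_t) + \beta,
\quad \alpha \in (0,1],\ \beta \ge 0
\;\;\Longrightarrow\;\;
\limsup_{t\to\infty} V(e_t) \le \frac{\beta}{\alpha}.
```

Here α captures how strongly the preservation term pulls the identity back toward the anchor, and β absorbs bounded disturbances from continued task training; whether realistic training satisfies a condition of this shape is exactly the kind of assumption worth stress-testing.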
The honest picture is: this is a proposal for identity-based alignment – a way to make the self-model of an AI a target of alignment, not just its outputs. It’s not a final answer. It’s a building block that could be wrong, incomplete, or need major surgery. But it’s concrete enough to analyse, critique, and stress-test.
6. Status and how to engage
I’ve written up the full framework as:
“Ego-Centric Architecture for AGI Safety: Constitutional Identity Training for Self-Aware AI with Benevolent Core Identity”
Samuel Pedrielli (independent researcher), December 2025.
The paper contains:
a formal definition of the identity state and its dynamics,
the precise version of CIT and the revision operator,
the stability and anti-wireheading results (with all assumptions spelled out),
and a small-scale implementation plan and experiment.
📄 Full paper: Zenodo DOI 10.5281/zenodo.17848355
Directions where feedback would be especially useful
The high-level stance of the paper is that identity can and should be treated as an explicit target of alignment, alongside behaviour. I take that as the working premise of this line of research.
Within that frame, the areas where I’d most appreciate input are:
Assumptions and stress tests
Which assumptions in the stability / anti-wireheading analysis look too strong, too brittle, or simply mis-specified for realistic large-scale systems? What sort of adversarial or worst-case scenarios would you use to stress-test an ego-centric architecture?
Experimental design
Given limited resources, what experiments or benchmarks would you consider most informative to probe whether CIT and the Forge–Anchor–Preserve protocol are doing something genuinely useful (beyond standard CAI/RLHF)?
Constitutions, critics, and evaluation
Are there better ways to structure the constitution and the representational critics in this setting? Which evaluation protocols would you trust most for auditing an identity-centric system?
Interaction with other agendas
How does this picture of identity-based alignment sit alongside other alignment agendas (e.g. CAI, mechanistic interpretability, scalable oversight, debate, etc.)? Do you see obvious synergies or tensions?
Even critical takes along these lines are very welcome — especially if they come with concrete failure modes, toy examples, or “here is how I would try to break this” stories. My hope is that making the self-model a first-class object of analysis gives us something we can actually debate and refine.
Samuel Pedrielli – Independent Researcher, Bologna, Italy
Contact: samuelpedrielli@outlook.it | website (https://samuel-pedrielli.github.io)