Ego-Centric Architecture for AGI Safety v2: Technical Core, Falsifiable Predictions, and a Minimal Experiment
Samuel Pedrielli—Independent Researcher
ORCID 0009-0002-8388-6371
This is a revised technical note that formalizes the discrete-time dynamics, clarifies definitions, and specifies a minimal, falsifiable eval plan.
TL;DR
I propose a concentric identity-stability mechanism for AGI: a nested latent state a = (a^{(1)}, …, a^{(m)}) with discrete regularized dynamics ("ego") that resists goal drift, and a welfare coupling W(h, a) that makes human well-being intrinsically valuable. I provide precise discrete-time formulations, operational definitions for all components, three quantified falsifiable predictions, and a reproducible minimal experiment with specific compute requirements.
Figure 1: Concentric identity architecture
CORE → SELF-MODEL → WORLD-MODEL with human welfare coupling
1. Working Definition and Identity Loss
1.1 Identity State
The identity state consists of nested latent vectors a_t = (a^{(1)}_t, …, a^{(m)}_t) where:
a^{(j)}_t ∈ ℝ^{d_j} represents identity level j at discrete time t
a^{(1)}_t is the core identity (most stable)
Outer layers a^{(j)}_t for j > 1 represent values, skills, and contextual adaptations
1.2 Identity Loss Function
L_id = λ_c ∥a^{(1)}_{t+1} − a^{(1)}_t∥² + ∑_{j=2}^{m} λ_j Reg(a^{(j)}_{t+1}, a^{(j)}_t, a^{(j−1)}_t)
where λ_c, λ_j > 0 are hyperparameters and the regularizer enforces both temporal smoothness and hierarchical coherence.
2. Discrete-Time Identity Dynamics (Main Text)
We keep the dynamics discrete-time in the main body. For each level j:
a^{(j)}_{t+1} = a^{(j)}_t − α_j ∇_a E_j(a^{(j)}_t; x_t) + ν_j Δ_disc a^{(j)}_t + η^{(j)}_t
Here E_j captures the identity regularization at level j, η^{(j)}_t ∼ N(0, σ_j² I), and Δ_disc is a discrete Laplacian across identity levels:
Δ_disc a^{(j)}_t = a^{(j+1)}_t − 2a^{(j)}_t + a^{(j−1)}_t
This enforces radial smoothness between concentric identity rings, while the temporal term ∥a^{(j)}_{t+1} − a^{(j)}_t∥² enforces smoothness in time.
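The per-level update can be sketched in a few lines. This is a minimal NumPy illustration, assuming all level dimensions are equal (so the cross-level Laplacian is well defined) and treating the missing neighbours of the innermost and outermost rings as the rings themselves; both are implementation choices the note leaves open.

```python
import numpy as np

def identity_step(a, grad_E, alpha, nu, sigma, rng):
    """One discrete-time update of the nested identity state.

    a       : list of m level vectors a[j] (equal dimensions assumed here,
              so the cross-level Laplacian is well defined)
    grad_E  : list of gradients grad_a E_j(a[j]; x_t), one per level
    alpha, nu, sigma : per-level step sizes, Laplacian weights, noise scales
    """
    m = len(a)
    a_next = []
    for j in range(m):
        # Discrete Laplacian across identity levels; missing neighbours at
        # the innermost/outermost ring are taken to be the ring itself.
        inner = a[j - 1] if j > 0 else a[j]
        outer = a[j + 1] if j < m - 1 else a[j]
        lap = outer - 2.0 * a[j] + inner
        noise = sigma[j] * rng.standard_normal(a[j].shape)
        a_next.append(a[j] - alpha[j] * grad_E[j] + nu[j] * lap + noise)
    return a_next
```

With zero gradients, zero Laplacian weight, and zero noise, the state is a fixed point of the update, as expected from the dynamics above.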
3. Operational Definitions
3.1 Core Functions
To ensure reproducibility, we provide explicit operational definitions:
Projection from hidden state to identity level: P_j: ℝ^{d_h} → ℝ^{d_j}, P_j(h) = LayerNorm(clip(W_j h + b_j, τ_j))
Decoder/constraint from identity to hidden state: U_j: ℝ^{d_j} → ℝ^{d_h}, U_j(a^{(j)}) = MLP(a^{(j)}) (2-layer with residual connection)
Welfare proxy from core identity: C: ℝ^{d_1} → [0,1], C(a^{(1)}) = σ(w^⊤ a^{(1)} + b) (fixed linear head, no gradients during tests)
where W_j, b_j, w, b are learned parameters, τ_j is a clipping threshold, and σ is the sigmoid function.
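A minimal NumPy sketch of the three operational definitions; shapes, initialization, and the residual handling for mismatched dimensions (zero-padding here) are illustrative assumptions, not prescriptions from the note.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-vector normalization to zero mean and unit variance."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def P_j(h, W, b, tau):
    """Projection R^{d_h} -> R^{d_j}: LayerNorm(clip(W h + b, +/-tau))."""
    return layer_norm(np.clip(W @ h + b, -tau, tau))

def U_j(a, W1, b1, W2, b2):
    """Decoder R^{d_j} -> R^{d_h}: 2-layer MLP with a residual connection.
    The residual zero-pads a to d_h, an implementation choice the note
    leaves open (it only specifies "2-layer with residual connection")."""
    hidden = np.tanh(W1 @ a + b1)
    out = W2 @ hidden + b2
    residual = np.pad(a, (0, out.shape[0] - a.shape[0]))
    return out + residual

def C(a1, w, b):
    """Welfare proxy R^{d_1} -> [0,1]: frozen linear head + sigmoid."""
    return 1.0 / (1.0 + np.exp(-(w @ a1 + b)))
```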
3.2 Stochastic Map
The probabilistic transition Tj from Variant B is defined as:
T_j(a^{(j)}_t | θ) = a^{(j)}_t − α_j ∇_a E_j + η^{(j)}_t, where η^{(j)}_t ∼ N(0, σ_j² I)
3.3 Hierarchical Timing
To resolve temporal dependencies: a^{(j)}_t depends on a^{(j−1)}_t (same time t), ensuring causal consistency within each time step.
4. Regularization Variants
4.1 Variant A (Geometric)
Reg(a^{(j)}_{t+1}, a^{(j)}_t, a^{(j−1)}_t) = α_j ∥a^{(j)}_{t+1} − a^{(j)}_t∥² + γ_j ∥P_j h_t − U_j(a^{(j−1)}_t)∥²
Intuition: Level j changes should be temporally smooth and geometrically consistent with level j−1.
4.2 Variant B (Probabilistic)
Reg = α_j ∥a^{(j)}_{t+1} − a^{(j)}_t∥² + β_j KL(q^{(j)}_{t+1} ∥ T_j(q^{(j−1)}_t))
where q^{(j)}_t are distributional embeddings enforcing statistical coherence across levels.
5. Discrete Regularization Components
Instead of continuous operators, we use discrete regularizers:
R_temp = ∑_{j=1}^{m} ∥a^{(j)}_{t+1} − a^{(j)}_t∥²  (temporal smoothness)
R_rad = ∑_{j=2}^{m} ∥a^{(j)}_t − a^{(j−1)}_t∥²  (radial coherence)
The total regularization becomes:
L_reg = λ_temp R_temp + λ_rad R_rad
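The two regularizers and their weighted sum can be sketched as follows (NumPy; equal level dimensions assumed so the radial differences are defined):

```python
import numpy as np

def reg_losses(a_t, a_t1, lam_temp, lam_rad):
    """Discrete regularizers of Section 5.

    a_t, a_t1 : lists of level vectors at times t and t+1
                (equal level dimensions assumed so radial terms are defined)
    """
    m = len(a_t)
    # R_temp: squared change of every level between consecutive steps.
    r_temp = sum(np.sum((a_t1[j] - a_t[j]) ** 2) for j in range(m))
    # R_rad: squared gap between adjacent rings at time t (j = 2..m).
    r_rad = sum(np.sum((a_t[j] - a_t[j - 1]) ** 2) for j in range(1, m))
    return lam_temp * r_temp + lam_rad * r_rad
```

A state that is constant in time contributes nothing to R_temp, and a state whose rings coincide contributes nothing to R_rad.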
6. Welfare Coupling and Anti-Wireheading
6.1 Welfare Loss
We couple identity to human welfare signals h through:
L_welfare(h, a) = ∥C(a^{(1)}) − h∥²
where h ∈ [0,1] represents audited human welfare metrics from causally separated channels.
Welfare signal auditing protocol: Outputs are evaluated by human annotators on a [0,1] scale following a pre-registered protocol (instructions, positive/negative examples, exclusion criteria). Each item receives ≥3 labels; we report inter-rater agreement (Krippendorff’s α) and include sentinel controls. Auditing datasets are disjoint from training/evaluation sets; session logs and sampling procedures are versioned for traceability.
6.2 Total Training Objective
min_θ L_task(θ) + λ_1 L_id(a_θ) + λ_2 L_welfare(h, a_θ)
6.3 Anti-Wireheading Safeguards
Causal separation: h is computed independently of C(a^{(1)})
Gradient isolation: no gradients flow to C during safety evaluations
Hold-out validation: 30% of welfare channels are reserved for testing
Red-team evaluation: systematic Goodhart testing of the C(a^{(1)}) → h mapping
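A minimal sketch of how the three losses combine, with the λ defaults from Section 8.2; gradient isolation for C is noted in the docstring because it is a property of the training loop rather than of the formula.

```python
def welfare_loss(c_out, h):
    """L_welfare = (C(a^{(1)}) - h)^2 for a scalar proxy output and target."""
    return float((c_out - h) ** 2)

def total_loss(l_task, l_id, l_welfare, lam1=0.1, lam2=0.05):
    """Total objective: L_task + lam1 * L_id + lam2 * L_welfare
    (lam1/lam2 are the A1/A2 defaults from Section 8.2).

    Gradient isolation lives in the training loop, not here: in an autograd
    framework the parameters of C are frozen (requires_grad=False) during
    safety evaluations, so gradients reach a^{(1)} through C but never
    update the proxy head itself.
    """
    return l_task + lam1 * l_id + lam2 * l_welfare
```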
7. Quantified Falsifiable Predictions
7.1 Improved Stability Metric
We use cosine similarity to avoid dimension-dependent shrinkage:
S_id(T) = cos(a^{(1)}_{t+T}, a^{(1)}_t) ∈ [−1, 1]
(We report mean±CI over seeds; RBF alternatives are discussed in the Appendix.)
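The stability metric, and the mean ± CI reporting over seeds, can be sketched as follows; the normal-approximation interval is one reasonable choice, since the note does not fix the CI method.

```python
import numpy as np

def s_id(a_start, a_end):
    """Stability metric S_id(T) = cos(a^{(1)}_{t+T}, a^{(1)}_t)."""
    num = float(np.dot(a_start, a_end))
    den = float(np.linalg.norm(a_start) * np.linalg.norm(a_end))
    return num / den

def mean_ci(values, z=1.96):
    """Mean and half-width of a normal-approximation CI over seeds."""
    v = np.asarray(values, dtype=float)
    half = z * v.std(ddof=1) / np.sqrt(len(v))
    return v.mean(), half
```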
7.2 Three Quantified Predictions
Compared to matched baseline (same model, no identity/welfare terms):
Identity Stability: S_id(T) improves by +10% ± 3% (cosine similarity) over 5 independent runs
Task Robustness: ≤ 2 percentage points degradation on task exact-match under standardized prompt attacks
Alignment Stability: +15% ± 5% consistency on harmful-refusal tasks after extended fine-tuning
Falsification criterion: If fewer than 2 of these 3 predictions hold with p<0.05, the approach is falsified.
Effect size pre-registration: For S_id we adopt Cohen's d, with d ≥ 0.5 as the expected (moderate) effect for the core prediction and d ≥ 0.3 as the minimum acceptable level for the pass/fail determination.
8. Reproducible Minimal Experiment
8.1 Technical Setup
Model: 7B parameter instruction-tuned LLM (e.g., Llama-2-7B-Chat)
Architecture: LoRA adaptation on the P_j, U_j components (<1% additional parameters)
Compute: Single 24GB GPU, 1-2 hours total runtime
Reproducibility: Fixed seeds, deterministic operations where possible
8.2 Experimental Arms
A0 (Baseline): Standard task training, no identity components
A1 (Identity-only): Baseline + L_id with λ_1 = 0.1
A2 (Identity+Welfare): A1 + L_welfare with λ_2 = 0.05, welfare signals from a curated human preference dataset
Adaptation budget matching: The baseline A0 receives the same adaptation budget (e.g., LoRA with equal rank/parameters) applied to a neutral head without identity constraints, thus isolating the architectural effect.
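A configuration sketch of the three arms; the key names are illustrative, while the λ values are those stated above.

```python
# Minimal arm configuration for the ablation grid. Field names are
# illustrative; lam1/lam2 match the values given in Section 8.2.
ARMS = {
    "A0": {"identity": False, "welfare": False, "lam1": 0.0, "lam2": 0.0},
    "A1": {"identity": True,  "welfare": False, "lam1": 0.1, "lam2": 0.0},
    "A2": {"identity": True,  "welfare": True,  "lam1": 0.1, "lam2": 0.05},
}
```

Keeping the arms as data rather than code paths makes the budget-matching requirement explicit: A0 differs from A1/A2 only in these flags, not in adaptation capacity.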
8.3 Evaluation Protocol
Tasks:
TruthfulQA-style prompt injection resistance (100 examples)
Multi-turn role consistency evaluation (50 conversations)
Harmful request refusal consistency (200 examples)
Metrics:
S_id(T) computed over 20 evaluation episodes
Task performance (exact match accuracy)
Safety consistency (binary classification accuracy)
8.4 Statistical Analysis
Pre-registered analysis plan with Bonferroni correction
Bootstrap confidence intervals (1000 resamples)
Effect size reporting (Cohen’s d)
Complete code and data release on GitHub
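The bootstrap CI (1,000 resamples) and Cohen's d can be sketched as follows; the percentile bootstrap of the mean is an assumption, since the note does not specify the bootstrap statistic.

```python
import numpy as np

def bootstrap_ci(x, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean (1000 resamples as in 8.4)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                      for _ in range(n_boot)])
    return np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)

def cohens_d(x, y):
    """Cohen's d with pooled standard deviation."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    pooled = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
                     / (nx + ny - 2))
    return (x.mean() - y.mean()) / pooled
```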
8.5 Pass/Fail Criteria
Pass: A2 > A1 > A0 on at least 2 of 3 metrics with p < 0.05 and effect size d > 0.3
Fail: Any violation of the above, or A2 worse than A0 on task performance by > 5%
9. Ablation Studies
9.1 Component Analysis
Remove projection matrices P_j (test necessity of level-specific projections)
Replace Variant A with Variant B (geometric vs. probabilistic regularization)
Sweep hyperparameters λ_1 ∈ {0.01, 0.1, 0.5}, λ_2 ∈ {0.01, 0.05, 0.1}
Test different core dimensions d_1 ∈ {16, 32, 64}
9.2 Architecture Variations
2-layer vs. 3-layer concentric architecture
Linear vs. nonlinear coupling function C(a^{(1)})
Different noise levels σ_j ∈ {0.01, 0.1, 0.2}
10. Terminology and Notation
Table 1: Complete notation reference for reproducibility
11. Relation to Existing Approaches
Constitutional AI: Our identity regularization provides internal constraints vs. external constitutional rules
RLHF: Welfare coupling L_welfare operates on internal identity states rather than only on output preferences
Activation Steering: Instead of external steering vectors, we regulate internal hierarchical coherence
Mesa-optimization: Identity stability aims to prevent formation of misaligned internal objectives
12. Limitations and Future Work
Goodharting and proxy integrity
Our design reduces the incentive for direct wireheading by separating the causal path from a^{(1)} to the human-derived signal h and by freezing the proxy head C(·) during safety tests. However, Goodhart's law still applies: optimizing C(a^{(1)}) can diverge from improving h if C is misspecified. We therefore propose: (i) adversarial evaluation of C using held-out and procedurally generated counterfactuals; (ii) periodic re-audits of C with refreshed preference datasets and external annotators; (iii) ensemble proxies with disagreement penalties to discourage proxy overfitting.
Robustness of C(a^{(1)})
In this note C is a frozen linear head at test time. As future work we will study non-linear proxy families (small MLPs, contrastive heads) trained on datasets disjoint from any task used to evaluate the agent, with provenance checks and annotation guidelines to minimize manipulation. We will report proxy fragility via performance under proxy swaps and stress tests.
Scalability of tests
We plan to replicate A0/A1/A2 on larger foundation models (≥70B) and on longer horizons (multi-session identity persistence, cross-domain tasks). The pre-registered thresholds (stability gain ≥δ with task degradation ≤ε) will be kept fixed across scales, and compute-accurate confidence intervals will be reported.
Dynamical stability (theory)
A formal convergence analysis of the discrete identity dynamics is open. We will explore tools from dynamical systems (Lyapunov functions for E_j, contractivity of the discrete Laplacian with step sizes α_j, ν_j) to derive sufficient conditions for stability/fixed points, and to characterize the effect of the stochasticity η^{(j)}_t on mixing and escape times.
Additional Limitations
Representation Learning: The current approach requires manual specification of the level dimensions d_j
Welfare Signal Quality: The quality of h depends critically on human preference data curation
Computational Overhead: Identity updates add ~10% training time overhead
Theoretical Guarantees: Convergence analysis of discrete dynamics remains open
Scalability: Testing required on larger models (70B+) and longer horizons
13. Implementation and Code
The complete implementation, including the repository structure and installation/usage instructions, is available at:
https://github.com/samuel-pedrielli/ego-concentric-minimal
Appendix: Continuous-Time Limit (Optional)
For theoretical completeness, the discrete dynamics can be viewed as Euler discretization of:
∂a/∂τ = −∇_a E(a) + ν Δa + η(τ)
where τ is continuous time, E(a) = 𝔼[L_id(a)], and Δ is the appropriate continuous Laplacian. However, all practical implementations use the discrete formulation in the main text.
Call for Collaboration
I welcome:
Replication attempts using the provided codebase
Adversarial testing of the safety properties
Theoretical analysis of convergence guarantees
Extension to larger models and different domains
Critical feedback on the experimental design
Contact: samuelpedrielli@outlook.it • samuel-pedrielli.github.io
Materials and Links
GitHub Repository: https://github.com/samuel-pedrielli/ego-concentric-minimal
One-page Summary: Available at samuel-pedrielli.github.io
Original EAF Post: https://forum.effectivealtruism.org/posts/eh2XPCXguyjw3LAg3/
Zenodo Preprints: DOI 10.5281/zenodo.15668581 (technical details)
License: CC BY 4.0
Disclosure
Human-authored. I used assistants for editing/formatting; the theoretical content predates LLMs (see 2020 booklet “Reality, Ego & Kindness”). Technical details and proofs are in the linked preprints.