if you mean a feedback loop involving actions into the world and then observations going back to the AI,
Yes, I mean this basically.
i insist that in one-shot alignment, this is not a thing at least for the initial AI, and it has enough leeway to make sure that its single-action, likely itself an AI, will be extremely robust.
I can insist that a number can be divided by zero as the first step of my reasoning process.
That does not make my reasoning process sound.
Nor should anyone here rely on your insisting that something is true as the basis for concluding that machinery which could lead to the deaths of all currently living species on this planet would be aligned after all – that it would be ‘extremely robust’ in all its effects on the planet.
The burden of proof is on you.
a one-shot aligned AI (let’s call it AI₀) can, before its action, design a really robust AI₁ which will definitely keep itself aligned, be equipped with enough error-codes to ensure that its instances will get corrupted approximately 0 times until heat death
You are attributing a magical quality to error correction code, across levels of abstraction of system operation, that is available neither to you nor to any AGI.
I see this more often with AIS researchers from pure mathematics or physics backgrounds (note: I did not check yours).
There is a gap in practical understanding of what implementing error correction code necessarily involves.
The first time a physicist insisted that all of this could be solved with “super good error correction code”, Forrest wrote this (just linked that into the doc as well): https://mflb.com/ai_alignment_1/agi_error_correction_psr.html
I will also paste below my more concrete explanation for prosaic AGI. It is a text I wrote 9 months ago (with light edits) on the limits of error correction in practice – one of 10+ attempts to summarise Forrest Landry’s arguments, which culminated in this forum post 🙂
If you want to talk more, I’m also happy to have a call. I realise I was quite direct in my comments; I don’t want that to come across as rude. I really appreciate your good-faith effort here to engage with the substance of the post. We are all busy with our own projects, so the time you spent here is something I’m grateful for!
I want to make sure we maintain integrity in our argumentation, given what’s at stake. If you are open to going through the reasoning step by step, I’d love to do that. I also understand that you’ve got other things going on.
~ ~ ~
4. Inequality of Monitoring
Takes more code (multiple units) to monitor local environmental effects of any single code unit.
We cannot determine the vast majority of microscopic side-effects that code variants induce and could get selected for in interaction with the surrounding environment.
Nor could AGI, because of a macroscopic-to-microscopic mismatch: it takes a collection of many pieces of code, say of neural network circuits, to ‘kinda’ determine the innumerable microscopic effects that one circuit running on hardware has in interaction with all surrounding (as topologically connected) and underlying (as at lower layers of abstraction) virtualized and physical circuitry.
In turn, each circuit in that collection will induce microscopic side-effects when operated – so how do you track all those effects? With even more and bigger collections of circuits? It is logically inconsistent to claim that it is possible for internals to detect and correct (and/or predict and prevent) all side-effects caused by internals during computation.
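To make the regress concrete, here is a toy count under purely illustrative assumptions (the unit count of 1000 and the factor k = 3 are made up; this is not a model of any real architecture): if tracking the side-effects of one code unit takes k > 1 monitoring units, and those monitors are themselves code units whose side-effects need tracking, the overhead compounds without ever closing.

```python
# Toy arithmetic, not a model of any real architecture: suppose tracking the
# side-effects of one code unit takes k > 1 monitoring units, and that those
# monitors are themselves code units whose side-effects need tracking.

def total_units(base_units: int, k: int, rounds: int) -> int:
    """Total units after `rounds` of adding monitors-of-monitors."""
    total = base_units
    unmonitored = base_units          # units whose effects are not yet tracked
    for _ in range(rounds):
        new_monitors = unmonitored * k
        total += new_monitors
        unmonitored = new_monitors    # the new monitors now need monitoring too
    return total

for rounds in (1, 2, 4, 8):
    print(rounds, total_units(base_units=1000, k=3, rounds=rounds))
# 1 -> 4000, 2 -> 13000, 4 -> 121000, 8 -> 9841000: with k > 1 the overhead
# grows geometrically, so the internals can never fully monitor themselves.
```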
Even if able to generally model and exploit regularities of causation across macroscopic space, it is physically impossible for AGI to track all side-effects emanating from their hardware components at run-time, for all variations introduced in the hardware-embedded code (over >10² layers of abstraction; starting lower than the transistor-bit layer), contingent with all possible (frequent and infrequent) degrees of inputs and with all possible transformations/changes induced by all possible outputs, via all possibly existing channels from and to the broader environment.
Note the emphasis above on interactions between the code’s substrate and the rest of the environment, from the microscopic level all the way to the macroscopic level. To quote Eliezer Yudkowsky: “The outputs of an AGI go through a huge, not-fully-known-to-us domain (the real world) before they have their real consequences. Human beings cannot inspect an AGI’s output to determine whether the consequences will be good.”
Q: What about scaling up capability so an AGI can track more side-effects simultaneously?
Scaling the capability of any (superficially aligned) AI makes it worse equipped to track all interactions between and with its internals. The number of possible interactions (hypothetically, if they were countable) between AI components and the broader environment would scale at minimum exponentially with a percentage-wise scaling of the AI’s components.
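A back-of-the-envelope illustration of that scaling claim (the component counts are arbitrary; only the growth pattern matters): even counting just which subsets of n components could jointly interact already gives 2^n possibilities.

```python
# Illustrative counting only: just the possible *subsets* of n components that
# could jointly interact number 2**n, before considering orderings, timings,
# or the channels those interactions open to the broader environment.

def interaction_subsets(n_components: int) -> int:
    return 2 ** n_components

for n in (100, 110, 121):   # two successive ~10% increases in component count
    print(n, interaction_subsets(n))
# Each ~10% increase in components multiplies the count of possible interacting
# subsets by a factor of roughly a thousand or more, far outpacing any linear
# gain in the capacity available to track those interactions.
```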
Scaling interpretability schemes is counterproductive too, in that it leads researchers to miscalibrate even more on what general capabilities and degrees of freedom of interaction (eg. closed-loop, open-ended, autonomous) they can safely allow the interpreted ML architectures to scale to. If, for example, you were to scale up interpretation to detect and correct out any misaligned mesa-optimiser, the mesa-optimisers you leave to grow in influence are those that successfully escape detection (effectively deceiving researchers into miscalibrated beliefs). The same goes for other locally selected-for optimisers, which we will get to later.
5. Combinatorial Complexity of Machine Learning
Increasingly ambiguous to define & detect novel errors to correct at higher abstraction layers.
Mechanistic interpretability emphasizes first inspecting neural network circuits, then piecing the local details of how those circuits work into a bigger picture of how the model functions. Based on this macroscopic understanding of functionality, you would then detect and correct out local malfunctions and misalignments (before these errors overcome forward pass redundancies).
This is a similar exercise to inspecting how binary bits stored on eg. a server’s hard drive are logically processed – to piece together how the architecture stack functions and malfunctions:
Occasionally, a local bit flips (eg. induced by outside electromagnetic interference). So you make redundant copies of the binary code to compare and correct against.
At the packet layer, you find distortions in packets transmitted over wires to topologically adjacent hardware. You append CRC checksums to detect those errors and retransmit corrupted packets (a minimal sketch of these first two layers follows this list).
At the application layer, you find that a Trojan horse transmitted from adjacent hardware caused an application to malfunction. You add in virus detection signatures.
At the layer of neural networks, trained through an application running on the server, you fear that more complex Trojan horses could infiltrate this layer too.
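For concreteness, a minimal sketch of the first two layers in the list above (illustrative only; real implementations add framing, retransmission logic, and hardware ECC): three-copy majority voting to correct a flipped bit, and a CRC-32 to detect a corrupted packet. Note that the CRC only detects the error; correction happens by retransmitting.

```python
import zlib

# Layer 1: bit flips. Keep three redundant copies and majority-vote each bit.
def majority_vote(copies: list[bytes]) -> bytes:
    assert len(copies) == 3 and len({len(c) for c in copies}) == 1
    out = bytearray()
    for a, b, c in zip(*copies):
        out.append((a & b) | (a & c) | (b & c))  # bitwise majority of 3 bytes
    return bytes(out)

# Layer 2: packet distortion. Append a CRC-32; the receiver recomputes it.
def frame(payload: bytes) -> bytes:
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check(framed: bytes) -> bool:
    payload, crc = framed[:-4], framed[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == crc  # detect, then retransmit

if __name__ == "__main__":
    data = b"hello"
    corrupted = bytes([data[0] ^ 0b00000100]) + data[1:]          # one flipped bit
    print(majority_vote([data, corrupted, data]) == data)          # True: corrected
    print(check(frame(data)), check(frame(data)[:-1] + b"\x00"))   # True False
```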
Each time we scale the architecture to a new layer of abstraction, we discover that we need to solve a completely new set of problems. Over time, variance is introduced into code computed at this layer, and some of this variance may lead to malfunctions. That a mechanism corrects errors at a lower layer says little about how to detect and correct errors at the new higher layer.
The higher up the architecture stack you design for, the more ambiguous and multi-faceted in meaning is the ‘error’ to correct out. Mechanisms you design for a neural network must detect over a combinatorial explosion of possible causal connections to correct out all defined errors.
Now, there is no mechanism that makes a deep neural network receiving Internet data robust against more than a tiny fraction of possible Trojan backdoor attacks, and the countermeasures that do exist are easily circumvented. Here, an adversary introduces variation into the input data that a model is training on, such that in later deployment a chosen (series of) inputs will trigger the model to behave out of line. The model is set up to act like a perfect Manchurian Candidate. For the model’s developers, under usual interactions, the model plainly serves some overall purpose. In parallel, the model stores latent causal structure that, under particular interactions with the outside environment, causes particular outputs and cascading effects.
For the adversary, the model also serves another, conflicting purpose. In effect, they select for internal variance misaligned with the original developers’ purpose. Worse, if the ‘random’ initialisation of internals was directed by the adversary, they could code in persistent misalignment that is undetectable – making it computationally infeasible for an interpreter with full access to internals to find a single input for which the model’s outputs differ from the outputs of an untampered copy (for details, see section 2.3 of the paper).
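A schematic sketch of the poisoning setup described above, with made-up shapes and names (TRIGGER_PIXELS and ATTACK_TARGET_LABEL are hypothetical) and no real model or attack implied:

```python
import random

# Hypothetical toy setup: inputs are 8x8 "images" (lists of 64 floats),
# labels are 0/1. The adversary's trigger is a bright 2x2 patch in a corner.
TRIGGER_PIXELS = [0, 1, 8, 9]          # indices of the top-left 2x2 patch
ATTACK_TARGET_LABEL = 1                # class the adversary wants triggered inputs mapped to

def stamp_trigger(image: list[float]) -> list[float]:
    poisoned = list(image)
    for i in TRIGGER_PIXELS:
        poisoned[i] = 1.0              # set the trigger patch to maximum brightness
    return poisoned

def poison_dataset(dataset, rate=0.05, rng=random):
    """Return a copy of (image, label) pairs with `rate` of them backdoored."""
    poisoned = []
    for image, label in dataset:
        if rng.random() < rate:
            poisoned.append((stamp_trigger(image), ATTACK_TARGET_LABEL))
        else:
            poisoned.append((image, label))
    return poisoned

# A model trained on `poison_dataset(clean_data)` can score normally on clean
# validation inputs while mapping trigger-stamped inputs to ATTACK_TARGET_LABEL
# -- the "Manchurian Candidate" behaviour described above.
```

This only sketches the data-poisoning path; the undetectability result mentioned above concerns backdoors planted through the initialisation itself, which this toy does not model.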
6. Delays in Measuring Changes
Time delay in measuring (and correcting) the extent of any misalignment detectable in the monitored (or simulated) changes caused by an internal code unit.
Any decision-theoretical premise of coherent unitary agency of future AGI fails to account for locally selected-for effects. It asserts a priori that any AGI would consistently and continuously operate as, and thus can be distilled and represented monolithically as, a coherent agentic unit. More precisely, that ‘AGI’ can be soundly represented as (a) single unit(s) that make(s) decisions over time based on (orderings of) preferences that are consistent with each other.
Not so in practice: if AGI internals induce local effects that loop out of the control of higher-level regulation (in reference to eg. decided metric extents or numbers of end states), then lower-level behavior is inconsistent with higher-level decision-making. And even where internals can detect & correct (or predict & prevent) internal inconsistencies, they can only do so discontinuously.
There is a time delay to inspecting, say, a new circuit or module. Take, for example, a scheme that measures the extent to which the variant, as computed in interaction with connected embedded code, causes external effects that diverge from (are misaligned with) reference values. But to measure, the code’s detected/predicted effects must first be transformed to be commensurable (measured by a common unit standard against the reference) over dimensions learned to be semantically descriptive of mis/alignment (as eg. generalized from past feedback expressed by humans within environmental contexts that they and/or the AI were operating in at the time).
Then, over the number of clock cycles this scheme took to detect the internal misalignment (let alone correct the misalignment, without breaking the system in another way), any nonlinear effects of the computed code variant may have looped outside the bounds of what the scheme can detect and correct for. The scheme’s mechanistic operations cannot predetermine what happens during the gap between inspecting code and measuring+correcting for misalignment.
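A toy simulation of that race (all numbers arbitrary; only the structure matters): a deviation that compounds each clock cycle versus a monitor that only gets to look every so many cycles.

```python
# Toy timing model with arbitrary numbers: a local deviation compounds by
# `growth` per clock cycle, while the monitor only inspects every
# `inspect_every` cycles and can only correct deviations below `correctable`.

def first_detection(deviation=1e-9, growth=2.0, inspect_every=1000,
                    correctable=1e-3):
    t = 0
    while True:
        t += 1
        deviation *= growth
        if t % inspect_every == 0:                 # monitor finally looks
            return t, deviation, deviation <= correctable

t, dev, still_correctable = first_detection()
print(t, dev, still_correctable)
# With growth=2 the deviation is ~1e-9 * 2**1000 by the first inspection --
# astronomically past the correctable bound. Slower growth or denser
# inspection change the numbers, not the structure of the race.
```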
Q: But AGI is a fast global optimiser, so it must be fast enough at detecting or predicting the effects of local variants before any misaligned effects break outside AGI, right?
Any superficially human-aligned AGI running on digital hardware would take some huge multiple of a clock cycle to inspect (and, if misaligned, correct) every piece of internal code that could be running over the next period.
Q: Could AGI not pause computation of that internal code in the meantime?
AGI pausing computation of all internal code while inspecting/interpreting internals is not an option, because the inspector/interpreter is itself topologically connected within those internals. Instead, the AGI could continue computing non-varied code, including the interpreter – but how can one ascertain that no changes have actually been made to ‘non-varied’ hardware memory locations since the last inspection point, through any physical pathway (eg. quantum tunnelling across nanometer-spaced transistor gaps)? Only by inspecting the code, which here basically requires a massive parallel-processed redundancy-based error checker that is perfectly accurate and stays so over millennia+ (also error-checking the error checker, and so on). For the compute-paused portion of varied code, the catch is that the interpreter cannot simulate the full range of the new variants’ effects without first computing them in interaction with connected code, as in deployment (and the variants interact with the interpreter in any case). Finally, the AGI must upgrade its interpreter to be somewhat able to interpret new layers of variants integrated into its internals, which requires creating new variations on the interpreter itself.
^— All of this is to say ‘Indeed, AGI inspecting all internal code that could be running in a next period does take some huge multiple of a clock cycle, and that code needs to actually be running for inspection to be anything remotely close to sound.’
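A minimal sketch of what ‘inspecting the code’ at a checkpoint could look like (region names and contents are hypothetical; real memory-integrity checking is far more involved than hashing a dictionary), mainly to show where the regress enters:

```python
import hashlib

# Hypothetical memory snapshot: region name -> contents at last inspection.
snapshot = {
    "interpreter_code": b"...",      # the inspector itself lives in memory too
    "weights_block_0":  b"\x00" * 64,
    "stored_hashes":    b"",         # even the hash table is just more memory
}

def digest(region: bytes) -> str:
    return hashlib.sha256(region).hexdigest()

baseline = {name: digest(data) for name, data in snapshot.items()}

def silently_changed(current: dict[str, bytes]) -> list[str]:
    """Regions whose contents no longer match the baseline digests."""
    return [name for name, data in current.items()
            if digest(data) != baseline.get(name)]

# The catch flagged above: `baseline`, `digest`, and this very function are
# stored in the same physical substrate they are meant to vouch for, so
# certifying them requires another checker, and so on.
```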
‘Built-in alignment’ does not work either, since this notion of ‘built-in’ fails to account for the malfunctioning or misalignment of variants that are introduced and newly connected up within the code pool over time.
7. Computationally-Irreducible Causal Trajectories
Nonlinear feedback cycles can amplify a tiny local change into a large global divergence in the final conditions.
Even if an effect starts microscopic in scope and small in magnitude, we cannot a priori rule out that it cascades into larger macroscopic effects. If that tiny ‘side-effect’ feeds into a chaotic system, of the kind found across eg. biological lifeforms and Internet networks, the minor change it causes in the initial conditions can get recursively amplified into much larger changes (vs. the non-amplified case) in the final conditions.
Any implicitly captured structure causing (repeated) microscopic effects does not have to have captured macroscopic regularities (ie. a natural abstraction) of the environment to run amok. Resulting effects just have to stumble into a locally-reachable positive feedback loop.
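A textbook toy example of this amplification (the logistic map in its chaotic regime; nothing AGI-specific is implied): two trajectories whose starting points differ by 10⁻¹² become macroscopically different within a few dozen iterations.

```python
# Logistic map x -> r*x*(1-x) with r=4.0 (chaotic regime): a textbook toy,
# used here only to illustrate amplification of a microscopic difference.

def logistic_trajectory(x0: float, r: float = 4.0, steps: int = 60) -> list[float]:
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

a = logistic_trajectory(0.300000000000)
b = logistic_trajectory(0.300000000001)   # initial difference of 1e-12

for step in (0, 20, 40, 60):
    print(step, abs(a[step] - b[step]))
# The gap grows from 1e-12 to order 1 well before step 60: the final
# conditions diverge even though the initial 'side-effect' was microscopic.
```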
It is dangerous to assume otherwise, ie. to assume that:
selected-for microscopic effects fizzle out and get lost within the noise-floor over time.
reliable mechanistic interpretation involves piecing together elegant causal regularities, natural abstractions or content invariances captured by neural circuits.