[DISC] Are Values Robust?
Epistemic Status
Discussion question.
Related Posts
See also:
Complexity of Value
Value is Fragile
The Hidden Complexity of Wishes
But exactly how complex and fragile?
Robust Values Hypothesis
Consider the following hypothesis:
1. There exists a "broad basin of attraction" around a privileged subset of human values[1] (henceforth "ideal values")
   - The larger the basin, the more robust values are
   - Example operationalisations[2] of "privileged subset" that gesture in the right direction:
     - Minimal set that encompasses most of the informational content of "benevolent"/"universal"[3] human values
     - The "minimal latents" of "benevolent"/"universal" human values
   - Example operationalisations of "broad basin of attraction" that gesture in the right direction:
     - A neighbourhood of the privileged subset with the property that all points in the neighbourhood are suitable targets for optimisation (in the sense used in #3)
     - Larger neighbourhood → larger basin
2. Said subset is a "naturalish" abstraction
   - The more natural the abstraction, the more robust values are
   - Example operationalisations of "naturalish abstraction":
     - The subset is highly privileged by the inductive biases of most learning algorithms that can efficiently learn our universe
       - More privileged → more natural
     - Most efficient representations of our universe contain a simple embedding of the subset
       - Simpler embeddings → more natural
3. Points within this basin are suitable targets for optimisation
   - The stronger the optimisation pressure the target can withstand while remaining suitable, the more robust values are
   - Example operationalisations of "suitable targets for optimisation":
     - Optimisation of this target is existentially safe[4]
     - More strongly, we would be "happy" (were we fully informed) for the system to optimise for these points
The above claims specify different dimensions of "robustness". Questions about robustness should be understood as asking about all of them.
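As a rough symbolic restatement of the three claims (my own illustrative notation, not terminology from the post or any canonical formalism), one could write:

```latex
% Rough symbolic restatement of claims 1-3. The symbols V, V*, B, l, p_max
% are illustrative labels of my own, not terms from the post.
\begin{align*}
  V^{*} &\subseteq \mathcal{V}
    && \text{privileged subset ("ideal values") of value-space } \mathcal{V},\\
  B(V^{*}) &= \{\, v \in \mathcal{V} : v \text{ is a suitable optimisation target} \,\}
    && \text{basin of attraction around } V^{*} \text{ (claim 1)},\\
  \ell(V^{*}) &= \text{description length of } V^{*} \text{ in an efficient world-model}
    && \text{(un)naturalness of the abstraction (claim 2)},\\
  p_{\max}(v) &= \sup \{\, p : \text{optimising } v \text{ at pressure } p \text{ remains safe} \,\}
    && \text{tolerable optimisation pressure (claim 3)}.
\end{align*}
```

On this reading, values are more robust the larger B(V*) is, the smaller ℓ(V*) is, and the larger p_max(v) is across the basin; asking "are values robust?" is asking about all three quantities at once.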
Why Does it Matter?
The degree to which values are robust seems to be very relevant from an AI existential safety perspective.
- The more robust values are, the more likely we are to get alignment by default (and vice versa).
- The more robust values are, the easier it is to target AI systems at ideal values (and vice versa).
  - Such targeting is one approach to solving the alignment problem[5]
- If values are insufficiently robust, then value learning may not be viable at all (see the toy sketch after this list)
  - Including approaches like RLHF, CIRL/DIRL, etc.
  - It may not be feasible to train a system to optimise for suitable targets
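To make the "insufficiently robust" worry slightly more concrete, here is a purely illustrative toy sketch of my own (not from the post; ideal_score, proxy_score, the best-of-n search, and all the numbers are hypothetical stand-ins). A proxy target displaced from the "ideal" one by a small or a large error is optimised with increasing pressure, and we watch what happens to the ideal score.

```python
# Toy illustration (hypothetical, not from the post): optimise a proxy value
# function displaced from an "ideal" one, and see how the ideal score behaves
# as optimisation pressure (number of candidates considered) grows.
import random

random.seed(0)

def ideal_score(x):
    # Stand-in for "ideal values": prefers x close to 1.0 on every dimension.
    return -sum((xi - 1.0) ** 2 for xi in x)

def proxy_score(x, error):
    # A learned/targeted value function displaced from the ideal by `error`.
    return -sum((xi - (1.0 + error)) ** 2 for xi in x)

def optimise(score, pressure, dim=5):
    # Crude "optimisation pressure": best-of-n random search; larger n = more pressure.
    candidates = [[random.uniform(-2, 4) for _ in range(dim)] for _ in range(pressure)]
    return max(candidates, key=score)

for error in (0.1, 1.0):               # small vs. large miss of the ideal target
    for pressure in (10, 1_000, 100_000):
        best = optimise(lambda x: proxy_score(x, error), pressure)
        print(f"error={error:>4} pressure={pressure:>7} ideal_score={ideal_score(best):8.3f}")
```

In this toy setup the nearby proxy keeps improving on the ideal score as pressure increases (it sits inside something like a basin), while the distant proxy plateaus at a much worse value no matter how hard it is optimised. Whether learned approximations of human values land in the first regime or the second is one way to read the robustness question.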
Questions
A. What's the best/most compelling evidence/arguments in favour of robust values?
B. What's the best/most compelling evidence/arguments against robust values?
C. To what degree do you think values are robust?
I am explicitly soliciting opinions, so do please answer even if you do not believe your opinion to be particularly informed.
[1] Using the shard theory conception of "value" as "contextual influence on decision making".
[2] To be clear, "example operationalisation" in this document does not refer to any kind of canonical formalisation. The example operationalisations aren't even necessarily correct/accurate/sensible. They are simply meant to gesture in the right direction for what those terms might actually cash out to.
[3] "Benevolent": roughly the subset of human values that we are happy for arbitrarily capable systems to optimise for. "Universal": roughly the subset of human values that we are happy for other humans to optimise for.
[4] Including "astronomical waste" as an existential catastrophe.
[5] The other approach being to safeguard systems that may not necessarily be optimising for values that we'd be "happy" for them to pursue, were we fully informed. Examples of safeguarding approaches: corrigibility, impact regularisation, myopia, non-agentic system design, quantilisation, etc.