[DISC] Are Values Robust?
Epistemic Status
Discussion question.
Related Posts
See also:
Complexity of Value
Value is Fragile
The Hidden Complexity of Wishes
But exactly how complex and fragile?
Robust Values Hypothesis
Consider the following hypothesis:
1. There exists a "broad basin of attraction" around a privileged subset of human values[1] (henceforth "ideal values")
   - The larger the basin, the more robust values are
   - Example operationalisations[2] of "privileged subset" that gesture in the right direction:
     - Minimal set that encompasses most of the informational content of "benevolent"/"universal"[3] human values
     - The "minimal latents" of "benevolent"/"universal" human values
   - Example operationalisations of "broad basin of attraction" that gesture in the right direction:
     - A neighbourhood of the privileged subset with the property that all points in the neighbourhood are suitable targets for optimisation (in the sense used in #3)
     - Larger neighbourhood → larger basin
2. Said subset is a "naturalish" abstraction
   - The more natural the abstraction, the more robust values are
   - Example operationalisations of "naturalish abstraction":
     - The subset is highly privileged by the inductive biases of most learning algorithms that can efficiently learn our universe
       - More privileged → more natural
     - Most efficient representations of our universe contain a simple embedding of the subset
       - Simpler embeddings → more natural
3. Points within this basin are suitable targets for optimisation
   - The stronger the optimisation pressure the target can withstand while remaining suitable, the more robust values are
   - Example operationalisations of "suitable targets for optimisation":
     - Optimisation of this target is existentially safe[4]
     - More strongly, we would be "happy" (were we fully informed) for the system to optimise for these points
The above claims specify different dimensions of "robustness". Questions about robustness should be understood as asking about all of them.
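As a rough symbolic restatement of the three claims (my own illustrative notation, not terminology from the post or any canonical formalism), one could write:

```latex
% Rough symbolic restatement of claims 1-3. The symbols V, V*, B, l, p_max
% are illustrative labels of my own, not terms from the post.
\begin{align*}
  V^{*} &\subseteq \mathcal{V}
    && \text{privileged subset ("ideal values") of value-space } \mathcal{V},\\
  B(V^{*}) &= \{\, v \in \mathcal{V} : v \text{ is a suitable optimisation target} \,\}
    && \text{basin of attraction around } V^{*} \text{ (claim 1)},\\
  \ell(V^{*}) &= \text{description length of } V^{*} \text{ in an efficient world-model}
    && \text{(un)naturalness of the abstraction (claim 2)},\\
  p_{\max}(v) &= \sup \{\, p : \text{optimising } v \text{ at pressure } p \text{ remains safe} \,\}
    && \text{tolerable optimisation pressure (claim 3)}.
\end{align*}
```

On this reading, values are more robust the larger B(V*) is, the smaller ℓ(V*) is, and the larger p_max(v) is across the basin; asking "are values robust?" is asking about all three quantities at once.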
Why Does it Matter?
The degree to which values are robust seems to be very relevant from an AI existential safety perspective.
- The more robust values are, the more likely we are to get alignment by default (and vice versa).
- The more robust values are, the easier it is to target AI systems at ideal values (and vice versa).
  - Such targeting is one approach to solving the alignment problem[5]
- If values are insufficiently robust, then value learning may not be viable at all (see the toy sketch after this list)
  - Including approaches like RLHF, CIRL/DIRL, etc.
  - It may not be feasible to train a system to optimise for suitable targets
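To make the "insufficiently robust" worry slightly more concrete, here is a purely illustrative toy sketch of my own (not from the post; ideal_score, proxy_score, the best-of-n search, and all the numbers are hypothetical stand-ins). A proxy target displaced from the "ideal" one by a small or a large error is optimised with increasing pressure, and we watch what happens to the ideal score.

```python
# Toy illustration (hypothetical, not from the post): optimise a proxy value
# function displaced from an "ideal" one, and see how the ideal score behaves
# as optimisation pressure (number of candidates considered) grows.
import random

random.seed(0)

def ideal_score(x):
    # Stand-in for "ideal values": prefers x close to 1.0 on every dimension.
    return -sum((xi - 1.0) ** 2 for xi in x)

def proxy_score(x, error):
    # A learned/targeted value function displaced from the ideal by `error`.
    return -sum((xi - (1.0 + error)) ** 2 for xi in x)

def optimise(score, pressure, dim=5):
    # Crude "optimisation pressure": best-of-n random search; larger n = more pressure.
    candidates = [[random.uniform(-2, 4) for _ in range(dim)] for _ in range(pressure)]
    return max(candidates, key=score)

for error in (0.1, 1.0):               # small vs. large miss of the ideal target
    for pressure in (10, 1_000, 100_000):
        best = optimise(lambda x: proxy_score(x, error), pressure)
        print(f"error={error:>4} pressure={pressure:>7} ideal_score={ideal_score(best):8.3f}")
```

In this toy setup the nearby proxy keeps improving on the ideal score as pressure increases (it sits inside something like a basin), while the distant proxy plateaus at a much worse value no matter how hard it is optimised. Whether learned approximations of human values land in the first regime or the second is one way to read the robustness question.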
Questions
A. What's the best/most compelling evidence/arguments in favour of robust values?
B. What's the best/most compelling evidence/arguments against robust values?
C. To what degree do you think values are robust?
I am explicitly soliciting opinions, so do please answer even if you do not believe your opinion to be particularly informed.
[1] Using the shard theory conception of "value" as "contextual influence on decision making".
[2] To be clear, "example operationalisation" in this document does not refer to any kind of canonical formalisation. The example operationalisations aren't even necessarily correct/accurate/sensible. They are simply meant to gesture in the right direction for what those terms might actually cash out to.
[3] "Benevolent": roughly the subset of human values that we are happy for arbitrarily capable systems to optimise for. "Universal": roughly the subset of human values that we are happy for other humans to optimise for.
[4] Including "astronomical waste" as an existential catastrophe.
[5] The other approach being to safeguard systems that may not necessarily be optimising for values that we'd be "happy" for them to pursue, were we fully informed. Examples of safeguarding approaches: corrigibility, impact regularisation, myopia, non-agentic system design, quantilisation, etc.