Utility functions (preferential or ethical, e.g. social welfare functions) can have lexicality, so that a difference in category A can be larger than the maximum difference in category B, but we can still make probabilistic tradeoffs between them. This can be done, for example, by having separate utility functions, fA:X→R and fB:X→R for A and B, respectively, such that

1. fA(x)−fA(y)≥1 for all x satisfying the condition P(x) and all y satisfying Q(y) (e.g. Q(y) can be the negation of P(y), although this would normally lead to discontinuity).

2. fB is bounded, with range in the interval [0,1] (or in any interval of length at most 1).

Then we can define our utility function as the sum f=fA+fB, so

$$f(x) = f_A(x) + f_B(x).$$

This ensures that all outcomes with P(x) are at least as good as all outcomes with Q(x), without being Pascalian/fanatical to maximize fA regardless of what happens to fB. Note, however, that fB may be increasingly difficult to change as the number of moral patients increases, so we may approximate Pascalian fanaticism in this limit, anyway.
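As a minimal sketch of this construction (the function names and numbers below are my own illustrative assumptions, not part of the definition), take fA to be 0 on P-outcomes and −1 on Q-outcomes, with fB anywhere in [0,1]:

```python
def lexical_utility(f_a, f_b):
    """Combined utility f = f_A + f_B, with f_B assumed to lie in [0, 1]."""
    if not 0.0 <= f_b <= 1.0:
        raise ValueError("f_B must lie in [0, 1]")
    return f_a + f_b


def expected_utility(lottery):
    """Expected value of f over a list of (probability, f_A, f_B) triples."""
    return sum(p * lexical_utility(f_a, f_b) for p, f_a, f_b in lottery)


# Illustrative assumption: f_A = 0 on P-outcomes and f_A = -1 on Q-outcomes.
worst_p = lexical_utility(0.0, 0.0)   # worst possible P-outcome
best_q = lexical_utility(-1.0, 1.0)   # best possible Q-outcome

# A sure P-outcome with low f_B versus a 1% risk of Q with high f_B.
safe = [(1.0, 0.0, 0.2)]
risky = [(0.99, 0.0, 0.9), (0.01, -1.0, 0.9)]
eu_safe = expected_utility(safe)
eu_risky = expected_utility(risky)
```

Here worst_p is at least best_q (they tie), so no Q-outcome beats a P-outcome, and yet eu_risky exceeds eu_safe: a small enough risk of Q is accepted for a large enough gain in fB.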

For example, fA(x)≤−1 if there is any suffering in x that meets a certain threshold of intensity, Q(x), and fA(x)=0 if there is no suffering at all in x, P(x). f can still be continuous this way.

If the probability that this threshold is met is p, with 0≤p<1, and the expected value of fA conditional on this is bounded below by −L, with L>0, regardless of p for the choices available to you, then increasing fB by at least pL, which can be small, is at least as good as trying to reduce p.
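A short derivation of this claim, as a sketch under one simplifying assumption of mine that is not stated above, namely that fA=0 whenever the threshold is not met. Writing E for the event that the threshold is met,

$$\mathbb{E}[f] = p\,\mathbb{E}[f_A \mid E] + \mathbb{E}[f_B] \ge -pL + \mathbb{E}[f_B].$$

Eliminating the risk entirely (taking p to 0) while leaving fB unchanged therefore raises the expected value by $-p\,\mathbb{E}[f_A \mid E] \le pL$, so an option that instead raises $\mathbb{E}[f_B]$ by at least pL does at least as well.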

As another example, an AI could be incentivized to ensure it gets monitored by law enforcement. Its reward function could look like

$$f(x) = \sum_{i=1}^{\infty} I_{M_i}(x) + f_B(x)$$

where $I_{M_i}(x)$ is 1 if the AI is monitored by law enforcement and passes some test (or did nothing?) in period $i$, and 0 otherwise. You could put an upper bound on the number of periods or use discounting to ensure the sum can’t evaluate to infinity, since that would allow $f_B$ to be ignored (maybe the AI will predict its expected lifetime to be infinite). Discounting, however, would eventually allow $f_B$ to overcome the $I_{M_i}$, unless you also discount the future in $f_B$.
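To make the discounting issue concrete, here is a rough sketch with a finite horizon. The function name, the 0-based period indexing (so the first period has weight 1) and the numbers are my own assumptions for illustration:

```python
def monitoring_reward(passed, f_b, discount=1.0):
    """Sum of (optionally discounted) monitoring indicators plus f_B.

    passed: one boolean per period, True if the AI was monitored and
            passed the test in that period (finite horizon assumed).
    f_b:    value of f_B, assumed to lie in [0, 1].
    """
    if not 0.0 <= f_b <= 1.0:
        raise ValueError("f_B must lie in [0, 1]")
    monitored = sum(discount ** i for i, ok in enumerate(passed) if ok)
    return monitored + f_b


# Undiscounted: failing one test costs 1, which f_B can at best tie.
all_passed = monitoring_reward([True, True, True], f_b=0.0)
one_failed = monitoring_reward([True, True, False], f_b=1.0)

# Discounted at 0.5: the third indicator is only worth 0.25, so a large
# enough gain in f_B now outweighs failing that test.
all_passed_disc = monitoring_reward([True, True, True], f_b=0.0, discount=0.5)
one_failed_disc = monitoring_reward([True, True, False], f_b=1.0, discount=0.5)
```

Without discounting, all_passed is at least one_failed, so passing every test is never worse. With discounting, one_failed_disc exceeds all_passed_disc, which is the sense in which fB can eventually overcome the later indicators.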

This should also allow us to modify the utility function fB, if preventing the modification would cause a test to be failed.

Furthermore, satisfying the $I_{M_i}(x)$ strongly lexically dominates increasing $f_B(x)$, but we can still make expected tradeoffs between them.

The problem then reduces to designing the AI in such a way that it can’t cheat on the test, which might be something we can hard-code into it (e.g. its internal states and outputs are automatically sent to law enforcement), and so could be easier than getting fB right.

This overall approach can be repeated for any finite number of functions, $f_1, f_2, \ldots, f_n$. Recursively, starting from (say) $g_1 = f_1$, you could define

$$g_{n+1}(x) = \sigma(g_n(x)) + f_{n+1}(x)$$

for σ:R→R increasing and bounded with range in an interval of length at most 1, e.g. some sigmoid function. In this way, each fk dominates the previous ones, as above.
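A small sketch of the recursion, assuming the base case g_1 = f_1, a logistic σ, and higher-priority functions that are integer-valued (so that any difference in them is at least 1). All of these are my choices for illustration:

```python
import math


def sigma(z):
    """Logistic squashing function: increasing, with range inside [0, 1].

    Written with tanh so that large negative inputs do not overflow.
    """
    return 0.5 * (1.0 + math.tanh(z / 2.0))


def nested_utility(values):
    """Apply g_{k+1} = sigma(g_k) + f_{k+1} to [f_1(x), ..., f_n(x)].

    values is ordered from lowest to highest priority; g_1 = f_1 is an
    assumed base case.
    """
    g = values[0]
    for f_next in values[1:]:
        g = sigma(g) + f_next
    return g


# The highest-priority function f_3 decides, whatever f_1 and f_2 are.
a = nested_utility([100.0, 0, 1])
b = nested_utility([-100.0, 5, 0])

# When f_3 ties, f_2 decides, even against a much larger f_1.
c = nested_utility([0.0, 2, 1])
d = nested_utility([50.0, 1, 1])
```

Here a exceeds b and c exceeds d. Note that in floating point σ saturates for large inputs, so very large differences in a low-priority function can become invisible.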

To adapt to a more deontological approach (not rule violation minimization, but according to which you should not break a rule in order to avoid violating a rule later), you could use geometric discounting, and your (moral) utility function could look like:

$$f(x) = -\sum_{i=0}^{\infty} r^i I(x_i),$$

where

1. $x$ is the act and its consequences without uncertainty, and you maximize the expected value of $f$ over uncertainty in $x$,

2. $x$ is broken into infinitely many disjoint intervals $x_i$, with $x_i$ coming just before $x_{i+1}$ temporally (and these intervals are chosen to have the same time endpoints for each possible $x$),

3. $I(x_i)=1$ if a rule is broken in $x_i$, and 0 otherwise, and

4. $r$ is a constant, $0<r\le 0.5$.

So, the idea is that $f(x)>f(y)$ whenever the earliest rule violation in $x$ happens later than the earliest one in $y$ (at the level of precision determined by how the intervals are broken up). The value of $r\le 0.5$ ensures this, since then $r^i \ge \sum_{j>i} r^j$. (Well, there are some rare exceptions if $r=0.5$, where the two sides can be equal.) You essentially count rule violations and minimize the number of them, but you use geometric discounting based on when the rule violation happens in such a way as to ensure that it’s always worse to break a rule earlier than to break any number of rules later.
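A quick sketch of this discounted count, truncating to finitely many intervals and assuming no violations afterwards (the truncation, the function name and r = 0.4 are my assumptions):

```python
def deontic_utility(violations, r=0.4):
    """Return -sum_i r**i * I(x_i) over a finite list of intervals.

    violations: one boolean per interval, True if a rule is broken in it;
                all later intervals are assumed to be violation-free.
    r:          discount constant, required to satisfy 0 < r <= 0.5.
    """
    if not 0.0 < r <= 0.5:
        raise ValueError("r must satisfy 0 < r <= 0.5")
    return -sum(r ** i for i, broken in enumerate(violations) if broken)


# One violation right away versus three violations, all later.
early = deontic_utility([True, False, False, False])
late = deontic_utility([False, True, True, True])

# Same earliest violation: the extra later violation still counts against.
once = deontic_utility([True, False])
twice = deontic_utility([True, True])
```

With these inputs late exceeds early, so breaking one rule now is worse than breaking three later, and once exceeds twice, so later violations still matter when the earliest ones coincide.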

However, breaking x up into intervals this way probably sucks for a lot of reasons, and I doubt it would lead to prescriptions people with deontological views endorse when they maximize expected values.

This approach basically took for granted that a rule is broken not when I act, but when a particular consequence occurs.

If, on the other hand, a rule is broken at the time I act, maybe I need to use some functions $I_i(x)$ instead of the $I(x_i)$, because whether or not I act now (in time interval $i$) and break a rule depends on what happens in the future. This way, however, $I_i(x)$ could basically always be 1, so I don’t think this approach works.

This nesting approach with σ above also allows us to “fix” maximin/leximin under conditions of uncertainty to avoid Pascalian fanaticism, given a finite discretization of welfare levels or finite number of lexical thresholds. Let the welfare levels be $t_0 > t_1 > \cdots > t_n$, and define:

$$f_k(x) = -\sum_i I(u_i \le t_k),$$

i.e. $-f_k(x)$ is the number of individuals with welfare level at most $t_k$, where $u_i$ is the welfare of individual $i$, and $I(u_i \le t_k)$ is 1 if $u_i \le t_k$ and 0 otherwise. Alternatively, we could use $I(t_{k+1} < u_i \le t_k)$.

In situations without uncertainty, this requires us to first choose among options that minimize the number of individuals with welfare at most $t_n$, because $f_n$ takes priority over $f_k$ for all $k<n$; then, having done that, to choose among those that minimize the number of individuals with welfare at most $t_{n-1}$, since $f_{n-1}$ takes priority over $f_k$ for all $k<n-1$; then among those that minimize the number of individuals with welfare at most $t_{n-2}$; and so on, until $t_0$.
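Here is a rough sketch of this threshold-counting version of leximin, nested with σ as above. The logistic σ, the base case (start from f_0), the three welfare levels and the populations are all my own illustrative assumptions:

```python
import math


def sigma(z):
    """Logistic squashing function with range inside [0, 1]."""
    return 0.5 * (1.0 + math.tanh(z / 2.0))


def threshold_counts(welfare, thresholds):
    """f_k = -(number of individuals with u_i <= t_k), for t_0 > ... > t_n."""
    return [-sum(1 for u in welfare if u <= t) for t in thresholds]


def nested_leximin(welfare, thresholds):
    """Nest f_0, ..., f_n so that f_n (the worst-off level) has top priority."""
    fs = threshold_counts(welfare, thresholds)
    g = fs[0]
    for f_next in fs[1:]:
        g = sigma(g) + f_next
    return g


levels = [2, 1, 0]                                   # t_0 > t_1 > t_2
one_at_bottom = nested_leximin([0, 3, 3], levels)    # one person at the worst level
all_low = nested_leximin([1, 1, 1], levels)          # nobody at the worst level
all_high = nested_leximin([3, 3, 3], levels)         # nobody at or below any level

# Under uncertainty: a 0.1% chance of one person at the worst level
# (everyone well off otherwise) versus everyone at level 1 for sure.
gamble = 0.999 * all_high + 0.001 * one_at_bottom
```

Without uncertainty, all_low beats one_at_bottom, as leximin would say. Under uncertainty, gamble beats all_low, so a tiny chance of the worst level does not override everything else.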

This particular social welfare function assigns negative value to new existences when there are no impacts on others, which leximin/maximin need not do in general, although they typically do in practice, anyway.

This approach does not require welfare to be cardinal, i.e. adding and dividing welfare levels need not be defined. It also dodges representation theorems like this one (or the stronger one in Lemma 1 here, see the discussion here), because continuity is not satisfied (and welfare need not have any topological structure at all, let alone be real-valued). Yet, it still satisfies anonymity/symmetry/impartiality, monotonicity/Pareto, and separability/independence. Separability means that whether one outcome is better or worse than another does not depend on individuals unaffected by the choice between the two.

Here’s a way to capture lexical threshold utilitarianism with a separable theory while avoiding Pascalian fanaticism, with a negative threshold $t_- < 0$ and a positive threshold $t_+ > 0$:

$$\sigma\Big(\sum_i u_i\Big) + \sum_i I(u_i \ge t_+) - \sum_i I(u_i \le t_-)$$

The first term is just standard utilitarianism, but squashed with a function σ:R→R into an interval of length at most 1.

The second/middle sum is the number of individuals (or experiences or person-moments) with welfare at least $t_+$, which we add to the first term. Any change in number past this threshold dominates the first term.

The third/last sum is the number of individuals with welfare at most $t_-$, which we subtract from the rest. Any change in number past this threshold dominates the first term.

Either the second or the third term can be omitted.

We could require $t_- \le u_i \le t_+$ for all $i$, although this isn’t necessary.
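A minimal sketch of this function, assuming a logistic σ and thresholds of −10 and +10 (both are my choices for illustration only):

```python
import math


def sigma(z):
    """Logistic squashing function with range inside [0, 1]."""
    return 0.5 * (1.0 + math.tanh(z / 2.0))


def lexical_threshold_utility(welfare, t_minus=-10.0, t_plus=10.0):
    """sigma(sum_i u_i) + #{i: u_i >= t_plus} - #{i: u_i <= t_minus}."""
    squashed_total = sigma(sum(welfare))
    above = sum(1 for u in welfare if u >= t_plus)
    below = sum(1 for u in welfare if u <= t_minus)
    return squashed_total + above - below


# A thousand mild harms versus a single harm at the negative threshold.
many_mild = lexical_threshold_utility([-1.0] * 1000)
one_extreme = lexical_threshold_utility([-10.0])

# Adding one individual at the threshold outweighs any gain in the total.
without_victim = lexical_threshold_utility([5.0] * 100)
with_victim = lexical_threshold_utility([5.0] * 100 + [-10.0])
```

Here many_mild exceeds one_extreme even though the total welfare is far lower, and without_victim exceeds with_victim. In floating point σ saturates once the total is large in magnitude, which mirrors the earlier remark that the bounded term becomes hard to change as the number of moral patients grows.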

More thresholds could be used, as in this comment: we would apply σ to the whole expression above, and then add new terms like the second and/or the third, with thresholds $t_{++} > t_+$ and $t_{--} < t_-$, and repeat as necessary.
