Your post argues for a strong conclusion:
To spot the error in these arguments, we only have to look up what cited ‘coherence theorems’ actually say. And yet the error seems to have gone uncorrected for more than a decade.
There are no coherence theorems. Authors in the AI safety community should stop suggesting that there are.
There are money-pump arguments, but the conclusions of these arguments are not theorems. The arguments depend on substantive and doubtful assumptions.
As I understand it, you propose two main arguments for the conclusion:
(1) There are only arguments about money-pumps / dominated strategies, not theorems.
(2) The Completeness axiom is suspicious.
I think (1) is straightforwardly wrong / conceptually confused. I agree with skepticism on the basis of (2), but people have already noticed this and discussed it (though phrased differently).
(Note the post makes other smaller claims that I either disagree with or think are misleading—don’t assume that if I don’t talk about some claim that means I think it’s correct.)
For the first argument, that there are no “coherence theorems”:
For now, the important thing to note is that the conclusions of money-pump arguments are not theorems. Theorems (like the VNM Theorem) can be proved without making any substantive assumptions. Money-pump arguments establish their conclusion only by making substantive assumptions: assumptions that might well be false. [...]
That is definitely not the difference between theorems and arguments. Theorems are typically of the form “Suppose X, then Y”; what is X if not an assumption?
For example, in (one direction of) the VNM theorem, the assumption is that the preferences satisfy transitivity, completeness, independence, and continuity, and the conclusion is that the preferences can be represented with a utility function.
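Written out in the “Suppose X, then Y” shape, that direction of the VNM theorem reads roughly as follows. This is the standard textbook form. Here $\succsim$ is a preference relation over lotteries $L, M$ on a finite set of outcomes $O$, and $L(o)$ is the probability that $L$ assigns to outcome $o$. The notation is ours, not the thread's.

$$
\underbrace{\succsim\ \text{is complete, transitive, continuous, and satisfies independence}}_{X}
\;\Longrightarrow\;
\underbrace{\exists\, u : O \to \mathbb{R}\ \ \forall L, M:\ \ L \succsim M \iff \sum_{o \in O} L(o)\,u(o) \ \ge\ \sum_{o \in O} M(o)\,u(o)}_{Y}
$$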
The difference between theorems and arguments is that in theorems you are limited to a particular set of formal inference rules in moving from premises to conclusions, whereas in arguments there is a much more expansive and informal set of inference rules. (Though in practice people use informal arguments in proving theorems with the implicit promise that they could be rewritten with the formal inference rules with more effort.)
In any case, if you really want to see one, here’s a fairly boring money-pump theorem / coherence theorem:
Theorem. Suppose there is a set of possible worlds $W = \{w_1, \ldots, w_N\}$, and an agent $A : W \times W \to \mathbb{R}$. Given a current world $w_i$ and a proposed new world $w_j$, $A$ specifies how much money the agent would pay to switch from $w_i$ to $w_j$. Suppose further that $A$ cannot be money-pumped. That is, there is no sequence of worlds $w_1, w_2, \ldots, w_k$ such that (1) $w_1 = w_k$ and (2) $\sum_{i=1}^{k-1} A(w_i, w_{i+1}) > 0$. Then $A$ must be transitive in the following sense: for any $w_1, w_2, w_3$, if $A(w_1, w_2) > 0$ and $A(w_2, w_3) > 0$, then $A(w_3, w_1) < 0$.
Proof. Suppose $A$ is not transitive, so there exist some $w_1, w_2, w_3$ where $A(w_1, w_2) > 0$, $A(w_2, w_3) > 0$, and $A(w_3, w_1) \geq 0$. But then $A(w_1, w_2) + A(w_2, w_3) + A(w_3, w_1) > 0$, so the sequence $[w_1, w_2, w_3, w_1]$ is a money pump, leading to a contradiction.
This theorem bakes in some assumptions that you might find problematic. One is completeness, which is implicitly present in the type signature of $A$. Another is “no money pumps”, which you might object to because there’s no one to actually run the money pump on the agent. A third is the lack of time-dependence of the agent, again implicitly present in the type signature of $A$.
But I think this is clearly a theorem that is coming to a substantive conclusion about an agent based on “no dominated strategies” / “no money pumps”, so I don’t think you can really say that “coherence theorems don’t exist”.
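The toy theorem is small enough to check mechanically. The sketch below is our own construction, and the names `has_money_pump`, `is_transitive`, `coherent`, and `cyclic` are hypothetical. It treats $A$ as edge weights on a complete directed graph over $W$. In that graph a money pump is exactly a closed walk with positive total payment. The sketch finds such a walk with Bellman-Ford on the negated weights and then tests the theorem's conclusion directly.

The `Payment` type alias shows two of the baked-in assumptions mentioned above:

- It is a total function, so it prices every ordered pair (completeness).
- It takes no time or history argument (time-independence).

```python
from itertools import product
from typing import Callable, Hashable, Sequence

World = Hashable
# A(current, proposed) = money paid to switch. As a total function, this type
# prices every ordered pair (completeness); with no time/history argument, the
# agent is time-independent. Payments are ints here: exact arithmetic avoids
# rounding errors on cycles whose payments sum to exactly zero.
Payment = Callable[[World, World], int]


def has_money_pump(worlds: Sequence[World], A: Payment) -> bool:
    """True iff some sequence w1, ..., wk with w1 == wk has total payment > 0.

    A positive-payment cycle is a negative cycle under edge weights -A(u, v),
    found with Bellman-Ford from an implicit source tied to every world by a
    zero-weight edge (so all distances start at 0). Self-loops are included:
    A(w, w) > 0 already gives the money pump [w, w].
    """
    if not worlds:
        return False
    dist = {w: 0 for w in worlds}
    for _ in range(len(worlds)):
        updated = False
        for u, v in product(worlds, repeat=2):
            if dist[u] - A(u, v) < dist[v]:
                dist[v] = dist[u] - A(u, v)
                updated = True
        if not updated:  # distances settled, so no negative cycle exists
            return False
    return True  # still relaxing after |W| rounds: a negative cycle exists


def is_transitive(worlds: Sequence[World], A: Payment) -> bool:
    """The theorem's conclusion: A(w1, w2) > 0 and A(w2, w3) > 0 imply A(w3, w1) < 0."""
    return all(
        A(w3, w1) < 0
        for w1, w2, w3 in product(worlds, repeat=3)
        if A(w1, w2) > 0 and A(w2, w3) > 0
    )


# Usage on two hypothetical agents over three worlds.
worlds = ["a", "b", "c"]
utility = {"a": 0, "b": 1, "c": 2}


def coherent(x: World, y: World) -> int:
    return utility[y] - utility[x]  # pays exactly the utility gain


def cyclic(x: World, y: World) -> int:
    return 1 if (x, y) in {("a", "b"), ("b", "c"), ("c", "a")} else -5


print(has_money_pump(worlds, coherent), is_transitive(worlds, coherent))  # False True
print(has_money_pump(worlds, cyclic), is_transitive(worlds, cyclic))      # True False
```

The `coherent` agent pays exactly its utility difference, so every cycle nets zero and it cannot be pumped. The `cyclic` agent pays for each step of a→b→c→a and is pumped, as the theorem predicts.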
For the second argument (that the completeness axiom is suspicious): I think this is basically expressing the same sort of objection that I express here, particularly the section “There are no coherence arguments that say you must have preferences”. I didn’t tie it to the Completeness axiom because I think it’s a mistake to get bogged down in the details of the specific assumptions present in theorems when you can make the same point in English, but it is the same conceptual point, as far as I can tell.
For what it’s worth, my position here is “you can’t argue for AI risk solely via coherence theorems; you also have to argue for why the AI will be goal-directed in the first place, but there are plausible arguments for that conclusion (which are not based on coherence arguments)”.
Theorems are typically of the form “Suppose X, then Y”; what is X if not an assumption?
X is an antecedent.
Consider an example. Imagine I claim:
Suppose James is a bachelor. Then James is unmarried.
In making this claim, I am not assuming that James is a bachelor. My claim is true whether or not James is a bachelor.
I might temporarily assume that James is a bachelor, and then use that assumption to prove that James is unmarried. But when I conclude ‘Suppose James is a bachelor. Then James is unmarried’, I discharge that initial assumption. My conclusion no longer depends on it. Any conclusion which can be proved with no undischarged assumptions is a theorem.
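The point about discharging can be made concrete in a proof assistant. Below is a minimal Lean 4 sketch. The names `Bachelor`, `Unmarried`, `Man`, and `bachelor_unmarried` are hypothetical, and defining a bachelor as an unmarried man is purely illustrative. Here `intro h` temporarily assumes that James is a bachelor. Once the proof is closed, that assumption has been discharged into the implication. The finished theorem states “bachelor → unmarried”; it does not assume that James is a bachelor.

```lean
-- Hypothetical toy definition, for illustration only: a bachelor is an unmarried man.
def Bachelor {Person : Type} (Unmarried Man : Person → Prop) (p : Person) : Prop :=
  Unmarried p ∧ Man p

-- "Suppose James is a bachelor. Then James is unmarried."
theorem bachelor_unmarried {Person : Type} (Unmarried Man : Person → Prop)
    (james : Person) : Bachelor Unmarried Man james → Unmarried james := by
  intro h                                      -- temporarily assume: James is a bachelor
  have h' : Unmarried james ∧ Man james := h   -- unfold the definition
  exact h'.1                                   -- conclude: James is unmarried
  -- `intro` has discharged `h`: the result is an implication, not an assumption

-- Lists any axioms the proof relies on; for this proof we expect none.
#print axioms bachelor_unmarried
```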
Theorem. Suppose there is a set of possible worlds $W = \{w_1, \ldots, w_N\}$, and an agent $A : W \times W \to \mathbb{R}$. Given a current world $w_i$ and a proposed new world $w_j$, $A$ specifies how much money the agent would pay to switch from $w_i$ to $w_j$. Suppose further that $A$ cannot be money-pumped. That is, there is no sequence of worlds $w_1, w_2, \ldots, w_k$ such that (1) $w_1 = w_k$ and (2) $\sum_{i=1}^{k-1} A(w_i, w_{i+1}) > 0$. Then $A$ must be transitive in the following sense: for any $w_1, w_2, w_3$, if $A(w_1, w_2) > 0$ and $A(w_2, w_3) > 0$, then $A(w_3, w_1) < 0$.
Proof. Suppose $A$ is not transitive, so there exist some $w_1, w_2, w_3$ where $A(w_1, w_2) > 0$, $A(w_2, w_3) > 0$, and $A(w_3, w_1) \geq 0$. But then $A(w_1, w_2) + A(w_2, w_3) + A(w_3, w_1) > 0$, so the sequence $[w_1, w_2, w_3, w_1]$ is a money pump, leading to a contradiction.
I agree that this is a theorem. But it’s not a ‘coherence theorem’ (at least not in the way that I’ve used the term in this post, and not in the way that previous authors seem to have used the term [see the Appendix]): it doesn’t state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy. It states only that, unless an agent’s preferences are acyclic, that agent is liable to pursue strategies that are dominated by some other available strategy.
You can call it a ‘coherence theorem’. Then it would be true that coherence theorems exist. But the important point remains: Premise 1 of the coherence argument is false. There are no theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy. VNM doesn’t say that, Savage doesn’t say that, Bolker-Jeffrey doesn’t say that, Dutch Books don’t say that, Cox doesn’t say that, Complete Class doesn’t say that.
For the second argument (that the completeness axiom is suspicious): I think this is basically expressing the same sort of objection that I express here, particularly the section “There are no coherence arguments that say you must have preferences”. I didn’t tie it to the Completeness axiom because I think it’s a mistake to get bogged down in the details of the specific assumptions present in theorems when you can make the same point in English, but it is the same conceptual point, as far as I can tell.
I agree. I think the points that you make in that post are good.
For what it’s worth, my position here is “you can’t argue for AI risk solely via coherence theorems; you also have to argue for why the AI will be goal-directed in the first place, but there are plausible arguments for that conclusion (which are not based on coherence arguments)”.
I agree with this too.
Thanks, I understand better what you’re trying to argue.
The part I hadn’t understood was that, according to your definition, a “coherence theorem” has to (a) only rely on antecedents of the form “no dominated strategies” and (b) conclude that the agent is representable by a utility function. I agree that on this definition there are no coherence theorems. I still think it’s not a great pedagogical or rhetorical move, because the definition is pretty weird.
I still disagree with your claim that people haven’t made this critique before.
From your discussion:
[The Complete Class Theorem] does refer to dominated strategies. However, the Complete Class Theorem starts off by assuming that the agent’s preferences over actions in sets of circumstances satisfy Completeness and Transitivity. If the agent’s preferences are not complete and transitive, the Complete Class Theorem does not apply. So, the Complete Class Theorem does not imply that agents must be representable as maximizing expected utility if they are to avoid pursuing dominated strategies.
So, you would agree that the following is an English description of a theorem:
If an agent has complete, transitive preferences, and it does not pursue dominated strategies, then it must be representable as maximizing expected utility.
The difference from your premise 1 is the part about the agent having complete, transitive preferences.
I feel pretty fine with justifying the transitive part via theorems basically like the one I gave above. You’d need to strengthen it a bit, but that seems very doable. You do require a money pump argument rather than a dominated strategy argument, because when you have intransitive preferences it’s not even clear what a “dominated strategy” would be.
If you buy that, then the only difference is the part about the agent having complete preferences. Which is exactly what has been critiqued previously. So I still think that it is basically incorrect to say:
And yet the error seems to have gone uncorrected for more than a decade.
So, you would agree that the following is an English description of a theorem:
If an agent has complete, transitive preferences, and it does not pursue dominated strategies, then it must be representable as maximizing expected utility.
Yep, I agree with that.
I feel pretty fine with justifying the transitive part via theorems basically like the one I gave above.
Note that your money-pump justifies acyclicity (The agent does not strictly prefer A to B, B to C, and C to A) rather than the version of transitivity necessary for the VNM and Complete Class theorems (If the agent weakly prefers A to B, and B to C, then the agent weakly prefers A to C). Gustafsson thinks you need Completeness to get a money-pump for this version of transitivity working (see footnote 8 on page 3), and I’m inclined to agree.
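The gap between the two properties shows up in a three-option toy relation of our own. It is hypothetical and is not taken from the thread or from Gustafsson. The agent is indifferent between A and B and between B and C, but strictly prefers C to A.

- The strict part of this relation contains the single pair (C, A), so it is trivially acyclic.
- Weak preference is still not transitive: A is weakly preferred to B and B to C, but A is not weakly preferred to C.

```python
from itertools import product

OPTIONS = ("A", "B", "C")

# Hypothetical weak preference: (x, y) in WEAK means "x is at least as good as y".
# A ~ B and B ~ C (indifference), yet C is strictly preferred to A.
WEAK = {(x, x) for x in OPTIONS} | {
    ("A", "B"), ("B", "A"), ("B", "C"), ("C", "B"), ("C", "A"),
}


def strict(weak):
    """Strict preference: x is weakly preferred to y but not vice versa."""
    return {(x, y) for (x, y) in weak if (y, x) not in weak}


def is_acyclic(rel, options):
    """True iff rel has no cycle x1 R x2 R ... R x1 (depth-first search)."""
    succ = {x: [y for (a, y) in rel if a == x] for x in options}
    state = dict.fromkeys(options, "new")  # new -> active -> done

    def visit(x):
        state[x] = "active"
        for y in succ[x]:
            # An "active" successor is still on the DFS stack, so it closes a cycle.
            if state[y] == "active" or (state[y] == "new" and not visit(y)):
                return False
        state[x] = "done"
        return True

    return all(state[x] != "new" or visit(x) for x in options)


def is_transitive(weak, options):
    """Weak x >= y and y >= z imply x >= z."""
    return all(
        (x, z) in weak
        for x, y, z in product(options, repeat=3)
        if (x, y) in weak and (y, z) in weak
    )


print(strict(WEAK))                       # {('C', 'A')}
print(is_acyclic(strict(WEAK), OPTIONS))  # True
print(is_transitive(WEAK, OPTIONS))       # False
```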
when you have intransitive preferences it’s not even clear what a “dominated strategy” would be.
A dominated strategy would be a strategy which leads you to choose an option that is worse in some respect than another available option and not better than that other available option in any respect. For example, making all the trades and getting A- in the decision-situation below would be a dominated strategy, since you could have made no trades and got A:
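This definition can be made operational. The decision situation the comment refers to is not reproduced here, so the sketch below assumes a set-up of our own:

- A and B are incomparable.
- B and A- are incomparable.
- A- is just A made slightly worse.

Scoring each option on two respects, with invented numbers, makes incomparability concrete: neither option is at least as good in every respect. The names `OPTIONS`, `STRATEGIES`, `dominates`, and `dominated_strategies` are hypothetical.

```python
# Hypothetical stand-in for the decision situation: each option is scored on two
# respects (career, money), higher is better; A- is A with one unit less money.
OPTIONS = {"A": (2, 10), "A-": (2, 9), "B": (1, 15)}

# Each strategy is identified by the option it ends up with.
STRATEGIES = {"make no trades": "A", "trade A for B only": "B", "make all trades": "A-"}


def dominates(y, x):
    """y is at least as good as x in every respect and better in at least one."""
    return all(b >= a for a, b in zip(x, y)) and any(b > a for a, b in zip(x, y))


def dominated_strategies(strategies, options):
    """Strategies whose outcome is dominated by another available strategy's outcome."""
    return sorted(
        name
        for name, end in strategies.items()
        if any(dominates(options[other], options[end]) for other in strategies.values())
    )


print(dominated_strategies(STRATEGIES, OPTIONS))  # ['make all trades']
```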
So I still think that it is basically incorrect to say:
And yet the error seems to have gone uncorrected for more than a decade.
The error is claiming that
There exist theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy.
I haven’t seen anyone point out that that claim is false.
That said, one could reason as follows:
Rohin, John, and others have argued that agents with incomplete preferences can act in accordance with policies that make them immune to pursuing dominated strategies.
Agents with incomplete preferences cannot be represented as maximizing expected utility.
So, if Rohin’s, John’s, and others’ arguments are sound, there cannot exist theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy.
Then one would have corrected the error. But since the availability of this kind of reasoning is easily missed, it seems worth correcting the error directly.
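The first premise can be illustrated with a minimal sketch. It is our own construction, not Rohin’s or John’s formalism. It scores options on two respects with invented numbers and offers the trades A→B and then B→A-.

- A naive agent only refuses options strictly worse than what it holds right now. It can be walked from A to B to A- and ends with an outcome dominated by A.
- A cautious agent follows a hypothetical policy: it also refuses options strictly worse than anything it has held before. It stops at B, which nothing dominates.

```python
# Hypothetical options scored on two respects (career, money); higher is better.
OPTIONS = {"A": (2, 10), "A-": (2, 9), "B": (1, 15)}
OFFERS = ["B", "A-"]  # hypothetical: first offered B for A, then A- for B


def dominates(y, x):
    """y is at least as good as x in every respect and better in at least one."""
    return all(b >= a for a, b in zip(x, y)) and any(b > a for a, b in zip(x, y))


def naive(history, offer):
    # Trade unless the offer is strictly worse than the option held right now.
    return not dominates(OPTIONS[history[-1]], OPTIONS[offer])


def cautious(history, offer):
    # Hypothetical policy: also refuse anything strictly worse than an option held earlier.
    return not any(dominates(OPTIONS[h], OPTIONS[offer]) for h in history)


def run(policy, start="A"):
    history = [start]  # every option held so far; the last one is held now
    for offer in OFFERS:
        if policy(history, offer):
            history.append(offer)
    return history[-1]


for policy in (naive, cautious):
    end = run(policy)
    dominated = any(dominates(OPTIONS[o], OPTIONS[end]) for o in OPTIONS)
    print(policy.__name__, end, dominated)
# naive A- True
# cautious B False
```

The cautious agent never states a preference between A and B, so its preferences stay incomplete. Even so, it does not end up with a dominated option in this particular trade sequence.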
Okay, it seems like we agree on the object-level facts, and what’s left is a disagreement about whether people have been making a major error. I’m less interested in that disagreement so probably won’t get into a detailed discussion, but I’ll briefly outline my position here.
The error is claiming that
There exist theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy.
I haven’t seen anyone point out that that claim is false.
The main way in which this claim is false (on your way of using words) is that it fails to note some of the antecedents in the theorem (completeness, maybe transitivity).
But I don’t think this is a reasonable way to use words, and I don’t think it’s reasonable to read the quotes in your appendix as claiming what you say they claim.
Converting math into English is a tricky business. Often a lot of the important “assumptions” in a theorem are baked into things like the type signature of a particular variable or the definitions of some key terms; in my toy theorem above I give two examples (completeness and lack of time-dependence). You are going to lose some information about what the theorem says when you convert it from math to English; an author’s job is to communicate the “important” parts of the theorem (e.g. the conclusion, any antecedents that the reader may not agree with, implications of the type signature that limit the applicability of the conclusion), which will depend on the audience.
As a result when you read an English description of a theorem, you should not expect it to state every antecedent. So it seems unreasonable to me to critique a claim in English about a theorem existing purely because it didn’t list all the antecedents.
I think it is reasonable to critique a claim in English about a theorem on the basis that it didn’t highlight an important antecedent that limits its applicability. If you said “AI alignment researchers should make sure to highlight the Completeness axiom when discussing coherence theorems” I’d be much more sympathetic (though personally my advice would be “AI alignment researchers should make sure to either argue for or highlight as an assumption the point that the AI is goal-directed / has preferences”).
Gustafsson thinks you need Completeness to get a money-pump for this version of transitivity working
Yup, good point, I think it doesn’t change the conclusion.
Often a lot of the important “assumptions” in a theorem are baked into things like the type signature of a particular variable or the definitions of some key terms; in my toy theorem above I give two examples (completeness and lack of time-dependence). You are going to lose some information about what the theorem says when you convert it from math to English; an author’s job is to communicate the “important” parts of the theorem (e.g. the conclusion, any antecedents that the reader may not agree with, implications of the type signature that limit the applicability of the conclusion), which will depend on the audience.
Yep, I agree with all of this.
Converting math into English is a tricky business.
Often, but not in this case. If authors understood the above points and meant to refer to the Complete Class Theorem, they need only have said:
If an agent has complete, transitive preferences, and it does not pursue dominated strategies, then it must be representable as maximizing expected utility.
(And they probably wouldn’t have mentioned Cox, Savage, etc.)
Yup, good point, I think it doesn’t change the conclusion.
I think it does. If the money-pump for transitivity needs Completeness, and Completeness is doubtful, then the money-pump for transitivity is doubtful too.
I think that’s right.
Upon rereading I realize I didn’t state this explicitly, but my conclusion was the following:
Transitivity depending on completeness doesn’t invalidate that conclusion.
Ah I see! Yep, agree with that.