Whereas others think self-consistency is more important.
The main argument against CDT (in my view) is that it tends to get you less utility (regardless of whether you add self-modification so it can switch to other decision theories). Self-consistency is a secondary issue.
It’s not clear to me that the justification for CDT is more circular than the justification for FDT. Doesn’t it come down to which principles you favor?
FDT gets you more utility than CDT. If you value literally anything in life more than you value “which ritual do I use to make my decisions?”, then you should go with FDT over CDT; that’s the core argument.
This argument for FDT would be question-begging if CDT proponents rejected utility as a desirable thing. But instead CDT proponents who are familiar with FDT agree utility is a positive, and either (a) they think there’s no meaningful sense in which FDT systematically gets more utility than CDT (which I think is adequately refuted by Abram Demski), or (b) they think that CDT has other advantages that outweigh the loss of utility (e.g., CDT feels more intuitive to them).
The latter argument for CDT isn’t circular, but as a fan of utility (i.e., of literally anything else in life), it seems very weak to me.
The main argument against CDT (in my view) is that it tends to get you less utility (regardless of whether you add self-modification so it can switch to other decision theories). Self-consistency is a secondary issue.
I do think the argument ultimately needs to come down to an intuition about self-effacingness.
The fact that agents earn less expected utility if they implement P_CDT than if they implement some other decision procedure seems to support the claim that agents should not implement P_CDT.
But there’s nothing logically inconsistent about believing both (a) that R_CDT is true and (b) that agents should not implement P_CDT. To again draw an analogy with a similar case, there’s also nothing logically inconsistent about believing both (a) that utilitarianism is true and (b) that agents should not in general make decisions by carrying out utilitarian reasoning.
So why shouldn’t I believe that R_CDT is true? The argument needs an additional step. And it seems to me like the most natural additional step here involves an intuition that the criterion of rightness would not be self-effacing.
More formally, it seems like the argument needs to be something along these lines:
1. Over their lifetimes, agents who implement P_CDT earn less expected utility than agents who implement certain other decision procedures.
2. (Assumption) Agents should implement whatever decision procedure will earn them the most expected lifetime utility.
3. Therefore, agents should not implement P_CDT.
4. (Assumption) The criterion of rightness is not self-effacing. Equivalently, if agents should not implement some decision procedure P_X, then it is not the case that R_X is true.
5. Therefore, as an implication of points (3) and (4), R_CDT is not true.
Whether you buy the “No Self-Effacement” assumption in Step 4, or alternatively the countervailing “Don’t Make Things Worse” assumption that supports R_CDT, seems to ultimately be a matter of intuition. At least, I don’t currently know what else people can appeal to here to resolve the disagreement.
[[SIDENOTE: Step 2 is actually a bit ambiguous, since it doesn’t specify how expected lifetime utility is being evaluated. For example, are we talking about expected lifetime utility from a causal or evidential perspective? But I don’t think this ambiguity matters much for the argument.]]
[[SECOND SIDENOTE: I’m using the phrase “self-effacing” rather than “self-contradictory” here, because I think it’s more standard and because “self-contradictory” seems to suggest logical inconsistency.]]
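To make Step 1 concrete, here is a minimal sketch of the standard Newcomb-style calculation that usually motivates it. All of the specific numbers here (a 99%-accurate predictor, a $1,000,000 box, a $1,000 box) are assumptions chosen purely for illustration:

```python
# Illustrative Newcomb-style payoff comparison (all numbers are assumptions).
# A "two-boxing" disposition stands in for what P_CDT outputs; a "one-boxing"
# disposition stands in for what a procedure like P_FDT outputs.

ACCURACY = 0.99        # assumed reliability of the predictor
BIG = 1_000_000        # contents of the opaque box if one-boxing is predicted
SMALL = 1_000          # contents of the transparent box

def expected_payoff(one_boxer: bool) -> float:
    """Expected payoff for an agent whose disposition the predictor reads."""
    if one_boxer:
        # The predictor usually foresees one-boxing and fills the opaque box.
        return ACCURACY * BIG + (1 - ACCURACY) * 0
    # The predictor usually foresees two-boxing and leaves the opaque box empty.
    return ACCURACY * SMALL + (1 - ACCURACY) * (BIG + SMALL)

print(expected_payoff(one_boxer=True))   # 990,000.0
print(expected_payoff(one_boxer=False))  # 11,000.0
```

Whatever the exact numbers, Step 1 only needs the qualitative point that the disposition the predictor rewards earns more in expectation than the disposition P_CDT embodies.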
But there’s nothing logically inconsistent about believing both (a) that R_CDT is true and (b) that agents should not implement P_CDT.
If the thing being argued for is “R_CDT plus P_SONOFCDT”, then that makes sense to me, but is vulnerable to all the arguments I’ve been making: Son-of-CDT is in a sense the worst of both worlds, since it gets less utility than FDT and lacks CDT’s “Don’t Make Things Worse” principle.
If the thing being argued for is “R_CDT plus P_FDT”, then I don’t understand the argument. In what sense is P_FDT compatible with, or conducive to, R_CDT? What advantage does this have over “R_FDT plus P_FDT”? (Indeed, what difference between the two views would be intended here?)
So why shouldn’t I believe that R_CDT is true? The argument needs an additional step. And it seems to me like the most natural additional step here involves an intuition that the criterion of rightness would not be self-effacing.
The argument against “R_CDT plus P_SONOFCDT” doesn’t require any mention of self-effacingness; it’s entirely sufficient to note that P_SONOFCDT gets less utility than P_FDT.
The argument against “R_CDT plus P_FDT” seems to demand some reference to self-effacingness or inconsistency, or triviality / lack of teeth. But I don’t understand what this view would mean or why anyone would endorse it (and I don’t take you to be endorsing it).
For example, are we talking about expected lifetime utility from a causal or evidential perspective? But I don’t think this ambiguity matters much for the argument.
We want to evaluate actual average utility rather than expected utility, since the different decision theories are different theories of what “expected utility” means.
Hm, I think I may have misinterpreted your previous comment as emphasizing the point that P_CDT “gets you less utility” rather than the point that P_SONOFCDT “gets you less utility.” So my comment was aiming to explain why I don’t think the fact that P_CDT gets less utility provides a strong challenge to the claim that R_CDT is true (unless we accept the “No Self-Effacement Principle”). But it sounds like you might agree that this fact doesn’t on its own provide a strong challenge.
If the thing being argued for is “R_CDT plus P_SONOFCDT”, then that makes sense to me, but is vulnerable to all the arguments I’ve been making: Son-of-CDT is in a sense the worst of both worlds, since it gets less utility than FDT and lacks CDT’s “Don’t Make Things Worse” principle.
In response to the first argument alluded to here: “Gets the most [expected] utility” is ambiguous, as I think we’ve both agreed.
My understanding is that P_SONOFCDT is definitionally the policy that, if an agent decided to adopt it, would cause the largest increase in expected utility. So—if we evaluate the expected utility of a decision to adopt a policy from a causal perspective—it seems to me that P_SONOFCDT “gets the most expected utility.”
If we evaluate the expected utility of a policy from an evidential or subjunctive perspective, however, then another policy may “get the most utility” (because policy adoption decisions may be non-causally correlated.)
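Here is a minimal sketch of that ambiguity, using an assumed Newcomb-like setup in which the prediction was already made before the policy-adoption decision. The 99% accuracy figure and the dollar amounts are again just illustrative assumptions:

```python
# Minimal sketch (numbers and setup assumed) of how a policy-adoption decision
# can "get the most expected utility" on one reading but not another.
# Assumed setup: Omega scanned you yesterday and already filled the boxes,
# based on a prediction of which policy you would end up adopting today.

ACCURACY = 0.99      # assumed reliability of yesterday's prediction
BIG, SMALL = 1_000_000, 1_000

def causal_adoption_value(adopt_one_boxing: bool, prob_big_box_full: float) -> float:
    # Causal reading: adopting a policy today cannot change what is already
    # in the boxes, so the probability that the big box is full is held fixed.
    if adopt_one_boxing:
        return prob_big_box_full * BIG
    return prob_big_box_full * BIG + SMALL   # two-boxing adds the small box

def evidential_adoption_value(adopt_one_boxing: bool) -> float:
    # Evidential/subjunctive reading: which policy you adopt is (non-causally)
    # correlated with what the earlier scan predicted.
    if adopt_one_boxing:
        return ACCURACY * BIG
    return (1 - ACCURACY) * BIG + SMALL

# Holding the box contents fixed, adopting the two-boxing policy always
# *causes* you to end up with more money...
for p in (0.0, 0.5, 1.0):
    assert causal_adoption_value(False, p) > causal_adoption_value(True, p)
# ...but on the evidential/subjunctive reading, adopting the one-boxing
# policy "gets the most expected utility".
assert evidential_adoption_value(True) > evidential_adoption_value(False)
```

On the causal reading, the adoption decision that “gets the most expected utility” holds the already-fixed prediction constant; on the evidential/subjunctive reading it does not, which is exactly where the two evaluations come apart.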
Apologies if I’m off-base, but it reads to me like you might be suggesting an argument along these lines:
1. R_CDT says that it is rational to decide to follow a policy that would not maximize “expected utility” (defined in evidential/subjunctive terms).
2. (Assumption) But it is not rational to decide to follow a policy that would not maximize “expected utility” (defined in evidential/subjunctive terms).
3. Therefore R_CDT is not true.
The natural response to this argument is that it’s not clear why we should accept the assumption in Step 2. R_CDT says that the rationality of a decision depends on its “expected utility” defined in causal terms. So someone starting from the position that R_CDT is true obviously won’t accept the assumption in Step 2. R_EDT and R_FDT say that the rationality of a decision depends on its “expected utility” defined in evidential or subjunctive terms. So we might allude to R_EDT or R_FDT to justify the assumption, but of course this would also mean arguing backwards from the conclusion that the argument is meant to reach.
Overall at least this particular simple argument—that R_CDT is false because P_SONOFCDT gets less “expected utility” as defined in evidential/quasi-evidential terms—would seemingly fail due to circularity. But you may have in mind a different argument.
We want to evaluate actual average utility rather than expected utility, since the different decision theories are different theories of what “expected utility” means.
I felt confused by this comment. Doesn’t even R_FDT judge the rationality of a decision by its expected value (rather than its actual value)? And presumably you don’t want to say that someone who accepts unpromising gambles and gets lucky (ending up with high actual average utility) has made more “rational” decisions than someone who accepts promising gambles and gets unlucky (ending up with low actual average utility)?
You also correctly point out that the decision procedure that R_CDT implies agents should rationally commit to—P_SONOFCDT—sometimes outputs decisions that definitely make things worse. So “Don’t Make Things Worse” implies that some of the decisions outputted by P_SONOFCDT are irrational.
But I still don’t see what the argument is here unless we’re assuming “No Self-Effacement.” It still seems to me like we have a few initial steps and then a missing piece.
1. (Observation) R_CDT implies that it is rational to commit to following the decision procedure P_SONOFCDT.
2. (Observation) P_SONOFCDT sometimes outputs decisions that definitely make things worse.
3. (Assumption) It is irrational to take decisions that definitely make things worse. In other words, the “Don’t Make Things Worse” Principle is true.
4. Therefore, as an implication of Step 2 and Step 3, P_SONOFCDT sometimes outputs irrational decisions.
5. ???
6. Therefore, R_CDT is false.
The “No Self-Effacement” Principle is equivalent to the principle that: If a criterion of rightness implies that it is rational to commit to a decision procedure, then that decision procedure only produces rational actions. So if we were to assume “No Self-Effacement” in Step 5 then this would allow us to arrive at the conclusion that R_CDT is false. But if we’re not assuming “No Self-Effacement,” then it’s not clear to me how we get there.
Actually, in the context of this particular argument, I suppose we don’t really have the option of assuming that “No Self-Effacement” is true—because this assumption would be inconsistent with the earlier assumption that “Don’t Make Things Worse” is true. So I’m not sure it’s actually possible to make this argument schema work in any case.
There may be a pretty different argument here, which you have in mind. I at least don’t see it yet though.
There may be a pretty different argument here, which you have in mind. I at least don’t see it yet though.
Perhaps the argument is something like:
“Don’t make things worse” (DMTW) is one of the intuitions that leads us to favor R_CDT
But the actual policy that R_CDT recommends does not in fact follow DMTW
So R_CDT only gets intuitive appeal from DMTW to the extent that DMTW was about R_’s, and not about P_’s
But intuitions are probably(?) not that precisely targeted, so R_CDT shouldn’t get to claim the full intuitive endorsement of DMTW. (Yes, DMTW endorses it more than it endorses R_FDT, but R_CDT is still at least somewhat counter-intuitive when judged against the DMTW intuition.)
So R_CDT only gets intuitive appeal from DMTW to the extent that DMTW was about R_’s, and not about P_’s
But intuitions are probably(?) not that precisely targeted, so R_CDT shouldn’t get to claim the full intuitive endorsement of DMTW. (Yes, DMTW endorses it more than it endorses R_FDT, but R_CDT is still at least somewhat counter-intuitive when judged against the DMTW intuition.)
Here are two logically inconsistent principles that could be true:
Don’t Make Things Worse: If a decision would definitely make things worse, then taking that decision is not rational.
Don’t Commit to a Policy That In the Future Will Sometimes Make Things Worse: It is not rational to commit to a policy that, in the future, will sometimes output decisions that definitely make things worse.
I have strong intuitions that the first one is true. I have much weaker (comparatively negligible) intuitions that the second one is true. Since they’re mutually inconsistent, I reject the second and accept the first. I imagine this is also true of most other people who are sympathetic to R_CDT.
One could argue that R_CDT sympathizers don’t actually have much stronger intuitions regarding the first principle than the second—i.e. that their intuitions aren’t actually very “targeted” on the first one—but I don’t think that would be right. At least, it’s not right in my case.
A more viable strategy might be to argue for something like a meta-principle:
The ‘Don’t Make Things Worse’ Meta-Principle: If you find “Don’t Make Things Worse” strongly intuitive, then you should also find “Don’t Commit to a Policy That In the Future Will Sometimes Make Things Worse” just about as intuitive.
If the meta-principle were true, then I guess this would sort of imply that people’s intuitions in favor of “Don’t Make Things Worse” should be self-neutralizing. They should come packaged with equally strong intuitions for another position that directly contradicts it.
But I don’t see why the meta-principle should be true. At least, my intuitions in favor of the meta-principle are way less strong than my intuitions in favor of “Don’t Make Things Worse” :)
Just to say slightly more on this, I think the Bomb case is again useful for illustrating my (I think not uncommon) intuitions here.
Bomb Case: Omega puts a million dollars in a transparent box if he predicts you’ll open it. He puts a bomb in the transparent box if he predicts you won’t open it. He’s only wrong about one in a trillion times.
Now suppose you enter the room and see that there’s a bomb in the box. You know that if you open the box, the bomb will explode and you will die a horrible and painful death. If you leave the room and don’t open the box, then nothing bad will happen to you. You’ll return to a grateful family and live a full and healthy life. You understand all this. You want so badly to live. You then decide to walk up to the bomb and blow yourself up.
Intuitively, this decision strikes me as deeply irrational. You’re intentionally taking an action that you know will cause a horrible outcome that you want badly to avoid. It feels very relevant that you’re flagrantly violating the “Don’t Make Things Worse” principle.
Now, let’s step back a time step. Suppose you know that you’re the sort of person who would refuse to kill yourself by detonating the bomb. You might decide that—since Omega is such an accurate predictor—it’s worth taking a pill to turn yourself into the sort of person who would go ahead and open the box, to increase your odds of getting a million dollars. You recognize that this may lead you, in the future, to take an action that makes things worse in a horrifying way. But you calculate that the decision you’re making now is nonetheless making things better in expectation.
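A back-of-the-envelope version of that calculation, with a wholly assumed dollar value standing in for how bad the horrible death is:

```python
# Rough expected-value comparison for the pill decision. The -$100,000,000
# figure for dying is an assumption made purely so the comparison can be
# run as arithmetic; nothing hinges on the exact number.
ERROR_RATE = 1e-12        # Omega is wrong about one in a trillion times
MILLION = 1_000_000
DEATH = -100_000_000

# If the pill makes you the sort of person who opens the box no matter what,
# Omega almost always predicts this and puts in the money; with probability
# ERROR_RATE he puts in the bomb and future-you opens it anyway.
ev_take_pill = (1 - ERROR_RATE) * MILLION + ERROR_RATE * DEATH

# If you stay the sort of person who refuses, Omega almost always predicts
# this and puts in the bomb, which you then leave alone, walking away with nothing.
ev_refuse_pill = 0

print(ev_take_pill)     # ≈ 1,000,000 (minus a negligible correction)
print(ev_refuse_pill)   # 0
```

On any remotely ordinary assignment of values, the commitment comes out ahead in expectation, even though it opens the door to the in-the-moment disaster described above.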
Taking the pill strikes me as pretty intuitively rational. You’re violating the second principle—the “Don’t Commit to a Policy...” Principle—but this violation just doesn’t seem that intuitively relevant or remarkable to me. I personally feel like there is nothing too odd about the idea that it can be rational to commit to violating principles of rationality in the future.
(This is obviously just a description of my own intuitions, as they stand, though.)
It feels very relevant that you’re flagrantly violating the “Don’t Make Things Worse” principle.
By triggering the bomb, you’re making things worse from your current perspective, but making things better from the perspective of earlier you. Doesn’t that seem strange and deserving of an explanation? The explanation from a UDT perspective is that by updating upon observing the bomb, you actually changed your utility function. You used to care about both the possible worlds where you end up seeing a bomb in the box and the worlds where you don’t. After updating, you think you’re either a simulation within Omega’s prediction (so your action has no effect on your real self) or in the world with a real bomb; you no longer care about the version of you in the world with a million dollars in the box, and this accounts for the conflict/inconsistency.
Given the human tendency to change our (UDT-)utility functions by updating, it’s not clear what to do (or what is right), and I think this reduces UDT’s intuitive appeal and makes it less of a slam-dunk over CDT/EDT. But it seems to me that it takes switching to the UDT perspective to even understand the nature of the problem. (Quite possibly this isn’t adequately explained in MIRI’s decision theory papers.)
Don’t Make Things Worse: If a decision would definitely make things worse, then taking that decision is not rational.
Don’t Commit to a Policy That In the Future Will Sometimes Make Things Worse: It is not rational to commit to a policy that, in the future, will sometimes output decisions that definitely make things worse.
...
One could argue that R_CDT sympathizers don’t actually have much stronger intuitions regarding the first principle than the second—i.e. that their intuitions aren’t actually very “targeted” on the first one—but I don’t think that would be right. At least, it’s not right in my case.
I would agree that, with these two principles as written, more people would agree with the first. (And certainly believe you that that’s right in your case.)
But I feel like the second doesn’t quite capture what I had in mind regarding the DMTW intuition applied to P_’s.
Consider an alternate version:
If a decision would definitely make things worse, then taking that decision is not good policy.
Or alternatively:
If a decision would definitely make things worse, a rational person would not take that decision.
It seems to me that these two claims are naively intuitive on their face, in roughly the same way that the “… then taking that decision is not rational” version is. And it’s only after you’ve considered prisoners’ dilemmas or Newcomb’s paradox, etc. that you realize that good policy (or being a rational agent) actually diverges from what’s rational in the moment.
(But maybe others would disagree on how intuitive these versions are.)
EDIT: And to spell out my argument a bit more: if several alternate formulations of a principle are each intuitively appealing, and it turns out that whether some claim (e.g. R_CDT is true) is consistent with the principle comes down to the precise formulation used, then it’s not quite fair to say that the principle fully endorses the claim and that the claim is not counter-intuitive from the perspective of the original intuition.
Of course, this argument is moot if it’s true that the original DMTW intuition was always about rational in-the-moment action, and never about policies or actors. And maybe that’s the case? But I think it’s a little more ambiguous with the “… is not good policy” or “a rational person would not...” versions than with the “Don’t commit to a policy...” version.
EDIT2: Does what I’m trying to say make sense? (I felt like I was struggling a bit to express myself in this comment.)
If the thing being argued for is “R_CDT plus P_SONOFCDT” … If the thing being argued for is “R_CDT plus P_FDT...
Just as a quick sidenote:
I’ve been thinking of P_SONOFCDT as, by definition, the decision procedure that R_CDT implies that it is rational to commit to implementing.
If we define P_SONOFCDT this way, then anyone who believes that R_CDT is true must also believe that it is rational to implement P_SONOFCDT.
The belief that R_CDT is true and the belief that it is rational to implement P_FDT would only then be consistent if P_SONOFCDT is equivalent to P_FDT (which of course it isn’t). So I would be inclined to say that no one should believe in both the correctness of R_CDT and the rationality of implementing P_FDT.
[[EDIT: Actually, I need to distinguish between the decision procedure that it would be rational to commit to yourself and the decision procedure that it would be rational to build into other agents. These can sometimes be different. For example, suppose that R_CDT is true and that you’re building twin AI systems and you would like them both to succeed. Then it would be rational for you to give them decision procedures that will cause them to cooperate if they face each other in a prisoner’s dilemma (e.g. some version of P_FDT). But if R_CDT is true and you’ve just been born into the world as one of the twins, it would be rational for you to commit to a decision procedure that would cause you to defect if you face the other AI system in a prisoner’s dilemma (i.e. P_SONOFCDT). I slightly edited the above comment to reflect this. My tentative view—which I’ve alluded to above—is that the various proposed criteria of rightness don’t in practice actually diverge all that much when it comes to the question of what sorts of decision procedures we should build into AI systems. Although I also understand that MIRI is not mainly interested in the question of what sorts of decision procedures we should build into AI systems.]]
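To sketch the asymmetry in this edit with a toy example (the prisoner’s-dilemma payoffs are assumed, and both twins are assumed to simply play whatever move their installed procedure outputs):

```python
# Toy illustration of the builder-vs-twin asymmetry (payoffs are assumptions).
# PAYOFF[(my_move, other_move)] is the payoff to "me".
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def builder_total(installed_move: str) -> int:
    # The builder chooses what to install in *both* twins and cares about both,
    # so the installation decision causally fixes both moves.
    return 2 * PAYOFF[(installed_move, installed_move)]

def twin_commitment_value(my_move: str, other_twin_move: str) -> int:
    # A twin that already exists can commit itself, but (on the causal reading)
    # that commitment has no influence on what the other, already-built twin does.
    return PAYOFF[(my_move, other_twin_move)]

# Installing cooperation in both twins causes the better joint outcome:
assert builder_total("C") > builder_total("D")                      # 6 > 2
# But holding the other twin's behaviour fixed, committing to defect
# causally dominates for the twin itself:
assert twin_commitment_value("D", "C") > twin_commitment_value("C", "C")
assert twin_commitment_value("D", "D") > twin_commitment_value("C", "D")
```

This is only meant to illustrate why the “what should we build into other agents?” question and the “what should I commit to myself?” question can pull apart under R_CDT; it is not a claim about how much they diverge in practice.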