“What makes a decision ‘good’ if the decision happens inside an AI?” and “What makes a decision ‘good’ if the decision happens inside a brain?” aren’t orthogonal questions, or even all that different; they’re two different ways of posing the same question.
I actually agree with you about this. I have in mind a different distinction, although I might not be explaining it well.
Here’s another go:
Let’s suppose that some decisions are rational and others aren’t. We can then ask: What is it that makes a decision rational? What are the necessary and/or sufficient conditions? I think that this is the question that philosophers are typically trying to answer. The phrase “decision theory” in this context typically refers to a claim about necessary and/or sufficient conditions for a decision being rational. To use different jargon, in this context a “decision theory” refers to a proposed “criterion of rightness.”
When philosophers talk about “CDT,” for example, they are typically talking about a proposed criterion of rightness. Specifically, in this context, “CDT” is the claim that a decision is rational only if taking it would cause the largest expected increase in value. To avoid any ambiguity, let’s label this claim R_CDT.
We can also talk about “decision procedures.” A decision procedure is just a process or algorithm that an agent follows when making decisions.
For each proposed criterion of rightness, it’s possible to define a decision procedure that only outputs decisions that fulfill the criterion. For example, we can define P_CDT as a decision procedure that involves only taking actions that R_CDT claims are rational.
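As a schematic illustration of this distinction (the function names and toy numbers here are mine, not from the literature), one can think of a criterion of rightness as a predicate on actions and a decision procedure as an algorithm that outputs actions:

```python
from typing import Dict

Action = str

def r_cdt(action: Action, causal_ev: Dict[Action, float]) -> bool:
    """R_CDT as a criterion of rightness: a *predicate* saying which
    actions count as rational (maximal expected causal utility),
    without saying how to find them."""
    return causal_ev[action] == max(causal_ev.values())

def p_cdt(causal_ev: Dict[Action, float]) -> Action:
    """P_CDT as a decision procedure: an *algorithm* that outputs an
    action satisfying R_CDT."""
    return max(causal_ev, key=causal_ev.get)

# Toy expected-utility assignments (illustrative numbers only).
ev = {"a": 1.0, "b": 2.5}
chosen = p_cdt(ev)
assert r_cdt(chosen, ev)  # the procedure's output satisfies the criterion
```

Other procedures (heuristics, lookup tables, self-modified rules) could also happen to satisfy the criterion on every input, which is why the two notions can come apart.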
My understanding is that when philosophers talk about “CDT,” they primarily have in mind R_CDT. Meanwhile, it seems like members of the rationalist or AI safety communities primarily have in mind P_CDT.
The difference matters, because people who believe that R_CDT is true don’t generally believe that we should build agents that implement P_CDT or that we should commit to following P_CDT ourselves. R_CDT claims that we should do whatever will have the best effects—and, in many cases, building agents that follow a decision procedure other than P_CDT is likely to have the best effects. More generally: Most proposed criteria of rightness imply that it can be rational to build agents that sometimes behave irrationally.
MIRI’s AI work is properly thought of as part of the “success-first decision theory” approach in academic decision theory.
One possible criterion of rightness, which I’ll call R_UDT, is something like this: An action is rational only if it would have been chosen by whatever decision procedure would have produced the most expected value if consistently followed over an agent’s lifetime. For example, this criterion of rightness says that it is rational to one-box in the transparent Newcomb scenario because agents who consistently follow one-boxing policies tend to do better over their lifetimes.
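As an illustrative sketch of this lifetime comparison (the predictor's accuracy and the payoffs are assumed numbers in the standard Newcomb setup, not from the discussion above):

```python
def policy_expected_value(one_box: bool, predictor_accuracy: float) -> float:
    """Expected payoff for an agent known to consistently follow the
    given policy: the predictor fills the opaque box with $1,000,000
    iff it predicts one-boxing; the transparent box holds $1,000."""
    p_predicted_one_box = predictor_accuracy if one_box else 1 - predictor_accuracy
    opaque = 1_000_000 * p_predicted_one_box
    transparent = 0 if one_box else 1_000
    return opaque + transparent

acc = 0.99  # assumed predictor accuracy
# Agents who consistently one-box do better over their lifetimes:
assert policy_expected_value(True, acc) > policy_expected_value(False, acc)
```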
I could be wrong, but I associate the “success-first approach” with something like the claim that R_UDT is true. This would definitely constitute a really interesting and significant divergence from mainstream opinion within academic decision theory. Academic decision theorists should care a lot about whether or not it’s true.
But I’m also not sure if it matters very much, practically, whether R_UDT or R_CDT is true. It’s not obvious to me that they recommend building different kinds of decision procedures into AI systems. For example, both seem to recommend building AI systems that would one-box in the transparent Newcomb scenario.
You can go with Paul and say that a lot of these distinctions are semantic rather than substantive—that there isn’t a true, ultimate, objective answer to the question of whether we should evaluate decision theories by whether they’re successful, vs. some other criterion.
I disagree that any of the distinctions here are purely semantic. But one could argue that normative anti-realism is true. In this case, there wouldn’t really be any such thing as the criterion of rightness for decisions. Neither R_CDT nor R_UDT nor any other proposed criterion would be “correct.”
In this case, though, I think there would be even less reason to engage with academic decision theory literature. The literature would be focused on a question that has no real answer.
[[EDIT: Note that Will also emphasizes the importance of the criterion-of-rightness vs. decision-procedure distinction in his critique of the FDT paper: “[T]hey’re [most often] asking what the best decision procedure is, rather than what the best criterion of rightness is… But, if that’s what’s going on, there are a whole bunch of issues to dissect. First, it means that FDT is not playing the same game as CDT or EDT, which are proposed as criteria of rightness, directly assessing acts. So it’s odd to have a whole paper comparing them side-by-side as if they are rivals.”]]
I agree that these three distinctions are important:
“Picking policies based on whether they satisfy a criterion X” vs. “Picking policies that happen to satisfy a criterion X”. (E.g., trying to pick a utilitarian policy vs. unintentionally behaving utilitarianly while trying to do something else.)
“Trying to follow a decision rule Y ‘directly’ or ‘on the object level’” vs. “Trying to follow a decision rule Y by following some other decision rule Z that you think satisfies Y”. (E.g., trying to naïvely follow utilitarianism without any assistance from sub-rules, heuristics, or self-modifications, vs. trying to follow utilitarianism by following other rules or mental habits you’ve come up with that you expected to make you better at selecting utilitarianism-endorsed actions.)
“A decision rule that prescribes outputting some action or policy and doesn’t care how you do it” vs. “A decision rule that prescribes following a particular set of cognitive steps that will then output some action or policy”. (E.g., a rule that says ‘maximize the aggregate welfare of moral patients’ vs. a specific mental algorithm intended to achieve that end.)
The first distinction above seems less relevant here, since we’re mostly discussing AI systems and humans that are self-aware about their decision criteria and explicitly “trying to do what’s right”.
As a side-note, I do want to emphasize that from the MIRI cluster’s perspective, it’s fine for correct reasoning in AGI to arise incidentally or implicitly, as long as it happens somehow (and as long as the system’s alignment-relevant properties aren’t obscured and the system ends up safe and reliable).
The main reason to work on decision theory in AI alignment has never been “What if people don’t make AI ‘decision-theoretic’ enough?” or “What if people mistakenly think CDT is correct and so build CDT into their AI system?” The main reason is that the many forms of weird, inconsistent, and poorly-generalizing behavior prescribed by CDT and EDT suggest that there are big holes in our current understanding of how decision-making works, holes deep enough that we’ve even been misunderstanding basic things at the level of “decision-theoretic criterion of rightness”.
It’s not that I want decision theorists to try to build AI systems (even notional ones). It’s that there are things that currently seem fundamentally confusing about the nature of decision-making, and resolving those confusions seems like it would help clarify a lot of questions about how optimization works. That’s part of why these issues strike me as natural for academic philosophers to take a swing at (while also being continuous with theoretical computer science, game theory, etc.).
The second distinction (“following a rule ‘directly’ vs. following it by adopting a sub-rule or via self-modification”) seems more relevant. You write:
My understanding is that when philosophers talk about “CDT,” they primarily have in mind R_CDT. Meanwhile, it seems like members of the rationalist or AI safety communities primarily have in mind P_CDT.
The difference matters, because people who believe that R_CDT is true don’t generally believe that we should build agents that implement P_CDT or that we should commit to following P_CDT ourselves.
Far from being a distinction proponents of UDT/FDT neglect, this is one of the main grounds on which UDT/FDT proponents criticize CDT (from within the “success-first” tradition). This is because agents that are reflectively inconsistent in the manner of CDT—ones that take actions they know they’ll regret taking, wish they were following a different decision rule, etc.—can be money-pumped and can otherwise lose arbitrary amounts of value.
A human following CDT should endorse “stop following CDT,” since CDT isn’t self-endorsing. It’s not even that they should endorse “keep following CDT, but adopt a heuristic or sub-rule that helps us better achieve CDT ends”; they need to completely abandon CDT even at the meta-level of “what sort of decision rule should I follow?” and modify themselves into purely following an entirely new decision rule, or else they’ll continue to perform poorly by CDT’s lights.
The decision rule that CDT does endorse loses a lot of the apparent elegance and naturalness of CDT. This rule, “son-of-CDT”, is roughly:
Have whatever disposition-to-act gets the most utility, unless I’m in future situations like “a twin prisoner’s dilemma against a perfect copy of my future self where the copy was forked from me before I started following this rule”, in which case ignore my correlation with that particular copy and make decisions as though our behavior is independent (while continuing to take into account my correlation with any copies of myself I end up in prisoner’s dilemmas with that were copied from my brain after I started following this rule).
The fact that CDT doesn’t endorse itself (while other theories do), the fact that it needs self-modification abilities in order to perform well by its own lights (and other theories don’t), and the fact that the theory it endorses is a strange Frankenstein theory (while there are simpler, cleaner theories available) would all be strikes against CDT on their own.
But this decision rule CDT endorses also still performs suboptimally (from the perspective of success-first decision theory). See the discussion of the Retro Blackmail Problem in “Toward Idealized Decision Theory”, where “CDT and any decision procedure to which CDT would self-modify see losing money to the blackmailer as the best available action.”
In the kind of voting dilemma where a coalition of UDT agents will coordinate to achieve higher-utility outcomes, an agent who became a son-of-CDT agent at age 20 will coordinate with the group insofar as she expects her decision to be correlated with other agents’ due to events that happened after she turned 20 (such as “the summer after my 20th birthday, we hung out together and converged a lot in how we think about voting theory”). But she’ll refuse to coordinate for reasons like “we hung out a lot the summer before my 20th birthday”, “we spent our whole childhoods and teen years living together and learning from the same teachers”, and “we all have similar decision-making faculties due to being members of the same species”. There’s no principled reason to draw this temporal distinction; it’s just an artifact of the fact that we started from CDT, and CDT is a flawed decision theory.
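As a minimal sketch of the underlying twin-dilemma point (the payoffs are an assumed standard prisoner’s-dilemma matrix): against a perfect copy, only the matched outcomes are reachable, so evaluating at the level of policies compares (C, C) directly against (D, D):

```python
# Assumed standard prisoner's-dilemma payoffs for the row player.
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}

def twin_payoff(policy: str) -> int:
    """Payoff against a perfect copy: the copy's move necessarily
    matches mine, so only diagonal outcomes can occur."""
    return PAYOFF[(policy, policy)]

# CDT treats the twin's move as independent, notes that D dominates
# row-by-row, and defects, landing on (D, D). Comparing the
# actually-reachable outcomes instead:
assert twin_payoff("C") > twin_payoff("D")  # 3 > 1
```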
Regarding the third distinction (“prescribing a certain kind of output vs. prescribing a step-by-step mental procedure for achieving that kind of output”), I’d say that it’s primarily the criterion of rightness that MIRI-cluster researchers care about. This is part of why the paper is called “Functional Decision Theory” and not (e.g.) “Algorithmic Decision Theory”: the focus is explicitly on “what outcomes do you produce?”, not on how you produce them.
(Thus, an FDT agent can cooperate with another agent whenever the latter agent’s input-output relations match FDT’s prescription in the relevant dilemmas, regardless of what computations they do to produce those outputs.)
The main reasons I think academic decision theory should spend more time coming up with algorithms that satisfy their decision rules are that (a) this has a track record of clarifying what various decision rules actually prescribe in different dilemmas, and (b) this has a track record of helping clarify other issues in the “understand what good reasoning is” project (e.g., logical uncertainty) and how they relate to decision theory.
I agree that these three distinctions are important
“Picking policies based on whether they satisfy a criterion X” vs. “Picking policies that happen to satisfy a criterion X”. (E.g., trying to pick a utilitarian policy vs. unintentionally behaving utilitarianly while trying to do something else.)
“Trying to follow a decision rule Y ‘directly’ or ‘on the object level’” vs. “Trying to follow a decision rule Y by following some other decision rule Z that you think satisfies Y”. (E.g., trying to naïvely follow utilitarianism without any assistance from sub-rules, heuristics, or self-modifications, vs. trying to follow utilitarianism by following other rules or mental habits you’ve come up with that you expected to make you better at selecting utilitarianism-endorsed actions.)
“A decision rule that prescribes outputting some action or policy and doesn’t care how you do it” vs. “A decision rule that prescribes following a particular set of cognitive steps that will then output some action or policy”. (E.g., a rule that says ‘maximize the aggregate welfare of moral patients’ vs. a specific mental algorithm intended to achieve that end.)
The second distinction here is most closely related to the one I have in mind, although I wouldn’t say it’s the same. Another way to express the distinction I have in mind is that it’s between (a) a normative claim and (b) a process of making decisions.
“Hedonistic utilitarianism is correct” would be a non-decision-theoretic example of (a). “Making decisions on the basis of coinflips” would be an example of (b).
In the context of decision theory, of course, I am thinking of R_CDT as an example of (a) and P_CDT as an example of (b).
I now have the sense I’m probably not doing a good job of communicating what I have in mind, though.
The main reason is that the many forms of weird, inconsistent, and poorly-generalizing behavior prescribed by CDT and EDT suggest that there are big holes in our current understanding of how decision-making works, holes deep enough that we’ve even been misunderstanding basic things at the level of “decision-theoretic criterion of rightness”.
I guess my view here is that exploring normative claims will probably only be pretty indirectly useful for understanding “how decision-making works,” since normative claims don’t typically seem to have any empirical/mathematical/etc. implications. For example, to again use a non-decision-theoretic example, I don’t think that learning that hedonistic utilitarianism is true would give us much insight into the computer science or cognitive science of decision-making. Although we might have different intuitions here.
It’s that there are things that currently seem fundamentally confusing about the nature of decision-making, and resolving those confusions seems like it would help clarify a lot of questions about how optimization works. That’s part of why these issues strike me as natural for academic philosophers to take a swing at (while also being continuous with theoretical computer science, game theory, etc.).
I agree that this is a worthwhile goal and that philosophers can probably contribute to it. I guess I’m just not sure that the question that most academic decision theorists are trying to answer—and the literature they’ve produced on it—will ultimately be very relevant.
The fact that CDT doesn’t endorse itself (while other theories do), the fact that it needs self-modification abilities in order to perform well by its own lights (and other theories don’t), and the fact that the theory it endorses is a strange Frankenstein theory (while there are simpler, cleaner theories available) would all be strikes against CDT on their own.
The fact that R_CDT is “self-effacing”—i.e. the fact that it doesn’t always recommend following P_CDT—definitely does seem like a point of intuitive evidence against R_CDT.
But I think R_UDT also has an important point in its disfavor. It fails to satisfy what might be called the “Don’t Make Things Worse Principle,” which says that: It’s not rational to take decisions that will definitely make things worse. Will’s Bomb case is a case where R_UDT violates this principle, which is very similar to his “Guaranteed Payoffs Principle.”
There’s then a question of which of these considerations is more relevant, when judging which of the two normative theories is more likely to be correct. The failure of R_UDT to satisfy the “Don’t Make Things Worse Principle” seems more important to me, but I don’t really know how to argue for this point beyond saying that this is just my intuition. I think that the failure of R_UDT to satisfy this principle—or something like it—is also probably the main reason why many philosophers find it intuitively implausible.
(IIRC the first part of Reasons and Persons is mostly a defense of the view that the correct theory of rationality may be self-effacing. But I’m not really familiar with the state of arguments here.)
In the kind of voting dilemma where a coalition of UDT agents will coordinate to achieve higher-utility outcomes, an agent who became a son-of-CDT agent at age 20 will coordinate with the group insofar as she expects her decision to be correlated with other agents’ due to events that happened after she turned 20 (such as “the summer after my 20th birthday, we hung out together and converged a lot in how we think about voting theory”). But she’ll refuse to coordinate for reasons like “we hung out a lot the summer before my 20th birthday”, “we spent our whole childhoods and teen years living together and learning from the same teachers”, and “we all have similar decision-making faculties due to being members of the same species”. There’s no principled reason to draw this temporal distinction; it’s just an artifact of the fact that we started from CDT, and CDT is a flawed decision theory.
I actually don’t think the son-of-CDT agent, in this scenario, will take these sorts of non-causal correlations into account at all. (Modifying just yourself to take non-causal correlations into account won’t cause you to achieve better outcomes here.) So I don’t think there should be any weird “Frankenstein” decision procedure thing going on.
….Thinking more about it, though, I’m now less sure how much the different normative decision theories should converge in their recommendations about AI design. I think they all agree that we should build systems that one-box in Newcomb-style scenarios. I think they also agree that, if we’re building twins, then we should design these twins to cooperate in twin prisoner’s dilemmas. But there may be some other contexts where acausal cooperation considerations do lead to genuine divergences. I don’t have very clear/settled thoughts about this, though.
But I think R_UDT also has an important point in its disfavor. It fails to satisfy what might be called the “Don’t Make Things Worse Principle,” which says that: It’s not rational to take decisions that will definitely make things worse. Will’s Bomb case is a case where R_UDT violates this principle, which is very similar to his “Guaranteed Payoffs Principle.”
I think “Don’t Make Things Worse” is a plausible principle at first glance.
One argument against this principle is that CDT endorses following it if you must, but would prefer to self-modify to stop following it (since doing so has higher expected causal utility). The general policy of following the “Don’t Make Things Worse Principle” makes things worse.
Once you’ve already adopted son-of-CDT, which says something like “act like UDT in future dilemmas insofar as the correlations were produced after I adopted this rule, but act like CDT in those dilemmas insofar as the correlations were produced before I adopted this rule”, it’s not clear to me why you wouldn’t just go: “Oh. CDT has lost the thing I thought made it appealing in the first place, this ‘Don’t Make Things Worse’ feature. If we’re going to end up stuck with UDT plus extra theoretical ugliness and loss-of-utility tacked on top, then why not just switch to UDT full stop?”
A more general argument against the Bomb intuition pump is that it involves trading away larger amounts of utility in most possible world-states, in order to get a smaller amount of utility in the Bomb world-state. From Abram Demski’s comments:
[...] In Bomb, the problem clearly stipulates that an agent who follows the FDT recommendation has a trillion trillion to one odds of doing better than an agent who follows the CDT/EDT recommendation. Complaining about the one-in-a-trillion-trillion chance that you get the bomb while being the sort of agent who takes the bomb is, to an FDT-theorist, like a gambler who has just lost a trillion-trillion-to-one bet complaining that the bet doesn’t look so rational now that the outcome is known with certainty to be the one-in-a-trillion-trillion case where the bet didn’t pay well.
[...] One way of thinking about this is to say that the FDT notion of “decision problem” is different from the CDT or EDT notion, in that FDT considers the prior to be of primary importance, whereas CDT and EDT consider it to be of no importance. If you had instead specified ‘bomb’ with just the certain information that ‘left’ is (causally and evidentially) very bad and ‘right’ is much less bad, then CDT and EDT would regard it as precisely the same decision problem, whereas FDT would consider it to be a radically different decision problem.
Another way to think about this is to say that FDT “rejects” decision problems which are improbable according to their own specification. In cases like Bomb where the situation as described is by its own description a one in a trillion trillion chance of occurring, FDT gives the outcome only one-trillion-trillion-th consideration in the expected utility calculation, when deciding on a strategy.
[...] This also hopefully clarifies the sense in which I don’t think the decisions pointed out in (III) are bizarre. The decisions are optimal according to the very probability distribution used to define the decision problem.
There’s a subtle point here, though, since Will describes the decision problem from an updated perspective—you already know the bomb is in front of you. So UDT “changes the problem” by evaluating “according to the prior”. From my perspective, because the very statement of the Bomb problem suggests that there were also other possible outcomes, we can rightly insist on evaluating expected utility in terms of those chances.
Perhaps this sounds like an unprincipled rejection of the Bomb problem as you state it. My principle is as follows: you should not state a decision problem without having in mind a well-specified way to predictably put agents into that scenario. Let’s call the way-you-put-agents-into-the-scenario the “construction”. We then evaluate agents on how well they deal with the construction.
For examples like Bomb, the construction gives us the overall probability distribution—this is then used for the expected value which UDT’s optimality notion is stated in terms of.
For other examples, as discussed in Decisions are for making bad outcomes inconsistent, the construction simply breaks when you try to put certain decision theories into it. This can also be a good thing; it means the decision theory makes certain scenarios altogether impossible.
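A toy calculation can make the ex-ante comparison in these quoted remarks concrete (the dollar-equivalent disvalue assigned to the bomb outcome is an assumed figure, chosen only for illustration):

```python
P_ERROR = 1e-24      # predictor errs one time in a trillion trillion
BOMB_DISVALUE = 1e9  # assumed dollar-equivalent cost of the bomb outcome

def expected_cost(policy: str) -> float:
    """Expected cost, evaluated from the prior, of being disposed to
    go Left or Right in the Bomb problem."""
    if policy == "left":
        # A Left-taker faces a bomb only when the predictor erred.
        return P_ERROR * BOMB_DISVALUE
    return 100.0  # Right always costs $100

# From the prior, the Left disposition is overwhelmingly cheaper:
assert expected_cost("left") < expected_cost("right")
```

On any remotely plausible finite disvalue for the bomb outcome, the one-in-a-trillion-trillion error rate swamps it, which is the sense in which the policy-level evaluation favors Left.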
One argument against this principle is that CDT endorses following it if you must, but would prefer to self-modify to stop following it (since doing so has higher expected causal utility).
A more general argument against the Bomb intuition pump is that it involves trading away larger amounts of utility in most possible world-states, in order to get a smaller amount of utility in the Bomb world-state.
This just seems to be the point that R_CDT is self-effacing: It says that people should not follow P_CDT, because following other decision procedures will produce better outcomes in expectation.
I definitely agree that R_CDT is self-effacing in this way (at least in certain scenarios). The question is just whether self-effacingness or failure to satisfy “Don’t Make Things Worse” is more relevant when trying to judge the likelihood of a criterion of rightness being correct. I’m not sure whether it’s possible to do much here other than present personal intuitions.
The point that R_UDT violates the “Don’t Make Things Worse” principle only infrequently seems relevant, but I’m still not sure this changes my intuitions very much.
If we’re going to end up stuck with UDT plus extra theoretical ugliness and loss-of-utility tacked on top, then why not just switch to UDT full stop?
I may just be missing something, but I don’t see what this theoretical ugliness is. And I don’t intuitively find the ugliness/elegance of the decision procedure recommended by a criterion of rightness to be very relevant when trying to judge whether the criterion is correct.
[[EDIT: Just an extra thought on the fact that R_CDT is self-effacing. My impression is that self-effacingness is typically regarded as a relatively weak reason to reject a moral theory. For example, a lot of people regard utilitarianism as self-effacing both because it’s costly to directly evaluate the utility produced by actions and because others often react poorly to people who engage in utilitarian-style reasoning—but this typically isn’t regarded as a slam-dunk reason to believe that utilitarianism is false. I think the SEP article on consequentialism is expressing a pretty mainstream position when it says: “[T]here is nothing incoherent about proposing a decision procedure that is separate from one’s criterion of the right.… Criteria can, thus, be self-effacing without being self-refuting.” Insofar as people don’t tend to buy self-effacingness as a slam-dunk argument against the truth of moral theories, it’s not clear why they should buy it as a slam-dunk argument against the truth of normative decision theories.]]
is more relevant when trying to judge the likelihood of a criterion of rightness being correct
Sorry to drop in in the middle of this back and forth, but I am curious—do you think it’s quite likely that there is a single criterion of rightness that is objectively “correct”?
It seems to me that we have a number of intuitive properties (meta criteria of rightness?) that we would like a criterion of rightness to satisfy (e.g. “don’t make things worse”, or “don’t be self-effacing”). And so far there doesn’t seem to be any single criterion that satisfies all of them.
So why not just conclude that, similar to the case with voting and Arrow’s theorem, perhaps there’s just no single perfect criterion of rightness.
In other words, once we agree that CDT doesn’t make things worse, but that UDT is better as a general policy, is there anything left to argue about regarding which is “correct”?
EDIT: Decided I had better go and read your Realism and Rationality post, and ended up leaving a lengthy comment there.
Sorry to drop in in the middle of this back and forth, but I am curious—do you think it’s quite likely that there is a single criterion of rightness that is objectively “correct”?
It seems to me that we have a number of intuitive properties (meta criteria of rightness?) that we would like a criterion of rightness to satisfy (e.g. “don’t make things worse”, or “don’t be self-effacing”). And so far there doesn’t seem to be any single criterion that satisfies all of them.
So why not just conclude that, similar to the case with voting and Arrow’s theorem, perhaps there’s just no single perfect criterion of rightness.
Happy to be dropped in on :)
I think it’s totally conceivable that no criterion of rightness is correct (e.g. because the concept of a “criterion of rightness” turns out to be some spooky bit of nonsense that doesn’t really map onto anything in the real world.)
I suppose the main things I’m arguing are just that:
When a philosopher expresses support for a “decision theory,” they are typically saying that they believe some claim about what the correct criterion of rightness is.
Claims about the correct criterion of rightness are distinct from decision procedures.
Therefore, when a member of the rationalist community uses the word “decision theory” to refer to a decision procedure, they are talking about something that’s pretty conceptually distinct from what philosophers typically have in mind. Discussions about what decision procedure performs best or about what decision procedure we should build into future AI systems [[EDIT: or what decision procedure most closely matches our preferences about decision procedures]] don’t directly speak to the questions that most academic “decision theorists” are actually debating with one another.
I also think that, conditional on there being a correct criterion of rightness, R_CDT is more plausible than R_UDT. But this is a relatively tentative view. I’m definitely not a super hardcore R_CDT believer.
It seems to me that we have a number of intuitive properties (meta criteria of rightness?) that we would like a criterion of rightness to satisfy (e.g. “don’t make things worse”, or “don’t be self-effacing”). And so far there doesn’t seem to be any single criterion that satisfies all of them.
So why not just conclude that, similar to the case with voting and Arrow’s theorem, perhaps there’s just no single perfect criterion of rightness.
I guess here—in almost definitely too many words—is how I think about the issue. (Hopefully these comments are at least somewhat responsive to your question.)
It seems like the following general situation is pretty common: Someone is initially inclined to think that anything with property P will also have properties Q1 and Q2. But then they realize that properties Q1 and Q2 are inconsistent with one another.
One possible reaction to this situation is to conclude that nothing actually has property P. Maybe the idea of property P isn’t even conceptually coherent and we should stop talking about it (while continuing to independently discuss properties Q1 and Q2). Often the more natural reaction, though, is to continue to believe that some things have property P—but just drop the assumption that these things will also have both property Q1 and property Q2.
This is obviously a pretty abstract description, so I’ll give a few examples. (No need to read the examples if the point seems obvious.)
Ethics: I might initially be inclined to think that it’s always ethical (property P) to maximize happiness and that it’s always unethical to torture people. But then I may realize that there’s an inconsistency here: in at least rare circumstances, such as ticking time-bomb scenarios where torture can extract crucial information, there may be no decision that is both happiness maximizing (Q1) and torture-avoiding (Q2). It seems like a natural reaction here is just to drop either the belief that maximizing happiness is always ethical or that torture is always unethical. It doesn’t seem like I need to abandon my belief that some actions have the property of being ethical.
Theology: I might initially be inclined to think that God is all-knowing, all-powerful, and all-good. But then I might come to believe (whether rightly or not) that, given the existence of evil, these three properties are inconsistent. I might then continue to believe that God exists, but just drop my belief that God is all-good. (To very awkwardly re-express this in the language of properties: This would mean dropping my belief that any entity that has the property of being God also has the property of being all-good).
Politician-bashing: I might initially be inclined to characterize some politician both as an incompetent leader and as someone who’s successfully carrying out an evil long-term plan to transform the country. Then I might realize that these two characterizations are in tension with one another. A pretty natural reaction, then, might be to continue to believe the politician exists—but just drop my belief that they’re incompetent.
To turn to the case of the decision-theoretic criterion of rightness, I might initially be inclined to think that the correct criterion of rightness will satisfy both “Don’t Make Things Worse” and “No Self-Effacement.” It’s now become clear, though, that no criterion of rightness can satisfy both of these principles. I think it’s pretty reasonable, then, to continue to believe that there’s a correct criterion of rightness—but just drop the belief that the correct criterion of rightness will also satisfy “No Self-Effacement.”
It seems like the following general situation is pretty common: Someone is initially inclined to think that anything with property P will also have properties Q1 and Q2. But then they realize that properties Q1 and Q2 are inconsistent with one another.
One possible reaction to this situation is to conclude that nothing actually has property P. Maybe the idea of property P isn’t even conceptually coherent and we should stop talking about it (while continuing to independently discuss properties Q1 and Q2). Often the more natural reaction, though, is to continue to believe that some things have property P—but just drop the assumption that these things will also have both property Q1 and property Q2.
I think I disagree with the claim (or implication) that keeping P is more often the natural reaction. Well, you’re just saying it’s “often” natural, and I suppose it’s natural in some cases and not others. But I think we may disagree on how often it’s natural, though that’s hard to say at this very abstract level. (Did you see my comment in response to your Realism and Rationality post?)
In particular, I’m curious what makes you optimistic about finding a “correct” criterion of rightness. In the case of the politician, it seems clear that learning that they lack some of the properties you attributed to them shouldn’t call into question whether they exist at all.
But for the case of a criterion of rightness, my intuition (informed by the style of thinking in my comment) is that there’s no particular reason to think there should be one criterion that obviously fits the bill. Your intuition seems to be the opposite, and I’m not sure I understand why.
My best guess, particularly informed by reading through footnote 15 on your Realism and Rationality post, is that when faced with ethical dilemmas (like your torture vs lollipop examples), it seems like there is a correct answer. Does that seem right?
(I realize at this point we’re talking about intuitions and priors on a pretty abstract level, so it may be hard to give a good answer.)
I think I disagree with the claim (or implication) that keeping P is more often the natural reaction. Well, you’re just saying it’s “often” natural, and I suppose it’s natural in some cases and not others. But I think we may disagree on how often it’s natural, though that’s hard to say at this very abstract level. (Did you see my comment in response to your Realism and Rationality post?)
In particular, I’m curious what makes you optimistic about finding a “correct” criterion of rightness. In the case of the politician, it seems clear that learning that they lack some of the properties you attributed to them shouldn’t call into question whether they exist at all.
But for the case of a criterion of rightness, my intuition (informed by the style of thinking in my comment) is that there’s no particular reason to think there should be one criterion that obviously fits the bill. Your intuition seems to be the opposite, and I’m not sure I understand why.
Hey again!
I appreciated your comment on the LW post. I started writing up a response to this comment and your LW one, back when the thread was still active, and then stopped because it had become obscenely long. Then I ended up badly needing to procrastinate doing something else today. So here’s an over-long document I probably shouldn’t have written, which you are under no social obligation to read.
I think there’s a key piece of your thinking that I don’t quite understand / disagree with, and it’s the idea that normativity is irreducible.
I think I follow you that if normativity were irreducible, then it wouldn’t be a good candidate for abandonment or revision. But that seems almost like begging the question. I don’t understand why it’s irreducible.
Suppose normativity is not actually one thing, but is a jumble of 15 overlapping things that sometimes come apart. This doesn’t seem like it poses any challenge to your intuitions from footnote 6 in the document (starting with “I personally care a lot about the question: ‘Is there anything I should do, and, if so, what?’”). And at the same time it explains why there are weird edge cases where the concept seems to break down.
So few things in life seem to be irreducible. (E.g. neither Eric nor Ben is irreducible!) So why would normativity be?
[You also should feel under no social obligation to respond, though it would be fun to discuss this the next time we find ourselves at the same party, should such a situation arise.]
This is a good discussion! Ben, thank you for inspiring so many of these different paths we’ve been going down. :) At some point the hydra will have to stop growing, but I do think the intuitions you’ve been sharing are widespread enough that it’s very worthwhile to have public discussion on these points.
Therefore, when a member of the rationalist community uses the word “decision theory” to refer to a decision procedure, they are talking about something that’s pretty conceptually distinct from what philosophers typically have in mind. Discussions about what decision procedure performs best or about what decision procedure we should build into future AI systems don’t directly speak to the questions that most academic “decision theorists” are actually debating with one another.
On the contrary:
MIRI is more interested in identifying generalizations about good reasoning (“criteria of rightness”) than in fully specifying a particular algorithm.
MIRI does discuss decision algorithms in order to better understand decision-making, but this isn’t different in kind from the ordinary way decision theorists hash things out. E.g., the traditional formulation of CDT is underspecified in dilemmas like Death in Damascus. Joyce and Arntzenius’ response to this wasn’t to go “algorithms are uncouth in our field”; it was to propose step-by-step procedures that they think capture the intuitions behind CDT and give satisfying recommendations for how to act.
MIRI does discuss “what decision procedure performs best”, but this isn’t any different from traditional arguments in the field like “naive EDT is wrong because it performs poorly in the smoking lesion problem”. Compared to the average decision theorist, the average rationalist puts somewhat more weight on some considerations and less weight on others, but this isn’t different in kind from the ordinary disagreements that motivate different views within academic decision theory, and these disagreements about what weight to give categories of consideration are themselves amenable to argument.
As I noted above, MIRI is primarily interested in decision theory for the sake of better understanding the nature of intelligence, optimization, embedded agency, etc., not for the sake of picking a “decision theory we should build into future AI systems”. Again, this doesn’t seem unlike the case of philosophers who think that decision theory arguments will help them reach conclusions about the nature of rationality.
I think it’s totally conceivable that no criterion of rightness is correct (e.g. because the concept of a “criterion of rightness” turns out to be some spooky bit of nonsense that doesn’t really map onto anything in the real world).
Could you give an example of what the correctness of a meta-criterion like “Don’t Make Things Worse” could in principle consist in?
I’m not looking here for a “reduction” in the sense of a full translation into other, simpler terms. I just want a way of making sense of how human brains can tell what’s “decision-theoretically normative” in cases like this.
Human brains didn’t evolve to have a primitive “normativity detector” that beeps every time a certain thing is Platonically Normative. Rather, different kinds of normativity can be understood by appeal to unmysterious matters like “things brains value as ends”, “things that are useful for various ends”, “things that accurately map states of affairs”...
When I think of other examples of normativity, my sense is that in every case there’s at least one good account of why a human might be able to distinguish “truly” normative things from non-normative ones. E.g. (considering both epistemic and non-epistemic norms):
1. If I discover two alien species who disagree about the truth-value of “carbon atoms have six protons”, I can evaluate their correctness by looking at the world and seeing whether their statement matches the world.
2. If I discover two alien species who disagree about the truth value of “pawns cannot move backwards in chess” or “there are statements in the language of Peano arithmetic that can neither be proved nor disproved in Peano arithmetic”, then I can explain the rules of ‘proving things about chess’ or ‘proving things about PA’ as a symbol game, and write down strings of symbols that collectively constitute a ‘proof’ of the statement in question.
I can then assert that if any member of any species plays the relevant ‘proof’ game using the same rules, from now until the end of time, they will never prove the negation of my result, and (paper, pen, time, and ingenuity allowing) they will always be able to re-prove my result.
(I could further argue that these symbol games are useful ones to play, because various practical tasks are easier once we’ve accumulated enough knowledge about legal proofs in certain games. This usefulness itself provides a criterion for choosing between “follow through on the proof process” and “just start doodling things or writing random letters down”.)
The above doesn’t answer questions like “do the relevant symbols have Platonic objects as truthmakers or referents?”, or “why do we live in a consistent universe?”, or the like. But the above answer seems sufficient for rejecting any claim that there’s something pointless, epistemically suspect, or unacceptably human-centric about affirming Gödel’s first incompleteness theorem. The above is minimally sufficient grounds for going ahead and continuing to treat math as something more significant than theology, regardless of whether we then go on to articulate a more satisfying explanation of why these symbol games work the way they do.
3. If I discover two alien species who disagree about the truth-value of “suffering is terminally valuable”, then I can think of at least two concrete ways to evaluate which parties are correct. First, I can look at the brains of a particular individual or group, see what that individual or group terminally values, and see whether the statement matches what’s encoded in those brains. Commonly the group I use for this purpose is human beings, such that if an alien (or a housecat, etc.) terminally values suffering, I say that this is “wrong”.
Alternatively, I can make different “wrong” predicates for each species: wrong_human, wrong_alien1, wrong_alien2, wrong_housecat, etc.
This has the disadvantage of maybe making it sound like all these values are on “equal footing” in an internally inconsistent way (“it’s wrong to put undue weight on what’s wrong_human!”, where the first “wrong” is secretly standing in for “wrong_human”), but has the advantage of making it easy to see why the aliens’ disagreement might be important and substantive, while still allowing that aliens’ normative claims can be wrong (because they can be mistaken about their own core values).
The details of how to go from a brain to an encoding of “what’s right” seem incredibly complex and open to debate, but it seems beyond reasonable dispute that if the information content of a set of terminal values is encoded anywhere in the universe, it’s going to be in brains (or constructs from brains) rather than in patterns of interstellar dust, digits of pi, physical laws, etc.
If a criterion like “Don’t Make Things Worse” deserves a lot of weight, I want to know what that weight is coming from.
If the answer is “I know it has to come from something, but I don’t know what yet”, then that seems like a perfectly fine placeholder answer to me.
If the answer is “This is like the ‘terminal values’ case, in that (I hypothesize) it’s just an ineradicable component of what humans care about”, then that also seems structurally fine, though I’m extremely skeptical of the claim that the “warm glow of feeling causally efficacious” is important enough to outweigh other things of great value in the real world.
If the answer is “I think ‘Don’t Make Things Worse’ is instrumentally useful, i.e., more useful than UDT for achieving the other things humans want in life”, then I claim this is just false. But, again, this seems like the right kind of argument to be making; if CDT is better than UDT, then that betterness ought to consist in something.
I mostly agree with this. I think the disagreement between CDT and FDT/UDT advocates is less about definitions, and more about which of these things feels more compelling:
1. On the whole, FDT/UDT ends up with more utility.
(I think this intuition tends to hold more force with people the more emotionally salient “more utility” is to you. E.g., consider a version of Newcomb’s problem where two-boxing gets you $100, while one-boxing gets you $100,000 and saves your child’s life.)
2. I’m not the slave of my decision theory, or of the predictor, or of any environmental factor; I can freely choose to do anything in any dilemma, and by choosing to not leave money on the table (e.g., in a transparent Newcomb problem with a 1% chance of predictor failure where I’ve already observed that the second box is empty), I’m “getting away with something” and getting free utility that the FDT agent would miss out on.
(I think this intuition tends to hold more force with people the more emotionally salient it is to imagine the dollars sitting right there in front of you and you knowing that it’s “too late” for one-boxing to get you any more utility in this world.)
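For what it’s worth, the payoff asymmetry behind intuition 1 is easy to make quantitative. Here is a toy expected-value sketch of my own (not from the original discussion), using the standard illustrative Newcomb payoffs and a predictor of accuracy p:

```python
# Toy sketch: expected payoff of committing to each policy in Newcomb's
# problem, assuming the standard illustrative payoffs ($1M opaque box,
# $1K transparent box) and a predictor that is correct with probability p.

BIG, SMALL = 1_000_000, 1_000

def expected_utility(policy: str, p: float) -> float:
    if policy == "one-box":
        # The opaque box is full iff the predictor foresaw one-boxing.
        return p * BIG
    if policy == "two-box":
        # You always get the $1K; the $1M is there only if the predictor erred.
        return SMALL + (1 - p) * BIG
    raise ValueError(policy)
```

On these numbers the one-boxing policy wins in expectation for any predictor accuracy above roughly 50.05%, which is part of why the “more utility” intuition doesn’t depend on the predictor being anywhere near perfect.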
There are other considerations too, like how much it matters to you that CDT isn’t self-endorsing. CDT prescribes self-modifying in all future dilemmas so that you behave in a more UDT-like way. It’s fine to say that you personally lack the willpower to follow through once you actually get into the dilemma and see the boxes sitting in front of you; but it’s still the case that a sufficiently disciplined and foresightful CDT agent will generally end up behaving like FDT in the very dilemmas that have been cited to argue for CDT.
If a more disciplined and well-prepared version of you would have one-boxed, then isn’t there something off about saying that two-boxing is in any sense “correct”? Even the act of praising CDT seems a bit self-destructive here, inasmuch as (a) CDT prescribes ditching CDT, and (b) realistically, praising or identifying with CDT is likely to make it harder for a human being to follow through on switching to son-of-CDT (as CDT prescribes).
Mind you, if the sentence “CDT is the most rational decision theory” is true in some substantive, non-trivial, non-circular sense, then I’m inclined to think we should acknowledge this truth, even if it makes it a bit harder to follow through on the EDT+CDT+UDT prescription to one-box in strictly-future Newcomblike problems. When the truth is inconvenient, I tend to think it’s better to accept that truth than to linguistically conceal it.
But the arguments I’ve seen for “CDT is the most rational decision theory” to date have struck me as either circular, or as reducing to “I know CDT doesn’t get me the most utility, but something about it just feels right”.
It’s fine, I think, if “it just feels right” is meant to be a promissory note for some forthcoming account — a clue that there’s some deeper reason to favor CDT, though we haven’t discovered it yet. As the FDT paper puts it:
These are odd conclusions. It might even be argued that sufficiently odd behavior provides evidence that what FDT agents see as “rational” diverges from what humans see as “rational.” And given enough divergence of that sort, we might be justified in predicting that FDT will systematically fail to get the most utility in some as-yet-unknown fair test.
On the other hand, if “it just feels right” is meant to be the final word on why “CDT is the most rational decision theory”, then I feel comfortable saying that “rational” is a poor choice of word here, and neither maps onto a key descriptive category nor maps onto any prescription or norm worthy of being followed.
My impression is that most CDT advocates who know about FDT think FDT is making some kind of epistemic mistake, where the most popular candidate (I think) is some version of magical thinking.
Superstitious people often believe that it’s possible to directly causally influence things across great distances of time and space. At a glance, FDT’s prescription (“one-box, even though you can’t causally affect whether the box is full”) as well as its account of how and why this works (“you can somehow ‘control’ the properties of abstract objects like ‘decision functions’”) seem weird and spooky in the manner of a superstition.
FDT’s response: if a thing seems spooky, that’s a fine first-pass reason to be suspicious of it. But at some point, the accusation of magical thinking has to cash out in some sort of practical, real-world failure—in the case of decision theory, some systematic loss of utility that isn’t balanced by an equal, symmetric loss of utility from CDT. After enough experience of seeing a tool outperforming the competition in scenario after scenario, at some point calling the use of that tool “magical thinking” starts to ring rather hollow. At that point, it’s necessary to consider the possibility that FDT is counter-intuitive but correct (like Einstein’s “spukhafte Fernwirkung”), rather than magical.
In turn, FDT advocates tend to think the following reflects an epistemic mistake by CDT advocates:
2. I’m not the slave of my decision theory, or of the predictor, or of any environmental factor; I can freely choose to do anything in any dilemma, and by choosing to not leave money on the table (e.g., in a transparent Newcomb problem with a 1% chance of predictor failure where I’ve already observed that the second box is empty), I’m “getting away with something” and getting free utility that the FDT agent would miss out on.
The alleged mistake here is a violation of naturalism. Humans tend to think of themselves as free Cartesian agents acting upon the world, rather than as deterministic subprocesses of a larger deterministic process. If we consistently and whole-heartedly accepted the “deterministic subprocess” view of our decision-making, we would find nothing strange about the idea that it’s sometimes right for this subprocess to do locally incorrect things for the sake of better global results.
E.g., consider the transparent Newcomb problem with a 1% chance of predictor error. If we think of the brain’s decision-making as a rule-governed system whose rules we are currently determining (via a meta-reasoning process that is itself governed by deterministic rules), then there’s nothing strange about enacting a rule that gets us $1M in 99% of outcomes and $0 in 1% of outcomes; and following through when the unlucky 1% scenario hits us is nothing to agonize over, it’s just a consequence of the rule we already decided. In that regard, steering the rule-governed system that is your brain is no different than designing a factory robot that performs well enough in 99% of cases to offset the 1% of cases where something goes wrong.
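To spell out the arithmetic behind this rule comparison, here is a toy calculation of my own, with the usual illustrative payoffs:

```python
# Toy sketch: expected value of committing to each rule in the transparent
# Newcomb problem with a 1% predictor error rate. Dollar figures are the
# standard illustrative ones, not from the original discussion.

BIG, SMALL = 1_000_000, 1_000

def rule_expected_value(rule: str, error: float = 0.01) -> float:
    if rule == "always-one-box":
        # 99%: predictor foresees one-boxing, big box is full, we take it.
        # 1%: predictor errs, big box is empty, we one-box anyway and get $0.
        return (1 - error) * BIG
    if rule == "always-two-box":
        # 99%: predictor foresees two-boxing, big box is empty, we get $1K.
        # 1%: predictor errs, big box is full, we get both boxes.
        return (1 - error) * SMALL + error * (BIG + SMALL)
    raise ValueError(rule)
```

Committing to the one-boxing rule is worth about $990,000 in expectation versus about $11,000 for the two-boxing rule, which is the sense in which following through in the unlucky 1% case is just part of the cost of the better rule.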
(Note how a lot of these points are more intuitive in CS language. I don’t think it’s a coincidence that people coming from CS were able to improve on academic decision theory’s ideas on these points; I think it’s related to what kinds of stumbling blocks get in the way of thinking in these terms.)
Suppose you initially tell yourself:
“I’m going to one-box in all strictly-future transparent Newcomb problems, since this produces more expected causal (and evidential, and functional) utility. One-boxing and receiving $1M in 99% of future states is worth the $1000 cost of one-boxing in the other 1% of future states.”
Suppose that you then find yourself facing the 1%-likely outcome where Omega leaves the box empty regardless of your choice. You then have a change of heart and decide to two-box after all, taking the $1000.
I claim that the above description feels from the inside like your brain is escaping the iron chains of determinism (even if your scientifically literate system-2 verbal reasoning fully recognizes that you’re a deterministic process). And I claim that this feeling (plus maybe some reluctance to fully accept the problem description as accurate?) is the only thing that makes CDT’s decision seem reasonable in this case.
In reality, however, if we end up not following through on our verbal commitment and we two-box in that 1% scenario, then this would just prove that we’d been mistaken about what rule we had successfully installed in our brains. As it turns out, we were really following the lower-global-utility rule from the outset. A lack of follow-through or a failure of will is itself a part of the decision-making process that Omega is predicting; however much it feels as though a last-minute swerve is you “getting away with something”, it’s really just you deterministically following through on an algorithm that will get you less utility in 99% of scenarios (while happening to be bad at predicting your own behavior and bad at following through on verbalized plans).
I should emphasize that the above is my own attempt to characterize the intuitions behind CDT and FDT, based on the arguments I’ve seen in the wild and based on what makes me feel more compelled by CDT, or by FDT. I could easily be wrong about the crux of disagreement between some CDT and FDT advocates.
In turn, FDT advocates tend to think the following reflects an epistemic mistake by CDT advocates:
I’m not the slave of my decision theory, or of the predictor, or of any environmental factor; I can freely choose to do anything in any dilemma, and by choosing to not leave money on the table (e.g., in a transparent Newcomb problem with a 1% chance of predictor failure where I’ve already observed that the second box is empty), I’m “getting away with something” and getting free utility that the FDT agent would miss out on.
The alleged mistake here is a violation of naturalism. Humans tend to think of themselves as free Cartesian agents acting upon the world, rather than as deterministic subprocesses of a larger deterministic process. If we consistently and whole-heartedly accepted the “deterministic subprocess” view of our decision-making, we would find nothing strange about the idea that it’s sometimes right for this subprocess to do locally incorrect things for the sake of better global results.
Is the following a roughly accurate re-characterization of the intuition here?
“Suppose that there’s an agent that implements P_UDT. Because it is following P_UDT, when it enters the box room it finds a ton of money in the first box and then refrains from taking the money in the second box. People who believe R_CDT claim that the agent should have also taken the money in the second box. But, given that the universe is deterministic, this doesn’t really make sense. From before the moment the agent entered the room, it was already determined that the agent would one-box. Since (in a physically deterministic sense) the P_UDT agent could not have two-boxed, there’s no relevant sense in which the agent should have two-boxed.”
If so, then I suppose my first reaction is that this seems like a general argument against normative realism rather than an argument against any specific proposed criterion of rightness. It also applies, for example, to the claim that a P_CDT agent “should have” one-boxed—since in a physically deterministic sense it could not have. Therefore, I think it’s probably better to think of this as an argument against the truth (and possibly conceptual coherence) of both R_CDT and R_UDT, rather than an argument that favors one over the other.
In general, it seems to me like all statements that evoke counterfactuals have something like this problem. For example, it is physically determined what sort of decision procedure we will build into any given AI system; only one choice of decision procedure is physically consistent with the state of the world at the time the choice is made. So—insofar as we accept this kind of objection from determinism—there seems to be something problematically non-naturalistic about discussing what “would have happened” if we built in one decision procedure or another.
Since (in a physically deterministic sense) the P_UDT agent could not have two-boxed, there’s no relevant sense in which the agent should have two-boxed.”
No, I don’t endorse this argument. To simplify the discussion, let’s assume that the Newcomb predictor is infallible. FDT agents, CDT agents, and EDT agents each get a decision: two-box (which gets you $1000 plus an empty box), or one-box (which gets you $1,000,000 and leaves the $1000 behind). Obviously, insofar as they are in fact following the instructions of their decision theory, there’s only one possible outcome; but it would be odd to say that a decision stops being a decision just because it’s determined by something. (What’s the alternative?)
I do endorse “given the predictor’s perfect accuracy, it’s impossible for the P_UDT agent to two-box and come away with $1,001,000”. I also endorse “given the predictor’s perfect accuracy, it’s impossible for the P_CDT agent to two-box and come away with $1,001,000”. Per the problem specification, no agent can two-box and get $1,001,000 or one-box and get $0. But this doesn’t mean that no decision is made; it just means that the predictor can predict the decision early enough to fill the boxes accordingly.
(Notably, the agent following P_CDT two-boxes because $1,001,000 > $1,000,000 and $1000 > $0, even though this “dominance” argument appeals to two outcomes that are known to be impossible just from the problem statement. I certainly don’t think agents “should” try to achieve outcomes that are impossible from the problem specification itself. The reason non-CDT agents get more utility than CDT agents in Newcomb’s problem is that they take into account that the predictor is a predictor when they construct their counterfactuals.)
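One way to visualize the point about impossible outcomes (my own illustrative sketch): list the four (prediction, action) payoff cells and mark which ones a perfect predictor rules out.

```python
# Toy sketch: the four (prediction, action) payoff cells in Newcomb's problem.
# Under a perfectly accurate predictor, the prediction always matches the
# action, so only the diagonal cells are actually reachable.

payoffs = {
    ("predict-one-box", "one-box"): 1_000_000,
    ("predict-one-box", "two-box"): 1_001_000,  # ruled out by perfect prediction
    ("predict-two-box", "one-box"): 0,          # ruled out by perfect prediction
    ("predict-two-box", "two-box"): 1_000,
}

# Keep only the cells where the prediction matches the action.
feasible = {cell: value for cell, value in payoffs.items()
            if cell[0] == "predict-" + cell[1]}
# The dominance argument compares across rows, including the two ruled-out
# cells; the feasible set only ever offers $1,000,000 vs $1,000.
```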
In the transparent version of this dilemma, the agent who sees the $1M and one-boxes also “could have two-boxed”, but if they had two-boxed, it would only have been after making a different observation. In that sense, if the agent has any lingering uncertainty about what they’ll choose, the uncertainty goes away as soon as they see whether the box is full.
In general, it seems to me like all statements that evoke counterfactuals have something like this problem. For example, it is physically determined what sort of decision procedure we will build into any given AI system; only one choice of decision procedure is physically consistent with the state of the world at the time the choice is made. So—insofar as we accept this kind of objection from determinism—there seems to be something problematically non-naturalistic about discussing what “would have happened” if we built in one decision procedure or another.
No, there’s nothing non-naturalistic about this. Consider the scenario you and I are in. Simplifying somewhat, we can think of ourselves as each doing meta-reasoning to try to choose between different decision algorithms to follow going forward; where the new things we learn in this conversation are themselves a part of that meta-reasoning.
The meta-reasoning process is deterministic, just like the object-level decision algorithms are. But this doesn’t mean that we can’t choose between object-level decision algorithms. Rather, the meta-reasoning (in spite of having deterministic causes) chooses either “I think I’ll follow P_FDT from now on” or “I think I’ll follow P_CDT from now on”. Then the chosen decision algorithm (in spite of also having deterministic causes) outputs choices about subsequent actions to take. Meta-processes that select between decision algorithms (to put into an AI, or to run in your own brain, or to recommend to other humans, etc.) can make “real decisions”, for exactly the same reason (and in exactly the same sense) that the decision algorithms in question can make real decisions.
It isn’t problematic that all these processes require us to consider counterfactuals that (if we were omniscient) we would perceive as inconsistent/impossible. Deliberation, both at the object level and at the meta level, just is the process of determining the unique and only possible decision. Yet because we are uncertain about the outcome of the deliberation while deliberating, and because the details of the deliberation process do determine our decision (even as these details themselves have preceding causes), it feels from the inside of this process as though both options are “live”, are possible, until the very moment we decide.
I certainly don’t think agents “should” try to achieve outcomes that are impossible from the problem specification itself.
I think you need to make a clearer distinction here between “outcomes that don’t exist in the universe’s dynamics” (like taking both boxes and receiving $1,001,000) and “outcomes that can’t exist in my branch” (like there not being a bomb in the unlucky case). Because if you’re operating just in the branch you find yourself in, many outcomes whose probability an FDT agent is trying to affect are impossible from the problem specification (once you include observations).
And, to be clear, I do think agents “should” try to achieve outcomes that are impossible from the problem specification including observations, if certain criteria are met, in a way that basically lines up with FDT, just like agents “should” try to achieve outcomes that are already known to have happened from the problem specification including observations.
As an example, if you’re in Parfit’s Hitchhiker, you should pay once you reach town, even though reaching town has probability 1 in cases where you’re deciding whether or not to pay, and the reason for this is because it was necessary for reaching town to have had probability 1.
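The structure of Parfit’s Hitchhiker can be sketched with the same kind of toy model (my own illustration; the dollar figures are made up, with rescue worth $1,000,000, the promised payment $1,000, and a perfectly accurate driver assumed):

```python
# Toy sketch of Parfit's Hitchhiker with a perfectly accurate driver.
# Numbers are made up for illustration: being rescued from the desert is
# worth $1M to you; the promised payment once in town is $1K.

RESCUE, PAYMENT = 1_000_000, 1_000

def policy_value(will_pay: bool) -> int:
    # The driver rescues you iff they predict you will pay on reaching town.
    rescued = will_pay
    if not rescued:
        return 0             # left in the desert
    return RESCUE - PAYMENT  # rescued, then you follow through and pay
```

A policy of paying nets $999,000 while a policy of refusing nets $0, which is why “pay once you reach town” is the policy you want to have had all along, even though reaching town is already certain at the moment you decide whether to pay.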
Notably, the agent following P_CDT two-boxes because $1,001,000 > $1,000,000 and $1000 > $0, even though this “dominance” argument appeals to two outcomes that are known to be impossible just from the problem statement. I certainly don’t think agents “should” try to achieve outcomes that are impossible from the problem specification itself.
Suppose that we accept the principle that agents never “should” try to achieve outcomes that are impossible from the problem specification—with one implication being that it’s false that (as R_CDT suggests) agents that see a million dollars in the first box “should” two-box.
This seems to imply that it’s also false that (as R_UDT suggests) an agent that sees that the first box is empty “should” one-box. By the problem specification, of course, one-boxing when there is no money in the first box is also an impossible outcome. Since decisions to two-box only occur when the first box is empty, this would then imply that decisions to two-box are never irrational in the context of this problem. But I imagine you don’t want to say that.
I think I probably still don’t understand your objection here—so I’m not sure this point is actually responsive to it—but I initially have trouble seeing what potential violations of naturalism/determinism R_CDT could be committing that R_UDT would not also be committing.
(Of course, just to be clear, both R_UDT and R_CDT imply that the decision to commit yourself to a one-boxing policy at the start of the game would be rational. They only diverge in their judgments of what actual in-room boxing decision would be rational. R_UDT says that the decision to two-box is irrational and R_CDT says that the decision to one-box is irrational.)
But the arguments I’ve seen for “CDT is the most rational decision theory” to date have struck me as either circular, or as reducing to “I know CDT doesn’t get me the most utility, but something about it just feels right”.
It seems to me like they’re coming down to saying something like: the “Guaranteed Payoffs Principle” / “Don’t Make Things Worse Principle” is more core to rational action than being self-consistent. Whereas others think self-consistency is more important.
Mind you, if the sentence “CDT is the most rational decision theory” is true in some substantive, non-trivial, non-circular sense
It’s not clear to me that the justification for CDT is more circular than the justification for FDT. Doesn’t it come down to which principles you favor?
Maybe you could say FDT is more elegant. Or maybe that it satisfies more of the intuitive properties we’d hope for from a decision theory (where elegance might be one of those). But I’m not sure that would make the justification less-circular per se.
I guess one way the justification for CDT could be more circular is if the key or only principle that pushes in favor of it over FDT can really just be seen as a restatement of CDT in a way that the principles that push in favor of FDT do not. Is that what you would claim?
Whereas others think self-consistency is more important.
The main argument against CDT (in my view) is that it tends to get you less utility (regardless of whether you add self-modification so it can switch to other decision theories). Self-consistency is a secondary issue.
It’s not clear to me that the justification for CDT is more circular than the justification for FDT. Doesn’t it come down to which principles you favor?
FDT gets you more utility than CDT. If you value literally anything in life more than you value “which ritual do I use to make my decisions?”, then you should go with FDT over CDT; that’s the core argument.
This argument for FDT would be question-begging if CDT proponents rejected utility as a desirable thing. But instead CDT proponents who are familiar with FDT agree utility is a positive, and either (a) they think there’s no meaningful sense in which FDT systematically gets more utility than CDT (which I think is adequately refuted by Abram Demski), or (b) they think that CDT has other advantages that outweigh the loss of utility (e.g., CDT feels more intuitive to them).
The latter argument for CDT isn’t circular, but as a fan of utility (i.e., of literally anything else in life), it seems very weak to me.
The main argument against CDT (in my view) is that it tends to get you less utility (regardless of whether you add self-modification so it can switch to other decision theories). Self-consistency is a secondary issue.
I do think the argument ultimately needs to come down to an intuition about self-effacingness.
The fact that agents earn less expected utility if they implement P_CDT than if they implement some other decision procedure seems to support the claim that agents should not implement P_CDT.
But there’s nothing logically inconsistent about believing both (a) that R_CDT is true and (b) that agents should not implement P_CDT. To again draw an analogy with a similar case, there’s also nothing logically inconsistent about believing both (a) that utilitarianism is true and (b) that agents should not in general make decisions by carrying out utilitarian reasoning.
So why shouldn’t I believe that R_CDT is true? The argument needs an additional step. And it seems to me like the most natural additional step here involves an intuition that the criterion of rightness would not be self-effacing.
More formally, it seems like the argument needs to be something along these lines:
1. Over their lifetimes, agents who implement P_CDT earn less expected utility than agents who implement certain other decision procedures.
2. (Assumption) Agents should implement whatever decision procedure will earn them the most expected lifetime utility.
3. Therefore, agents should not implement P_CDT.
4. (Assumption) The criterion of rightness is not self-effacing. Equivalently, if agents should not implement some decision procedure P_X, then it is not the case that R_X is true.
5. Therefore—as an implication of Steps 3 and 4—R_CDT is not true.
Whether you buy the “No Self-Effacement” assumption in Step 4—or, alternatively, the countervailing “Don’t Make Things Worse” assumption that supports R_CDT—seems to ultimately be a matter of intuition. At least, I don’t currently know what else people can appeal to here to resolve the disagreement.
[[SIDENOTE: Step 2 is actually a bit ambiguous, since it doesn’t specify how expected lifetime utility is being evaluated. For example, are we talking about expected lifetime utility from a causal or evidential perspective? But I don’t think this ambiguity matters much for the argument.]]
[[SECOND SIDENOTE: I’m using the phrase “self-effacing” rather than “self-contradictory” here, because I think it’s more standard and because “self-contradictory” seems to suggest logical inconsistency.]]
But there’s nothing logically inconsistent about believing both (a) that R_CDT is true and (b) that agents should not implement P_CDT.
If the thing being argued for is “R_CDT plus P_SONOFCDT”, then that makes sense to me, but is vulnerable to all the arguments I’ve been making: Son-of-CDT is in a sense the worst of both worlds, since it gets less utility than FDT and lacks CDT’s “Don’t Make Things Worse” principle.
If the thing being argued for is “R_CDT plus P_FDT”, then I don’t understand the argument. In what sense is P_FDT compatible with, or conducive to, R_CDT? What advantage does this have over “R_FDT plus P_FDT”? (Indeed, what difference between the two views would be intended here?)
So why shouldn’t I believe that R_CDT is true? The argument needs an additional step. And it seems to me like the most natural additional step here involves an intuition that the criterion of rightness would not be self-effacing.
The argument against “R_CDT plus P_SONOFCDT” doesn’t require any mention of self-effacingness; it’s entirely sufficient to note that P_SONOFCDT gets less utility than P_FDT.
The argument against “R_CDT plus P_FDT” seems to demand some reference to self-effacingness or inconsistency, or triviality / lack of teeth. But I don’t understand what this view would mean or why anyone would endorse it (and I don’t take you to be endorsing it).
For example, are we talking about expected lifetime utility from a causal or evidential perspective? But I don’t think this ambiguity matters much for the argument.
We want to evaluate actual average utility rather than expected utility, since the different decision theories are different theories of what “expected utility” means.
Hm, I think I may have misinterpreted your previous comment as emphasizing the point that P_CDT “gets you less utility” rather than the point that P_SONOFCDT “gets you less utility.” So my comment was aiming to explain why I don’t think the fact that P_CDT gets less utility provides a strong challenge to the claim that R_CDT is true (unless we accept the “No Self-Effacement Principle”). But it sounds like you might agree that this fact doesn’t on its own provide a strong challenge.
If the thing being argued for is “R_CDT plus P_SONOFCDT”, then that makes sense to me, but is vulnerable to all the arguments I’ve been making: Son-of-CDT is in a sense the worst of both worlds, since it gets less utility than FDT and lacks CDT’s “Don’t Make Things Worse” principle.
In response to the first argument alluded to here: “Gets the most [expected] utility” is ambiguous, as I think we’ve both agreed.
My understanding is that P_SONOFCDT is definitionally the policy that, if an agent decided to adopt it, would cause the largest increase in expected utility. So—if we evaluate the expected utility of a decision to adopt a policy from a causal perspective—it seems to me that P_SONOFCDT “gets the most expected utility.”
If we evaluate the expected utility of a policy from an evidential or subjunctive perspective, however, then another policy may “get the most utility” (because policy adoption decisions may be non-causally correlated).
Apologies if I’m off-base, but it reads to me like you might be suggesting an argument along these lines:
1. R_CDT says that it is rational to decide to follow a policy that would not maximize “expected utility” (defined in evidential/subjunctive terms).
2. (Assumption) But it is not rational to decide to follow a policy that would not maximize “expected utility” (defined in evidential/subjunctive terms).
3. Therefore R_CDT is not true.
The natural response to this argument is that it’s not clear why we should accept the assumption in Step 2. R_CDT says that the rationality of a decision depends on its “expected utility” defined in causal terms. So someone starting from the position that R_CDT is true obviously won’t accept the assumption in Step 2. R_EDT and R_FDT say that the rationality of a decision depends on its “expected utility” defined in evidential or subjunctive terms. So we might allude to R_EDT or R_FDT to justify the assumption, but of course this would also mean arguing backwards from the conclusion that the argument is meant to reach.
Overall at least this particular simple argument—that R_CDT is false because P_SONOFCDT gets less “expected utility” as defined in evidential/quasi-evidential terms—would seemingly fail due to circularity. But you may have in mind a different argument.
We want to evaluate actual average utility rather than expected utility, since the different decision theories are different theories of what “expected utility” means.
I felt confused by this comment. Doesn’t even R_FDT judge the rationality of a decision by its expected value (rather than its actual value)? And presumably you don’t want to say that someone who accepts unpromising gambles and gets lucky (ending up with high actual average utility) has made more “rational” decisions than someone who accepts promising gambles and gets unlucky (ending up with low actual average utility)?
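The lucky-gambler point can be made with two lines of arithmetic (the particular 90%/10% gambles here are illustrative assumptions, not from the original comment):

```python
# Expected value of a simple binary gamble.
def expected_value(p_win, win, lose):
    return p_win * win + (1 - p_win) * lose

ev_promising = expected_value(0.9, 100, -100)    # +80: a gamble worth taking
ev_unpromising = expected_value(0.1, 100, -100)  # -80: a gamble worth refusing

# Realized outcomes can rank the other way: the promising gamble can still
# lose (-100) while the unpromising one wins (+100). A criterion keyed to
# *actual* rather than expected utility would call the lucky reckless
# gambler the more "rational" of the two.
unlucky_promising_outcome = -100
lucky_unpromising_outcome = 100

print(ev_promising > ev_unpromising)                          # True
print(lucky_unpromising_outcome > unlucky_promising_outcome)  # True
```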
You also correctly point out that the decision procedure that R_CDT implies agents should rationally commit to—P_SONOFCDT—sometimes outputs decisions that definitely make things worse. So “Don’t Make Things Worse” implies that some of the decisions outputted by P_SONOFCDT are irrational.
But I still don’t see what the argument is here unless we’re assuming “No Self-Effacement.” It still seems to me like we have a few initial steps and then a missing piece.
1. (Observation) R_CDT implies that it is rational to commit to following the decision procedure P_SONOFCDT.
2. (Observation) P_SONOFCDT sometimes outputs decisions that definitely make things worse.
3. (Assumption) It is irrational to take decisions that definitely make things worse. In other words, the “Don’t Make Things Worse” Principle is true.
4. Therefore, as an implication of Step 2 and Step 3, P_SONOFCDT sometimes outputs irrational decisions.
5. ???
6. Therefore, R_CDT is false.
The “No Self-Effacement” Principle is equivalent to the principle that: If a criterion of rightness implies that it is rational to commit to a decision procedure, then that decision procedure only produces rational actions. So if we were to assume “No Self-Effacement” in Step 5 then this would allow us to arrive at the conclusion that R_CDT is false. But if we’re not assuming “No Self-Effacement,” then it’s not clear to me how we get there.
Actually, in the context of this particular argument, I suppose we don’t really have the option of assuming that “No Self-Effacement” is true—because this assumption would be inconsistent with the earlier assumption that “Don’t Make Things Worse” is true. So I’m not sure it’s actually possible to make this argument schema work in any case.
There may be a pretty different argument here, which you have in mind. I at least don’t see it yet though.
There may be a pretty different argument here, which you have in mind. I at least don’t see it yet though.
Perhaps the argument is something like:
“Don’t make things worse” (DMTW) is one of the intuitions that leads us to favoring R_CDT
But the actual policy that R_CDT recommends does not in fact follow DMTW
So R_CDT only gets intuitive appeal from DMTW to the extent that DMTW was about R’s, and not about P’s
But intuitions are probably(?) not that precisely targeted, so R_CDT shouldn’t get to claim the full intuitive endorsement of DMTW. (Yes, DMTW endorses it more than it endorses R_FDT, but R_CDT is still at least somewhat counter-intuitive when judged against the DMTW intuition.)
So R_CDT only gets intuitive appeal from DMTW to the extent that DMTW was about R’s, and not about P’s
But intuitions are probably(?) not that precisely targeted, so R_CDT shouldn’t get to claim the full intuitive endorsement of DMTW. (Yes, DMTW endorses it more than it endorses R_FDT, but R_CDT is still at least somewhat counter-intuitive when judged against the DMTW intuition.)
Here are two logically inconsistent principles that could be true:
Don’t Make Things Worse: If a decision would definitely make things worse, then taking that decision is not rational.
Don’t Commit to a Policy That In the Future Will Sometimes Make Things Worse: It is not rational to commit to a policy that, in the future, will sometimes output decisions that definitely make things worse.
I have strong intuitions that the first one is true. I have much weaker (comparatively negligible) intuitions that the second one is true. Since they’re mutually inconsistent, I reject the second and accept the first. I imagine this is also true of most other people who are sympathetic to R_CDT.
One could argue that R_CDT sympathizers don’t actually have much stronger intuitions regarding the first principle than the second—i.e. that their intuitions aren’t actually very “targeted” on the first one—but I don’t think that would be right. At least, it’s not right in my case.
A more viable strategy might be to argue for something like a meta-principle:
The ‘Don’t Make Things Worse’ Meta-Principle: If you find “Don’t Make Things Worse” strongly intuitive, then you should also find “Don’t Commit to a Policy That In the Future Will Sometimes Make Things Worse” just about as intuitive.
If the meta-principle were true, then I guess this would sort of imply that people’s intuitions in favor of “Don’t Make Things Worse” should be self-neutralizing. They should come packaged with equally strong intuitions for another position that directly contradicts it.
But I don’t see why the meta-principle should be true. At least, my intuitions in favor of the meta-principle are way less strong than my intuitions in favor of “Don’t Make Things Worse” :)
Just to say slightly more on this, I think the Bomb case is again useful for illustrating my (I think not uncommon) intuitions here.
Bomb Case: Omega puts a million dollars in a transparent box if he predicts you’ll open it. He puts a bomb in the transparent box if he predicts you won’t open it. He’s only wrong about one in a trillion times.
Now suppose you enter the room and see that there’s a bomb in the box. You know that if you open the box, the bomb will explode and you will die a horrible and painful death. If you leave the room and don’t open the box, then nothing bad will happen to you. You’ll return to a grateful family and live a full and healthy life. You understand all this. You want so badly to live. You then decide to walk up to the bomb and blow yourself up.
Intuitively, this decision strikes me as deeply irrational. You’re intentionally taking an action that you know will cause a horrible outcome that you want badly to avoid. It feels very relevant that you’re flagrantly violating the “Don’t Make Things Worse” principle.
Now, let’s step back a time step. Suppose you know that you’re the sort of person who would refuse to kill yourself by detonating the bomb. You might decide that—since Omega is such an accurate predictor—it’s worth taking a pill to turn yourself into the sort of person who would open the box no matter what, to increase your odds of getting a million dollars. You recognize that this may lead you, in the future, to take an action that makes things worse in a horrifying way. But you calculate that the decision you’re making now is nonetheless making things better in expectation.
This decision strikes me as pretty intuitively rational. You’re violating the second principle—the “Don’t Commit to a Policy...” Principle—but this violation just doesn’t seem that intuitively relevant or remarkable to me. I personally feel like there is nothing too odd about the idea that it can be rational to commit to violating principles of rationality in the future.
(This is obviously just a description of my own intuitions, as they stand, though.)
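For what it's worth, here is the ex-ante calculation the pill-taker is implicitly making. The one-in-a-trillion error rate is from the problem statement; the dollar value assigned to a horrible death is purely an illustrative assumption:

```python
# Ex-ante expected values of the two dispositions in the Bomb case.
ERROR = 1e-12                    # predictor's error rate (from the problem)
MILLION = 1_000_000
DEATH = -1_000_000_000           # hypothetical disutility of detonating the bomb

# Disposition: "will open the box no matter what".
# Almost surely predicted to open -> million in the box; with tiny
# probability mispredicted -> bomb in the box, and you open it anyway.
ev_opener = (1 - ERROR) * MILLION + ERROR * DEATH

# Disposition: "will never open the box".
# The box (usually containing a bomb) just sits there unopened: payoff 0.
ev_refuser = (1 - ERROR) * 0 + ERROR * 0

print(ev_opener > ev_refuser)  # True: committing to open is better in expectation
```

So taking the pill looks clearly rational ex ante, even though it commits you to an action that, in the unlucky branch, definitely makes things worse.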
It feels very relevant that you’re flagrantly violating the “Don’t Make Things Worse” principle.
By triggering the bomb, you’re making things worse from your current perspective, but making things better from the perspective of earlier you. Doesn’t that seem strange and deserving of an explanation? The explanation from a UDT perspective is that by updating upon observing the bomb, you actually changed your utility function. You used to care about both the possible worlds where you end up seeing a bomb in the box, and the worlds where you don’t. After updating, you think you’re either a simulation within Omega’s prediction so your action has no effect on yourself or you’re in the world with a real bomb, and you no longer care about the version of you in the world with a million dollars in the box, and this accounts for the conflict/inconsistency.
Given the human tendency to change our (UDT-)utility functions by updating, it’s not clear what to do (or what is right), and I think this reduces UDT’s intuitive appeal and makes it less of a slam-dunk over CDT/EDT. But it seems to me that it takes switching to the UDT perspective to even understand the nature of the problem. (Quite possibly this isn’t adequately explained in MIRI’s decision theory papers.)
Don’t Make Things Worse: If a decision would definitely make things worse, then taking that decision is not rational.
Don’t Commit to a Policy That In the Future Will Sometimes Make Things Worse: It is not rational to commit to a policy that, in the future, will sometimes output decisions that definitely make things worse.
...
One could argue that R_CDT sympathizers don’t actually have much stronger intuitions regarding the first principle than the second—i.e. that their intuitions aren’t actually very “targeted” on the first one—but I don’t think that would be right. At least, it’s not right in my case.
I would agree that, with these two principles as written, more people would agree with the first. (And certainly believe you that that’s right in your case.)
But I feel like the second doesn’t quite capture what I had in mind regarding the DMTW intuition applied to P’s.
Consider an alternate version:
If a decision would definitely make things worse, then taking that decision is not good policy.
Or alternatively:
If a decision would definitely make things worse, a rational person would not take that decision.
It seems to me that these two claims are naively intuitive on their face, in roughly the same way that the “… then taking that decision is not rational” version is. And it’s only after you’ve considered prisoners’ dilemmas or Newcomb’s paradox, etc. that you realize that good policy (or being a rational agent) actually diverges from what’s rational in the moment.
(But maybe others would disagree on how intuitive these versions are.)
EDIT: And to spell out my argument a bit more: if several alternate formulations of a principle are each intuitively appealing, and it turns out that whether some claim (e.g. R_CDT is true) is consistent with the principle comes down to the precise formulation used, then it’s not quite fair to say that the principle fully endorses the claim and that the claim is not counter-intuitive from the perspective of the original intuition.
Of course, this argument is moot if it’s true that the original DMTW intuition was always about rational in-the-moment action, and never about policies or actors. And maybe that’s the case? But I think it’s a little more ambiguous with the “… is not good policy” or “a rational person would not...” versions than with the “Don’t commit to a policy...” version.
EDIT2: Does what I’m trying to say make sense? (I felt like I was struggling a bit to express myself in this comment.)
If the thing being argued for is “R_CDT plus P_SONOFCDT” … If the thing being argued for is “R_CDT plus P_FDT...
Just as a quick sidenote:
I’ve been thinking of P_SONOFCDT as, by definition, the decision procedure that R_CDT implies that it is rational to commit to implementing.
If we define P_SONOFCDT this way, then anyone who believes that R_CDT is true must also believe that it is rational to implement P_SONOFCDT.
The belief that R_CDT is true and the belief that it is rational to implement P_FDT would only then be consistent if P_SONOFCDT were equivalent to P_FDT (which of course it isn’t). So I would be inclined to say that no one should believe in both the correctness of R_CDT and the rationality of implementing P_FDT.
[[EDIT: Actually, I need to distinguish between the decision procedure that it would be rational to commit to yourself and the decision procedure that it would be rational to build into other agents. These can sometimes be different. For example, suppose that R_CDT is true and that you’re building twin AI systems and you would like them both to succeed. Then it would be rational for you to give them decision procedures that will cause them to cooperate if they face each other in a prisoner’s dilemma (e.g. some version of P_FDT). But if R_CDT is true and you’ve just been born into the world as one of the twins, it would be rational for you to commit to a decision procedure that would cause you to defect if you face the other AI system in a prisoner’s dilemma (i.e. P_SONOFCDT). I slightly edited the above comment to reflect this. My tentative view—which I’ve alluded to above—is that the various proposed criteria of rightness don’t in practice actually diverge all that much when it comes to the question of what sorts of decision procedures we should build into AI systems. Although I also understand that MIRI is not mainly interested in the question of what sorts of decision procedures we should build into AI systems.]]
Another way to express the distinction I have in mind is that it’s between (a) a normative claim and (b) a process of making decisions.
This is similar to how you described it here:
Let’s suppose that some decisions are rational and others aren’t. We can then ask: What is it that makes a decision rational? What are the necessary and/or sufficient conditions? I think that this is the question that philosophers are typically trying to answer. [...]
When philosophers talk about “CDT,” for example, they are typically talking about a proposed criterion of rightness. Specifically, in this context, “CDT” is the claim that a decision is rational iff taking it would cause the largest expected increase in value. To avoid any ambiguity, let’s label this claim R_CDT.
We can also talk about “decision procedures.” A decision procedure is just a process or algorithm that an agent follows when making decisions.
This seems like it should instead be a 2x2 grid: something can be either normative or non-normative, and if it’s normative, it can be either an algorithm/procedure that’s being recommended, or a criterion of rightness like “a decision is rational iff taking it would cause the largest expected increase in value” (which we can perhaps think of as generalizing over a set of algorithms, and saying all the algorithms in a certain set are “normative” or “endorsed”).
Some of your discussion above seems to be focusing on the “algorithmic?” dimension, while other parts seem focused on “normative?”. I’ll say more about “normative?” here.
The reason I proposed the three distinctions in my last comment and organized my discussion around them is that I think they’re pretty concrete and crisply defined. It’s harder for me to accidentally switch topics or bundle two different concepts together when talking about “trying to optimize vs. optimizing as a side-effect”, “directly optimizing vs. optimizing via heuristics”, “initially optimizing vs. self-modifying to optimize”, or “function vs. algorithm”.
In contrast, I think “normative” and “rational” can mean pretty different things in different contexts, it’s easy to accidentally slide between different meanings of them, and their abstractness makes it easy to lose track of what’s at stake in the discussion.
E.g., “normative” is often used in the context of human terminal values, and it’s in this context that statements like this ring obviously true:
I guess my view here is that exploring normative claims will probably only be pretty indirectly useful for understanding “how decision-making works,” since normative claims don’t typically seem to have any empirical/mathematical/etc. implications. For example, to again use a non-decision-theoretic example, I don’t think that learning that hedonistic utilitarianism is true would give us much insight into the computer science or cognitive science of decision-making.
If we’re treating decision-theoretic norms as being like moral norms, then sure. I think there are basically three options:
1. Decision theory isn’t normative.
2. Decision theory is normative in the way that “murder is bad” or “improving aggregate welfare is good” is normative, i.e., it expresses an arbitrary terminal value of human beings.
3. Decision theory is normative in the way that game theory, probability theory, Boolean logic, the scientific method, etc. are normative (at least for beings that want accurate beliefs); or in the way that the rules and strategies of chess are normative (at least for beings that want to win at chess); or in the way that medical recommendations are normative (at least for beings that want to stay healthy).
Probability theory has obvious normative force in the context of reasoning and decision-making, but it’s not therefore arbitrary or irrelevant to understanding human cognition, AI, etc.
A lot of the examples you’ve cited are theories from moral philosophy about what’s terminally valuable. But decision theory is generally thought of as the study of how to make the right decisions, given a set of terminal preferences; it’s not generally thought of as the study of which decision-making methods humans happen to terminally prefer to employ. So I would put it in category 1 or 3.
You could indeed define an agent that terminally values making CDT-style decisions, but I don’t think most proponents of CDT or EDT would claim that their disagreement with UDT/FDT comes down to a values disagreement like that. Rather, they’d claim that rival decision theorists are making some variety of epistemic mistake. (And I would agree that the disagreement comes down to one party or the other making an epistemic mistake, though I obviously disagree about who’s mistaken.)
I actually don’t think the son-of-CDT agent, in this scenario, will take these sorts of non-causal correlations into account at all. (Modifying just yourself to take non-causal correlations into account won’t cause you to achieve better outcomes here.)
In the twin prisoner’s dilemma with son-of-CDT, both agents are following son-of-CDT and neither is following CDT (regardless of whether the fork happened before or after the switchover to son-of-CDT).
I think you can model the voting dilemma the same way, just with noise added because the level of correlation is imperfect and/or uncertain. Ten agents following the same decision procedure are trying to decide whether to stay home and watch a movie (which gives a small guaranteed benefit) or go to the polls (which costs them the utility of the movie, but gains them a larger utility iff the other nine agents go to the polls too). Ten FDT agents will vote in this case, if they know that the other agents will vote under similar conditions.
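As a rough sketch of that correlated-voting logic (the payoff numbers, and the simplification that the ten agents are perfectly correlated, are illustrative assumptions):

```python
# Ten agents run the same decision procedure. Staying home yields a small
# guaranteed benefit (the movie); the election is won, yielding a larger
# benefit to each agent, only if all ten vote.
MOVIE = 1
WIN = 10

def payoff_per_agent(everyone_votes):
    """Per-agent payoff under perfect correlation: either all ten vote,
    or all ten stay home."""
    if everyone_votes:
        return WIN    # forgo the movie, but the election is won
    return MOVIE      # watch the movie, election lost

# An FDT-style agent treats the other nine as correlated instances of the
# same procedure, so it compares 'we all vote' against 'we all stay home':
print(payoff_per_agent(True) > payoff_per_agent(False))  # True: vote

# A CDT-style agent instead holds the others fixed; given any fixed behavior
# of the rest, its own vote only changes the outcome in the knife-edge case
# where exactly the other nine voted, so it usually stays home.
```

With imperfect or uncertain correlation, the comparison is the same but with the payoffs weighted by the probability that the others mirror your choice.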
Decision theory is normative in the way that “murder is bad” or “improving aggregate welfare is good” is normative, i.e., it expresses an arbitrary terminal value of human beings.
Decision theory is normative in the way that game theory, probability theory, Boolean logic, the scientific method, etc. are normative (at least for beings that want accurate beliefs); or in the way that the rules and strategies of chess are normative (at least for beings that want to win at chess); or in the way that medical recommendations are normative (at least for beings that want to stay healthy).
[[Disclaimer: I’m not sure this will be useful, since it seems like most of discussions that verge on meta-ethics end up with neither side properly understanding the other.]]
I think the kind of decision theory that philosophers tend to work on is typically explicitly described as “normative.” (For example, the SEP article on decision theory is about “normative decision theory.”) So when I’m talking about “academic decision theories” or “proposed criteria of rightness” I’m talking about normative theories. When I use the word “rational” I’m also referring to a normative property.
I don’t think there’s any very standard definition of what it means for something to be normative, maybe because it’s often treated as something pretty close to a primitive concept, but a partial account is that a “normative theory” is a claim about what someone should do. At least this is what I have in mind. This is different from the second option you list (and I think the third one).
Some normative theories concern “ends.” These are basically claims about what people should do, if they can freely choose outcomes. For example: A subjectivist theory might say that people should maximize the fulfillment of their own personal preferences (whatever they are). Whereas a hedonistic utilitarian theory might say that people should maximize total happiness. I’m not sure what the best terminology is, and think this choice is probably relatively non-standard, but let’s label these “moral theories.”
Some normative theories, including “decision theories,” concern “means.” These theories put aside the question of which ends people should pursue and instead focus on how people should respond to uncertainty about the results/implications of their actions. For example: Expected utility theory says that people should take whatever actions maximize expected fulfillment of the relevant ends. Risk-weighted expected utility theory (and other alternative theories) say different things. Typical versions of CDT and EDT flesh out expected utility theory in different ways to specify what the relevant measure of “expected fulfillment” is.
Moral theory and normative decision theory seem to me to have pretty much the same status. They are both bodies of theory that bear on what people should do. On some views, the division between them is more a matter of analytic convenience than anything else. For example, David Enoch, a prominent meta-ethicist, writes: “In fact, I think that for most purposes [the line between the moral and the non-moral] is not a line worth worrying about. The distinction within the normative between the moral and the non-moral seems to me to be shallow compared to the distinction between the normative and the non-normative” (Taking Morality Seriously, 86).
One way to think of moral theories and normative decision theories is as two components that fit together to form more fully specified theories about what people should do. Moral theories describe the ends people should pursue; given these ends, decision theories then describe what actions people should take when in states of uncertainty. To illustrate, two examples of more complete normative theories that combine moral and decision-theoretic components would be: “You should take whatever action would in expectation cause the largest increase in the fulfillment of your preferences” and “You should take whatever action would, if you took it, lead you to anticipate the largest expected amount of future happiness in the world.” The first is subjectivism combined with CDT, while the second is total view hedonistic utilitarianism combined with EDT.
(On this conception, a moral theory is not a description of “an arbitrary terminal value of human beings.” Nor is decision theory “the study of which decision-making methods humans happen to terminally prefer to employ.” Both are theories about what people should do, rather than theories about what people’s preferences are.)
Normativity is obviously pretty often regarded as a spooky or insufficiently explained thing. So a plausible position is normative anti-realism: It might be the case that no normative claims are true, either because they’re all false or because they’re not even well-formed enough to take on truth values. If normative anti-realism is true, then one thing this means is that the philosophical decision theory community is mostly focused on a question that doesn’t really have an answer.
In the twin prisoner’s dilemma with son-of-CDT, both agents are following son-of-CDT and neither is following CDT (regardless of whether the fork happened before or after the switchover to son-of-CDT).
If I’m someone with a twin and I’m implementing P_CDT, I still don’t think I will choose to modify myself to cooperate in twin prisoner’s dilemmas. The reason is that modifying myself won’t cause my twin to cooperate; it will only cause me to cooperate, lowering the utility I receive.
(The fact that P_CDT agents won’t modify themselves to cooperate with their twins could of course be interpreted as a mark against R_CDT.)
I appreciate you taking the time to lay out these background points, and it does help me better understand your position, Ben; thanks!
If normative anti-realism is true, then one thing this means is that the philosophical decision theory community is mostly focused on a question that doesn’t really have an answer.
Some ancient Greeks thought that the planets were intelligent beings; yet many of the Greeks’ astronomical observations, and some of their theories and predictive tools, were still true and useful.
I think that terms like “normative” and “rational” are underdefined, so the question of realism about them is underdefined (cf. Luke Muehlhauser’s pluralistic moral reductionism).
I would say that (1) some philosophers use “rational” in a very human-centric way, which is fine as long as it’s done consistently; (2) others have a much more thin conception of “rational”, such as ‘tending to maximize utility’; and (3) still others want to have their cake and eat it too, building in a lot of human-value-specific content to their notion of “rationality”, but then treating this conception as though it had the same level of simplicity, naturalness, and objectivity as 2.
I think that type-1, type-2, and type-3 decision theorists have all contributed valuable AI-relevant conceptual progress in the past (most obviously, by formulating Newcomb’s problem, EDT, and CDT), and I think all three could do more of the same in the future. I think the type-3 decision theorists are making a mistake, but often more in the fashion of an ancient astronomer who’s accumulating useful and real knowledge but happens to have some false side-beliefs about the object of study, not in the fashion of a theologian whose entire object of study is illusory. (And not in the fashion of a developmental psychologist or historian whose subject matter is too human-centric to directly bear on game theory, AI, etc.)
I’d expect type-2 decision theorists to tend to be interested in more AI-relevant things than type-1 decision theorists, but on the whole I think the flavor of decision theory as a field has ended up being more type-2/3 than type-1. (And in this case, even type-1 analyses of “rationality” can be helpful for bringing various widespread background assumptions to light.)
If I’m someone with a twin and I’m implementing P_CDT, I still don’t think I will choose to modify myself to cooperate in twin prisoner’s dilemmas. The reason is that modifying myself won’t cause my twin to cooperate; it will only cause me to cooperate, lowering the utility I receive.
This is true if your twin was copied from you in the past. If your twin will be copied from you in the future, however, then you can indeed cause your twin to cooperate, assuming you have the ability to modify your own future decision-making so as to follow son-of-CDT’s prescriptions from now on.
Making the commitment to always follow son-of-CDT is an action you can take; the mechanistic causal consequence of this action is that your future brain and any physical systems that are made into copies of your brain in the future will behave in certain systematic ways. So from your present perspective (as a CDT agent), you can causally control future copies of yourself, as long as the act of copying hasn’t happened yet.
(And yes, by the time you actually end up in the prisoner’s dilemma, your future self will no longer be able to causally affect your copy. But this is irrelevant from the perspective of present-you; to follow CDT’s prescriptions, present-you just needs to pick the action that you currently judge will have the best consequences, even if that means binding your future self to take actions contrary to CDT’s future prescriptions.)
(If it helps, don’t think of the copy of you as “you”: just think of it as another environmental process you can influence. CDT prescribes taking actions that change the behavior of future copies of yourself in useful ways, for the same reason CDT prescribes actions that change the future course of other physical processes.)
I appreciate you taking the time to lay out these background points, and it does help me better understand your position, Ben; thanks!
Thank you for taking the time to respond as well! :)
I think that terms like “normative” and “rational” are underdefined, so the question of realism about them is underdefined (cf. Luke Muehlhauser’s pluralistic moral reductionism).
I would say that (1) some philosophers use “rational” in a very human-centric way, which is fine as long as it’s done consistently; (2) others have a much more thin conception of “rational”, such as ‘tending to maximize utility’; and (3) still others want to have their cake and eat it too, building in a lot of human-value-specific content to their notion of “rationality”, but then treating this conception as though it had the same level of simplicity, naturalness, and objectivity as 2.
I’m not positive I understand what (1) and (3) are referring to here, but I would say that there’s also at least a fourth way that philosophers often use the word “rational” (which is also the main way I use the word “rational”). This is to refer to an irreducibly normative concept.
The basic thought here is that not every concept can be usefully described in terms of more primitive concepts (i.e. “reduced”). As a close analogy, a dictionary cannot give useful non-circular definitions of every possible word—it requires the reader to have a pre-existing understanding of some foundational set of words. As a wonkier analogy, if we think of the space of possible concepts as a sort of vector space, then we need an initial “basis” of primitive concepts with which to describe the rest.
Some examples of concepts that are arguably irreducible are “truth,” “set,” “property,” “physical,” “existence,” and “point.” Insofar as we can describe these concepts in terms of slightly more primitive ones, the descriptions will typically fail to be very useful or informative, and we will typically struggle to break the slightly more primitive ones down any further.
To focus on the example of “truth,” some people have tried to reduce the concept substantially. It has been argued, for example, that when someone says “X is true” what they really mean, or should mean, is “I personally believe X” or “believing X is good for you.” But I think these suggested reductions pretty obviously don’t entirely capture what people mean when they say “X is true.” The phrase “X is true” also has an important meaning that is not amenable to this sort of reduction.
[[EDIT: “Truth” may be a bad example, since it’s relatively controversial and since I’m pretty much totally unfamiliar with work on the philosophy of truth. But insofar as any concepts seem irreducible to you in this sense, or you buy the more general argument that some concepts will necessarily be irreducible, the particular choice of example used here isn’t essential to the overall point.]]
Some philosophers also employ normative concepts that they say cannot be reduced in terms of non-normative (e.g. psychological) properties. These concepts are said to be irreducibly normative.
For example, here is Parfit on the concept of a normative reason (OWM, p. 1):
We can have reasons to believe something, to do something, to have some desire or aim, and to have many other attitudes and emotions, such as fear, regret, and hope. Reasons are given by facts, such as the fact that someone’s finger-prints are on some gun, or that calling an ambulance would save someone’s life.
It is hard to explain the concept of a reason, or what the phrase ‘a reason’ means. Facts give us reasons, we might say, when they count in favour of our having some attitude, or our acting in some way. But ‘counts in favour of’ means roughly ‘gives a reason for’. Like some other fundamental concepts, such as those involved in our thoughts about time, consciousness, and possibility, the concept of a reason is indefinable in the sense that it cannot be helpfully explained merely by using words. We must explain such concepts in a different way, by getting people to think thoughts that use these concepts. One example is the thought that we always have a reason to want to avoid being in agony.
When someone says that a concept they are using is irreducible, this is obviously some reason for suspicion. A natural suspicion is that the real explanation for why they can’t give a useful description is that the concept is seriously muddled or fails to grip onto anything in the real world. For example, whether this is fair or not, I have this sort of suspicion about the concept of “dao” in daoist philosophy.
But, again, it will necessarily be the case that some useful and valid concepts are irreducible. So we should sometimes take evocations of irreducible concepts seriously. A concept that is mostly undefined is not always problematically “underdefined.”
When I talk about “normative anti-realism,” I mostly have in mind the position that claims evoking irreducibly normative concepts are never true (either because these claims are all false or because they don’t even have truth values). For example: Insofar as the word “should” is being used in an irreducibly normative sense, there is nothing that anyone “should” do.
[[Worth noting, though: The term “normative realism” is sometimes given a broader definition than the one I’ve sketched here. In particular, it often also includes a position known as “analytic naturalist realism” that denies the relevance of irreducibly normative concepts. I personally feel I understand this position less well, and I think I sometimes waffle between using the broader and narrower definitions of “normative realism.” I also more generally want to stress that not everyone who makes claims about a “criterion of rightness” or employs other seemingly normative language is actually a normative realist in the narrow or even broad sense; what I’m doing here is just sketching one common, especially salient perspective.]]
One motivation for evoking irreducibly normative concepts is the observation that—in the context of certain discussions—it’s not obvious that there’s any close-to-sensible way to reduce the seemingly normative concepts that are being used.
For example, suppose we follow a suggestion once made by Eliezer to reduce the concept of “a rational choice” to the concept of “a winning choice” (or, in line with the type-2 conception you mention, a “utility-maximizing choice”). It seems difficult to make sense of a lot of basic claims about rationality if we use this reduction—and other obvious alternative reductions don’t seem to fare much better. To mostly quote from a comment I made elsewhere:
Suppose we want to claim that it is rational to try to maximize the expected winning (i.e. the expected fulfillment of your preferences). Due to randomness/uncertainty, though, an agent that tries to maximize expected “winning” won’t necessarily win compared to an agent that does something else. If I spend a dollar on a lottery ticket with a one-in-a-billion chance of netting me a billion-and-one “win points,” then I’m taking the choice that maximizes expected winning but I’m also almost certain to lose. So we can’t treat “the rational action” as synonymous with “the action taken by an agent that wins.”
We can try to patch up the issue here by reducing “the rational action” to “the action that is consistent with the VNM axioms,” but in fact either action in this case is consistent with the VNM axioms. The VNM axioms don’t imply that an agent must maximize the expected desirability of outcomes. They just imply that an agent must maximize the expected value of some function. It is totally consistent with the axioms, for example, to be effectively risk averse and instead maximize the expected square root of desirability. And if we define “the rational action” in this way, then the claim “it is rational to act consistently with the VNM axioms” becomes an empty tautology.
We could of course instead reduce “the rational action” to “the action that maximizes expected winning.” But now, of course, the claim “it is rational to maximize expected winning” no longer has any substantive content. When we make this claim, do we really mean to be stating an empty tautology? And do we really consider it trivially incoherent to wonder—e.g. in a Pascal’s mugging scenario—whether it might be “rational” to take an action other than the one that maximizes expected winning? If not, then this reduction is a very poor fit too.
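The arithmetic behind the lottery and risk-aversion points above can be checked with a quick numerical sketch (the lottery numbers come from the example; the square-root utility function is just one illustrative risk-averse choice, and the starting wealth of 1 is an assumption made so that outcomes stay nonnegative):

```python
import math

# Numbers from the lottery example: a $1 ticket with a one-in-a-billion
# chance of a *net* gain of a billion-and-one "win points."
p_win = 1e-9
ev = p_win * (1e9 + 1) + (1 - p_win) * (-1.0)
print(ev > 0)      # buying maximizes expected winning...
print(1 - p_win)   # ...yet the buyer loses with probability ~0.999999999

# A VNM-consistent but risk-averse agent can maximize the expected
# *square root* of desirability instead. Measuring outcomes as final
# wealth, starting from wealth 1 (so outcomes stay nonnegative):
def expected_u(p, wealth_if_win, wealth_if_lose, u):
    return p * u(wealth_if_win) + (1 - p) * u(wealth_if_lose)

certain = 1.0  # utility of not buying: u(1) = 1 for both agents here
buy_linear = expected_u(p_win, 1 + 1e9 + 1, 0.0, lambda w: w)
buy_sqrt = expected_u(p_win, 1 + 1e9 + 1, 0.0, math.sqrt)
print(buy_linear > certain)  # the expected-winning maximizer buys
print(buy_sqrt < certain)    # the sqrt-maximizer declines the same gamble
```

Both agents maximize the expected value of *some* function, so both satisfy the VNM axioms; they nonetheless choose differently.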
It ultimately seems hard, at least to me, to make non-vacuous true claims about what it’s “rational” to do without evoking a non-reducible notion of “rationality.” If we are evoking a non-reducible notion of rationality, then it makes sense that we can’t provide a satisfying reduction.
At the same time, though, I do think there are also really good and hard-to-counter epistemological objections to the existence of irreducibly normative properties (e.g. the objection described in this paper). You might also find the difficulty of reducing normative concepts a lot less obvious-seeming or problematic than I do. You might think, for example, that the difficulty of reducing “rationality” is less like the difficulty of reducing “truth” (which IMO mainly reflects the fact that truth is an important primitive concept) and more like the difficulty of defining the word “soup” in a way that perfectly matches our intuitive judgments about what counts as “soup” (which IMO mainly reflects the fact that “soup” is a high-dimensional concept). So I definitely don’t want to say normative realism is obviously or even probably right.
I mainly just want to communicate the sort of thing that I think a decent chunk of philosophers have in mind when they talk about a “rational decision” or a “criterion of rightness.” Although, philosophy being philosophy, plenty of people of course have in mind plenty of different things.
So, as an experiment, I’m going to be a very obstinate reductionist in this comment. I’ll insist that a lot of these hard-seeming concepts aren’t so hard.
Many of them are complicated, in the fashion of “knowledge”—they admit an endless variety of edge cases and exceptions—but these complications are quirks of human cognition and language rather than deep insights into ultimate metaphysical reality. And where there’s a simple core we can point to, that core generally isn’t mysterious.
It may be inconvenient to paraphrase the term away (e.g., because it packages together several distinct things in a nice concise way, or has important emotional connotations, or does important speech-act work like encouraging a behavior). But when I say it “isn’t mysterious”, I mean it’s pretty easy to see how the concept can crop up in human thought even if it doesn’t belong on the short list of deep fundamental cosmic structure terms.
I would say that there’s also at least a fourth way that philosophers often use the word “rational,” which is also the main way I use the word “rational.” This is to refer to an irreducibly normative concept.
Why is this a fourth way? My natural response is to say that normativity itself is either a messy, parochial human concept (like “love,” “knowledge,” “France”), or it’s not (in which case it goes in bucket 2).
Some examples of concepts that are arguably irreducible are “truth,” “set,” “property,” “physical,” “existence,” and “point.”
Picking on the concept here that seems like the odd one out to me: I feel confident that there isn’t a cosmic law (of nature, or of metaphysics, etc.) that includes “truth” as a primitive (unless the list of primitives is incomprehensibly long). I could see an argument for concepts like “intentionality/reference”, “assertion”, or “state of affairs”, though the former two strike me as easy to explain in simple physical terms.
Mundane empirical “truth” seems completely straightforward. Then there’s the truth of sentences like “Frodo is a hobbit”, “2+2=4”, “I could have been the president”, “Hamburgers are more delicious than battery acid”… Some of these are easier or harder to make sense of in the naive correspondence model, but regardless, it seems clear that our colloquial use of the word “true” to refer to all these different statements is pre-philosophical, and doesn’t reflect anything deeper than that “each of these sentences at least superficially looks like it’s asserting some state of affairs, and each sentence satisfies the conventional assertion-conditions of our linguistic community”.
I think that philosophers are really good at drilling down on a lot of interesting details and creative models for how we can try to tie these disparate speech-acts together. But I think there’s also a common failure mode in philosophy of treating these questions as deeper, more mysterious, or more joint-carving than the facts warrant. Just because you can argue about the truthmakers of “Frodo is a hobbit” doesn’t mean you’re learning something deep about the universe (or even something particularly deep about human cognition) in the process.
[Parfit:] It is hard to explain the concept of a reason, or what the phrase ‘a reason’ means. Facts give us reasons, we might say, when they count in favour of our having some attitude, or our acting in some way. But ‘counts in favour of’ means roughly ‘gives a reason for’. Like some other fundamental concepts, such as those involved in our thoughts about time, consciousness, and possibility, the concept of a reason is indefinable in the sense that it cannot be helpfully explained merely by using words.
Suppose I build a robot that updates hypotheses based on observations, then selects actions that its hypotheses suggest will help it best achieve some goal. When the robot is deciding which hypotheses to put more confidence in based on an observation, we can imagine it thinking, “To what extent is observation o a [WORD] to believe hypothesis h?” When the robot is deciding whether it assigns enough probability to h to choose an action a, we can imagine it thinking, “To what extent is P(h)=0.7 a [WORD] to choose action a?” As a shorthand, when observation o updates a hypothesis h that favors an action a, the robot can also ask to what extent o itself is a [WORD] to choose a.
When two robots meet, we can moreover add that they negotiate a joint “compromise” goal that allows them to work together rather than fight each other for resources. In communicating with each other, they then start also using “[WORD]” where an action is being evaluated relative to the joint goal, not just the robot’s original goal.
Thus when Robot A tells Robot B “I assign probability 90% to ‘it’s noon’, which is [WORD] to have lunch”, A may be trying to communicate that A wants to eat, or that A thinks eating will serve A and B’s joint goal. (This gets even messier if the robots have an incentive to obfuscate which actions and action-recommendations are motivated by the personal goal vs. the joint goal.)
If you decide to relabel “[WORD]” as “reason”, I claim that this captures a decent chunk of how people use the phrase “a reason”. “Reason” is a suitcase word, but that doesn’t mean there are no similarities between e.g. “data my goals endorse using to adjust the probability of a given hypothesis” and “probabilities-of-hypotheses my goals endorse using to select an action”, or that the similarity is mysterious and ineffable.
(I recognize that the above story leaves out a lot of important and interesting stuff. Though past a certain point, I think the details will start to become Gettier-case nitpicks, as with most concepts.)
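The robot story above can be sketched in a few lines of code (a toy illustration: the hypotheses, probabilities, and utilities are made up, and the function names are my own gloss on the story, not anything from the original comments):

```python
# Toy version of the robot story: hypotheses are updated on observations,
# and actions are scored against a goal. "[WORD]" shows up in two places:
# as evidential support for a hypothesis and as decision-relevant support
# for an action.

def bayes_update(prior, likelihoods, observation):
    """Posterior P(h | o) over hypotheses h."""
    unnorm = {h: prior[h] * likelihoods[h][observation] for h in prior}
    total = sum(unnorm.values())
    return {h: p / total for h, p in unnorm.items()}

def support_for_hypothesis(prior, likelihoods, o, h):
    # "To what extent is observation o a [WORD] to believe h?"
    return bayes_update(prior, likelihoods, o)[h] - prior[h]

def support_for_action(posterior, utility, a):
    # "To what extent are my current probabilities a [WORD] to choose a?"
    return sum(posterior[h] * utility[h][a] for h in posterior)

prior = {"noon": 0.5, "midnight": 0.5}
likelihoods = {"noon": {"bright": 0.9}, "midnight": {"bright": 0.1}}
utility = {"noon": {"lunch": 1.0, "sleep": 0.0},
           "midnight": {"lunch": 0.0, "sleep": 1.0}}

post = bayes_update(prior, likelihoods, "bright")
print(round(support_for_hypothesis(prior, likelihoods, "bright", "noon"), 2))  # 0.4
best = max(["lunch", "sleep"], key=lambda a: support_for_action(post, utility, a))
print(best)  # "lunch": the bright observation ends up a [WORD] to have lunch
```

The same “[WORD]” bookkeeping covers both uses: evidence-to-hypothesis support and probability-to-action support, as in the A-and-B robot example.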
For example, suppose we follow a suggestion once made by Eliezer to reduce the concept of “a rational choice” to the concept of “a winning choice” (or, in line with the type-2 conception you mention, a “utility-maximizing choice”).
That essay isn’t trying to “reduce” the term “rationality” in the sense of taking a pre-existing word and unpacking or translating it. The essay is saying that what matters is utility, and if a human being gets too invested in verbal definitions of “what the right thing to do is”, they risk losing sight of the thing they actually care about and were originally in the game to try to achieve (i.e., their utility).
Therefore: if you’re going to use words like “rationality”, make sure that the words in question won’t cause you to shoot yourself in the foot and take actions that will end up costing you utility (e.g., costing human lives, costing years of averted suffering, costing money, costing anything or everything). And if you aren’t using “rationality” in a safe “nailed-to-utility” way, make sure that you’re willing to turn on a dime and stop being “rational” the second your conception of rationality starts telling you to throw away value.
It ultimately seems hard, at least to me, to make non-vacuous true claims about what it’s “rational” to do without evoking a non-reducible notion of “rationality.”
“Rationality” is a suitcase word. It refers to lots of different things. On LessWrong, examples include not just “(systematized) winning” but (as noted in the essay) “Bayesian reasoning”, or in Rationality: Appreciating Cognitive Algorithms, “cognitive algorithms or mental processes that systematically produce belief-accuracy or goal-achievement”. In philosophy, the list is a lot longer.
The common denominator seems to largely be “something something reasoning / deliberation” plus (as you note) “something something normativity / desirability / recommendedness / requiredness”.
The idea of “normativity” doesn’t currently seem that mysterious to me either, though you’re welcome to provide perplexing examples. My initial take is that it seems to be a suitcase word containing a bunch of ideas tied to:
Goals/preferences/values, especially overridingly strong ones.
Encouraged, endorsed, mandated, or praised conduct.
Encouraging, endorsing, mandating, and praising are speech-acts that seem very central to how humans perceive and intervene on social situations; and social situations seem pretty central to human cognition overall. So I don’t think it’s particularly surprising if words associated with such loaded ideas would have fairly distinctive connotations and seem to resist reduction, especially reduction that neglects the pragmatic dimensions of human communication and only considers the semantic dimension.
I may write up more object-level thoughts here, because this is interesting, but I just wanted to quickly emphasize the upshot that initially motivated me to write up this explanation.
(I don’t really want to argue here that non-naturalist or non-analytic naturalist normative realism of the sort I’ve just described is actually a correct view; I mainly wanted to give a rough sense of what the view consists of and what leads people to it. It may well be the case that the view is wrong, because all true normative-seeming claims are in principle reducible to claims about things like preferences. I think the comments you’ve just made cover some reasons to suspect this.)
The key point is just that when these philosophers say that “Action X is rational,” they are explicitly reporting that they do not mean “Action X suits my terminal preferences” or “Action X would be taken by an agent following a policy that maximizes lifetime utility” or any other such reduction.
I think that when people are very insistent that they don’t mean something by their statements, it makes sense to believe them. This implies that the question they are discussing—“What are the necessary and sufficient conditions that make a decision rational?”—is distinct from questions like “What decision would an agent that tends to win take?” or “What decision procedure suits my terminal preferences?”
It may be the case that the question they are asking is confused or insensible—because any sensible question would be reducible—but it’s in any case different. So I think it’s a mistake to interpret at least these philosophers’ discussions of “decision theories” or “criteria of rightness” as though they were discussions of things like terminal preferences or winning strategies. And it doesn’t seem to me like the answer to the question they’re asking (if it has an answer) would likely imply anything much about things like terminal preferences or winning strategies.
[[NOTE: Plenty of decision theorists are not non-naturalist or non-analytic naturalist realists, though. It’s less clear to me how related or unrelated the thing they’re talking about is to issues of interest to MIRI. I think that the conception of rationality I’m discussing here mainly just presents an especially clear case.]]
This seems like it should instead be a 2x2 grid: something can be either normative or non-normative, and if it’s normative, it can be either an algorithm/procedure that’s being recommended, or a criterion of rightness like “a decision is rational iff taking it would cause the largest expected increase in value” (which we can perhaps think of as generalizing over a set of algorithms, and saying all the algorithms in a certain set are “normative” or “endorsed”).
Just on this point: I think you’re right I may be slightly glossing over certain distinctions, but I might still draw them slightly differently (rather than doing a 2x2 grid). Some different things one might talk about in this context:
Decisions
Decision procedures
The decision procedure that is optimal with regard to some given metric (e.g. the decision procedure that maximizes expected lifetime utility for some particular way of calculating expected utility)
The set of properties that makes a decision rational (“criterion of rightness”)
A claim about what the criterion of rightness is (“normative decision theory”)
The decision procedure that it would be rational to decide to build into an agent (as implied by the criterion of rightness)
(4), (5), and (6) have to do with normative issues, while (1), (2), and (3) can be discussed without getting into normativity.
My current (although not firmly held) view is also that (6) probably isn’t very sensitive to what the criterion of rightness is, so in practice it can be reasoned about without going too deep into the weeds thinking about competing normative decision theories.
One possible criterion of rightness, which I’ll call R_UDT, is something like this: An action is rational only if it would have been chosen by whatever decision procedure would have produced the most expected value if consistently followed over an agent’s lifetime. For example, this criterion of rightness says that it is rational to one-box in the transparent Newcomb scenario because agents who consistently follow one-boxing policies tend to do better over their lifetimes.
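As a rough numerical sketch of why a policy-level criterion like R_UDT favors one-boxing (illustrative payoffs only; the transparent-Newcomb case from the text is simplified here to the standard Newcomb payoffs, and the predictor accuracy is an assumed free parameter):

```python
# Standard Newcomb payoffs: the opaque box holds $1M iff the predictor
# expected one-boxing; the transparent box always holds $1K.
def policy_ev(one_boxes, accuracy=0.99):
    # Probability the opaque box was filled, given the agent's policy
    # and a predictor of the given accuracy.
    p_box_filled = accuracy if one_boxes else 1 - accuracy
    return p_box_filled * 1_000_000 + (0 if one_boxes else 1_000)

ev_one_box = policy_ev(True)    # ~990,000
ev_two_box = policy_ev(False)   # ~11,000
print(ev_one_box > ev_two_box)  # the one-boxing *policy* does better
```

Evaluated policy-by-policy in this way, consistent one-boxers come out ahead for any predictor accuracy meaningfully above chance, which is the sense in which R_UDT calls one-boxing rational.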
I could be wrong, but I associate the “success-first approach” with something like the claim that R_UDT is true. This would definitely constitute a really interesting and significant divergence from mainstream opinion within academic decision theory. Academic decision theorists should care a lot about whether or not it’s true.
But I’m also not sure if it matters very much, practically, whether R_UDT or R_CDT is true. It’s not obvious to me that they recommend building different kinds of decision procedures into AI systems. For example, both seem to recommend building AI systems that would one-box in the transparent Newcomb scenario.
I disagree that any of the distinctions here are purely semantic. But one could argue that normative anti-realism is true. In this case, there wouldn’t really be any such thing as the criterion of rightness for decisions. Neither R_CDT nor R_UDT nor any other proposed criterion would be “correct.”
In this case, though, I think there would be even less reason to engage with academic decision theory literature. The literature would be focused on a question that has no real answer.
[[EDIT: Note that Will also emphasizes the importance of the criterion-of-rightness vs. decision-procedure distinction in his critique of the FDT paper: “[T]hey’re [most often] asking what the best decision procedure is, rather than what the best criterion of rightness is… But, if that’s what’s going on, there are a whole bunch of issues to dissect. First, it means that FDT is not playing the same game as CDT or EDT, which are proposed as criteria of rightness, directly assessing acts. So it’s odd to have a whole paper comparing them side-by-side as if they are rivals.”]]
I agree that these three distinctions are important:
“Picking policies based on whether they satisfy a criterion X” vs. “Picking policies that happen to satisfy a criterion X”. (E.g., trying to pick a utilitarian policy vs. unintentionally behaving utilitarianly while trying to do something else.)
“Trying to follow a decision rule Y ‘directly’ or ‘on the object level’” vs. “Trying to follow a decision rule Y by following some other decision rule Z that you think satisfies Y”. (E.g., trying to naïvely follow utilitarianism without any assistance from sub-rules, heuristics, or self-modifications, vs. trying to follow utilitarianism by following other rules or mental habits you’ve come up with that you expect will make you better at selecting utilitarianism-endorsed actions.)
“A decision rule that prescribes outputting some action or policy and doesn’t care how you do it” vs. “A decision rule that prescribes following a particular set of cognitive steps that will then output some action or policy”. (E.g., a rule that says ‘maximize the aggregate welfare of moral patients’ vs. a specific mental algorithm intended to achieve that end.)
The first distinction above seems less relevant here, since we’re mostly discussing AI systems and humans that are self-aware about their decision criteria and explicitly “trying to do what’s right”.
As a side-note, I do want to emphasize that from the MIRI cluster’s perspective, it’s fine for correct reasoning in AGI to arise incidentally or implicitly, as long as it happens somehow (and as long as the system’s alignment-relevant properties aren’t obscured and the system ends up safe and reliable).
The main reason to work on decision theory in AI alignment has never been “What if people don’t make AI ‘decision-theoretic’ enough?” or “What if people mistakenly think CDT is correct and so build CDT into their AI system?” The main reason is that the many forms of weird, inconsistent, and poorly-generalizing behavior prescribed by CDT and EDT suggest that there are big holes in our current understanding of how decision-making works, holes deep enough that we’ve even been misunderstanding basic things at the level of “decision-theoretic criterion of rightness”.
It’s not that I want decision theorists to try to build AI systems (even notional ones). It’s that there are things that currently seem fundamentally confusing about the nature of decision-making, and resolving those confusions seems like it would help clarify a lot of questions about how optimization works. That’s part of why these issues strike me as natural for academic philosophers to take a swing at (while also being continuous with theoretical computer science, game theory, etc.).
The second distinction (“following a rule ‘directly’ vs. following it by adopting a sub-rule or via self-modification”) seems more relevant. You write:
Far from being a distinction proponents of UDT/FDT neglect, this is one of the main grounds on which UDT/FDT proponents criticize CDT (from within the “success-first” tradition). This is because agents that are reflectively inconsistent in the manner of CDT—ones that take actions they know they’ll regret taking, wish they were following a different decision rule, etc.—can be money-pumped and can otherwise lose arbitrary amounts of value.
A human following CDT should endorse “stop following CDT,” since CDT isn’t self-endorsing. It’s not even that they should endorse “keep following CDT, but adopt a heuristic or sub-rule that helps us better achieve CDT ends”; they need to completely abandon CDT even at the meta-level of “what sort of decision rule should I follow?” and modify themselves into purely following an entirely new decision rule, or else they’ll continue to perform poorly by CDT’s lights.
The decision rule that CDT does endorse loses a lot of the apparent elegance and naturalness of CDT. This rule, “son-of-CDT”, is roughly:
Have whatever disposition-to-act gets the most utility, unless I’m in future situations like “a twin prisoner’s dilemma against a perfect copy of my future self where the copy was forked from me before I started following this rule”, in which case ignore my correlation with that particular copy and make decisions as though our behavior is independent (while continuing to take into account my correlation with any copies of myself I end up in prisoner’s dilemmas with that were copied from my brain after I started following this rule).
The fact that CDT doesn’t endorse itself (while other theories do), the fact that it needs self-modification abilities in order to perform well by its own lights (and other theories don’t), and the fact that the theory it endorses is a strange Frankenstein theory (while there are simpler, cleaner theories available) would all be strikes against CDT on their own.
But this decision rule CDT endorses also still performs suboptimally (from the perspective of success-first decision theory). See the discussion of the Retro Blackmail Problem in “Toward Idealized Decision Theory”, where “CDT and any decision procedure to which CDT would self-modify see losing money to the blackmailer as the best available action.”
In the kind of voting dilemma where a coalition of UDT agents will coordinate to achieve higher-utility outcomes, an agent who became a son-of-CDT agent at age 20 will coordinate with the group insofar as she expects her decision to be correlated with other agents’ due to events that happened after she turned 20 (such as “the summer after my 20th birthday, we hung out together and converged a lot in how we think about voting theory”). But she’ll refuse to coordinate for reasons like “we hung out a lot the summer before my 20th birthday”, “we spent our whole childhoods and teen years living together and learning from the same teachers”, and “we all have similar decision-making faculties due to being members of the same species”. There’s no principled reason to draw this temporal distinction; it’s just an artifact of the fact that we started from CDT, and CDT is a flawed decision theory.
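The arithmetic behind this kind of correlated coordination can be sketched with a toy twin prisoner's dilemma (the payoff matrix and correlation level are my own illustrative assumptions):

```python
# Standard prisoner's dilemma payoffs to "me" (illustrative numbers):
PAYOFF = {
    ("C", "C"): 3, ("C", "D"): 0,   # I cooperate
    ("D", "C"): 5, ("D", "D"): 1,   # I defect
}

def ev_correlated(my_move: str, corr: float) -> float:
    """Expected payoff if the twin mirrors my move with probability corr."""
    other = "D" if my_move == "C" else "C"
    return (corr * PAYOFF[(my_move, my_move)]
            + (1 - corr) * PAYOFF[(my_move, other)])

def ev_independent(my_move: str, p_coop: float) -> float:
    """Expected payoff if I treat the twin's move as independent of mine."""
    return (p_coop * PAYOFF[(my_move, "C")]
            + (1 - p_coop) * PAYOFF[(my_move, "D")])

# Taking a strong correlation into account, cooperating wins:
assert ev_correlated("C", corr=0.95) > ev_correlated("D", corr=0.95)

# Treating the twin as independent (as son-of-CDT does for correlations
# that predate its adoption), defecting dominates whatever I expect:
for p in (0.0, 0.5, 1.0):
    assert ev_independent("D", p) > ev_independent("C", p)
```

The point of the sketch is that whether coordination looks rational depends entirely on whether the correlation is counted, which is why son-of-CDT's temporal cutoff produces the arbitrary-seeming behavior described above.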
Regarding the third distinction (“prescribing a certain kind of output vs. prescribing a step-by-step mental procedure for achieving that kind of output”), I’d say that it’s primarily the criterion of rightness that MIRI-cluster researchers care about. This is part of why the paper is called “Functional Decision Theory” and not (e.g.) “Algorithmic Decision Theory”: the focus is explicitly on “what outcomes do you produce?”, not on how you produce them.
(Thus, an FDT agent can cooperate with another agent whenever the latter agent’s input-output relations match FDT’s prescription in the relevant dilemmas, regardless of what computations they do to produce those outputs.)
The main reasons I think academic decision theory should spend more time coming up with algorithms that satisfy their decision rules are that (a) this has a track record of clarifying what various decision rules actually prescribe in different dilemmas, and (b) this has a track record of helping clarify other issues in the “understand what good reasoning is” project (e.g., logical uncertainty) and how they relate to decision theory.
The second distinction here is most closely related to the one I have in mind, although I wouldn’t say it’s the same. Another way to express the distinction I have in mind is that it’s between (a) a normative claim and (b) a process of making decisions.
“Hedonistic utilitarianism is correct” would be a non-decision-theoretic example of (a). “Making decisions on the basis of coinflips” would be an example of (b).
In the context of decision theory, of course, I am thinking of R_CDT as an example of (a) and P_CDT as an example of (b).
I now have the sense I’m probably not doing a good job of communicating what I have in mind, though.
I guess my view here is that exploring normative claims will probably only be pretty indirectly useful for understanding “how decision-making works,” since normative claims don’t typically seem to have any empirical/mathematical/etc. implications. For example, to again use a non-decision-theoretic example, I don’t think that learning that hedonistic utilitarianism is true would give us much insight into the computer science or cognitive science of decision-making. Although we might have different intuitions here.
I agree that this is a worthwhile goal and that philosophers can probably contribute to it. I guess I’m just not sure that the question that most academic decision theorists are trying to answer—and the literature they’ve produced on it—will ultimately be very relevant.
The fact that R_CDT is “self-effacing”—i.e. the fact that it doesn’t always recommend following P_CDT—definitely does seem like a point of intuitive evidence against R_CDT.
But I think R_UDT also has an important point in its disfavor. It fails to satisfy what might be called the “Don’t Make Things Worse Principle,” which says that: It’s not rational to take decisions that will definitely make things worse. Will’s Bomb case is an example of a case where R_UDT violates this principle, which is very similar to his “Guaranteed Payoffs Principle.”
There’s then a question of which of these considerations is more relevant, when judging which of the two normative theories is more likely to be correct. The failure of R_UDT to satisfy the “Don’t Make Things Worse Principle” seems more important to me, but I don’t really know how to argue for this point beyond saying that this is just my intuition. I think that the failure of R_UDT to satisfy this principle—or something like it—is also probably the main reason why many philosophers find it intuitively implausible.
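For concreteness, the tradeoff in a Bomb-style case can be sketched numerically (the error rate and payoff values are my own illustrative stand-ins, not taken from Will's actual setup):

```python
# Bomb-style payoff sketch; all numbers are illustrative assumptions.
EPSILON = 1e-24   # probability the near-perfect predictor errs
DEATH = -1e9      # disvalue of taking Left when the bomb is there
FEE = -100        # cost of taking Right

# Ex ante (before the prediction is made), the bomb ends up in Left
# only when the predictor errs about a committed Left-taker:
ev_left_policy = EPSILON * DEATH
ev_right_policy = FEE

# The Left-taking policy wins in expectation, which is R_UDT's verdict:
assert ev_left_policy > ev_right_policy

# But conditional on actually seeing the bomb in Left, taking Left
# definitely makes things worse: the "Don't Make Things Worse" intuition.
assert DEATH < FEE
```

The two assertions pull in opposite directions, which is exactly the disagreement between the policy-level and act-level evaluations being discussed here.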
(IIRC the first part of Reasons and Persons is mostly a defense of the view that the correct theory of rationality may be self-effacing. But I’m not really familiar with the state of arguments here.)
I actually don’t think the son-of-CDT agent, in this scenario, will take these sorts of non-causal correlations into account at all. (Modifying just yourself to take non-causal correlations into account won’t cause you to achieve better outcomes here.) So I don’t think there should be any weird “Frankenstein” decision procedure thing going on.
…Thinking more about it, though, I’m now less sure how much the different normative decision theories should converge in their recommendations about AI design. I think they all agree that we should build systems that one-box in Newcomb-style scenarios. I think they also agree that, if we’re building twins, then we should design these twins to cooperate in twin prisoner’s dilemmas. But there may be some other contexts where acausal cooperation considerations do lead to genuine divergences. I don’t have very clear/settled thoughts about this, though.
I think “Don’t Make Things Worse” is a plausible principle at first glance.
One argument against this principle is that CDT endorses following it if you must, but would prefer to self-modify to stop following it (since doing so has higher expected causal utility). The general policy of following the “Don’t Make Things Worse Principle” makes things worse.
Once you’ve already adopted son-of-CDT, which says something like “act like UDT in future dilemmas insofar as the correlations were produced after I adopted this rule, but act like CDT in those dilemmas insofar as the correlations were produced before I adopted this rule”, it’s not clear to me why you wouldn’t just go: “Oh. CDT has lost the thing I thought made it appealing in the first place, this ‘Don’t Make Things Worse’ feature. If we’re going to end up stuck with UDT plus extra theoretical ugliness and loss-of-utility tacked on top, then why not just switch to UDT full stop?”
A more general argument against the Bomb intuition pump is that it involves trading away larger amounts of utility in most possible world-states, in order to get a smaller amount of utility in the Bomb world-state. From Abram Demski’s comments:
And:
This just seems to be the point that R_CDT is self-effacing: It says that people should not follow P_CDT, because following other decision procedures will produce better outcomes in expectation.
I definitely agree that R_CDT is self-effacing in this way (at least in certain scenarios). The question is just whether self-effacingness or failure to satisfy “Don’t Make Things Worse” is more relevant when trying to judge the likelihood of a criterion of rightness being correct. I’m not sure whether it’s possible to do much here other than present personal intuitions.
The point that R_UDT violates the “Don’t Make Things Worse” principle only infrequently seems relevant, but I’m still not sure this changes my intuitions very much.
I may just be missing something, but I don’t see what this theoretical ugliness is. And I don’t intuitively find the ugliness/elegance of the decision procedure recommended by a criterion of rightness to be very relevant when trying to judge whether the criterion is correct.
[[EDIT: Just an extra thought on the fact that R_CDT is self-effacing. My impression is that self-effacingness is typically regarded as a relatively weak reason to reject a moral theory. For example, a lot of people regard utilitarianism as self-effacing both because it’s costly to directly evaluate the utility produced by actions and because others often react poorly to people who engage in utilitarian-style reasoning—but this typically isn’t regarded as a slam-dunk reason to believe that utilitarianism is false. I think the SEP article on consequentialism is expressing a pretty mainstream position when it says: “[T]here is nothing incoherent about proposing a decision procedure that is separate from one’s criterion of the right.… Criteria can, thus, be self-effacing without being self-refuting.” Insofar as people don’t tend to buy self-effacingness as a slam-dunk argument against the truth of moral theories, it’s not clear why they should buy it as a slam-dunk argument against the truth of normative decision theories.]]
Sorry to drop in in the middle of this back and forth, but I am curious—do you think it’s quite likely that there is a single criterion of rightness that is objectively “correct”?
It seems to me that we have a number of intuitive properties (meta criteria of rightness?) that we would like a criterion of rightness to satisfy (e.g. “don’t make things worse”, or “don’t be self-effacing”). And so far there doesn’t seem to be any single criterion that satisfies all of them.
So why not just conclude that, similar to the case with voting and Arrow’s theorem, perhaps there’s just no single perfect criterion of rightness.
In other words, once we agree that CDT doesn’t make things worse, but that UDT is better as a general policy, is there anything left to argue about about which is “correct”?
EDIT: Decided I had better go and read your Realism and Rationality post, and ended up leaving a lengthy comment there.
Happy to be dropped in on :)
I think it’s totally conceivable that no criterion of rightness is correct (e.g. because the concept of a “criterion of rightness” turns out to be some spooky bit of nonsense that doesn’t really map onto anything in the real world.)
I suppose the main things I’m arguing are just that:
When a philosopher expresses support for a “decision theory,” they are typically saying that they believe some claim about what the correct criterion of rightness is.
Claims about the correct criterion of rightness are distinct from decision procedures.
Therefore, when a member of the rationalist community uses the word “decision theory” to refer to a decision procedure, they are talking about something that’s pretty conceptually distinct from what philosophers typically have in mind. Discussions about what decision procedure performs best or about what decision procedure we should build into future AI systems [[EDIT: or what decision procedure most closely matches our preferences about decision procedures]] don’t directly speak to the questions that most academic “decision theorists” are actually debating with one another.
I also think that, conditional on there being a correct criterion of rightness, R_CDT is more plausible than R_UDT. But this is a relatively tentative view. I’m definitely not a super hardcore R_CDT believer.
I guess here—in almost definitely too many words—is how I think about the issue here. (Hopefully these comments are at least somewhat responsive to your question.)
It seems like the following general situation is pretty common: Someone is initially inclined to think that anything with property P will also have properties Q1 and Q2. But then they realize that properties Q1 and Q2 are inconsistent with one another.
One possible reaction to this situation is to conclude that nothing actually has property P. Maybe the idea of property P isn’t even conceptually coherent and we should stop talking about it (while continuing to independently discuss properties Q1 and Q2). Often the more natural reaction, though, is to continue to believe that some things have property P—but just drop the assumption that these things will also have both property Q1 and property Q2.
This is obviously a pretty abstract description, so I’ll give a few examples. (No need to read the examples if the point seems obvious.)
Ethics: I might initially be inclined to think that it’s always ethical (property P) to maximize happiness and that it’s always unethical to torture people. But then I may realize that there’s an inconsistency here: in at least rare circumstances, such as ticking time-bomb scenarios where torture can extract crucial information, there may be no decision that is both happiness maximizing (Q1) and torture-avoiding (Q2). It seems like a natural reaction here is just to drop either the belief that maximizing happiness is always ethical or that torture is always unethical. It doesn’t seem like I need to abandon my belief that some actions have the property of being ethical.
Theology: I might initially be inclined to think that God is all-knowing, all-powerful, and all-good. But then I might come to believe (whether rightly or not) that, given the existence of evil, these three properties are inconsistent. I might then continue to believe that God exists, but just drop my belief that God is all-good. (To very awkwardly re-express this in the language of properties: This would mean dropping my belief that any entity that has the property of being God also has the property of being all-good.)
Politician-bashing: I might initially be inclined to characterize some politician both as an incompetent leader and as someone who’s successfully carrying out an evil long-term plan to transform the country. Then I might realize that these two characterizations are in tension with one another. A pretty natural reaction, then, might be to continue to believe the politician exists—but just drop my belief that they’re incompetent.
To turn to the case of the decision-theoretic criterion of rightness, I might initially be inclined to think that the correct criterion of rightness will satisfy both “Don’t Make Things Worse” and “No Self-Effacement.” It’s now become clear, though, that no criterion of rightness can satisfy both of these principles. I think it’s pretty reasonable, then, to continue to believe that there’s a correct criterion of rightness—but just drop the belief that the correct criterion of rightness will also satisfy “No Self-Effacement.”
Thanks! This is helpful.
I think I disagree with the claim (or implication) that keeping P is more often more natural. Well, you’re just saying it’s “often” natural, and I suppose it’s natural in some cases and not others. But I think we may disagree on how often it’s natural, though hard to say at this very abstract level. (Did you see my comment in response to your Realism and Rationality post?)
In particular, I’m curious what makes you optimistic about finding a “correct” criterion of rightness. In the case of the politician, it seems clear that learning they don’t have some of the properties you thought shouldn’t call into question whether they exist at all.
But for the case of a criterion of rightness, my intuition (informed by the style of thinking in my comment), is that there’s no particular reason to think there should be one criterion that obviously fits the bill. Your intuition seems to be the opposite, and I’m not sure I understand why.
My best guess, particularly informed by reading through footnote 15 on your Realism and Rationality post, is that when faced with ethical dilemmas (like your torture vs lollipop examples), it seems like there is a correct answer. Does that seem right?
(I realize at this point we’re talking about intuitions and priors on a pretty abstract level, so it may be hard to give a good answer.)
Hey again!
I appreciated your comment on the LW post. I started writing up a response to this comment and your LW one, back when the thread was still active, and then stopped because it had become obscenely long. Then I ended up badly needing to procrastinate doing something else today. So here’s an over-long document I probably shouldn’t have written, which you are under no social obligation to read.
Thanks! Just read it.
I think there’s a key piece of your thinking that I don’t quite understand / disagree with, and it’s the idea that normativity is irreducible.
I think I follow you that if normativity were irreducible, then it wouldn’t be a good candidate for abandonment or revision. But that seems almost like begging the question. I don’t understand why it’s irreducible.
Suppose normativity is not actually one thing, but is a jumble of 15 overlapping things that sometimes come apart. This doesn’t seem like it poses any challenge to your intuitions from footnote 6 in the document (starting with “I personally care a lot about the question: ‘Is there anything I should do, and, if so, what?’”). And at the same time it explains why there are weird edge cases where the concept seems to break down.
So few things in life seem to be irreducible. (E.g. neither Eric nor Ben is irreducible!) So why would normativity be?
[You also should feel under no social obligation to respond, though it would be fun to discuss this the next time we find ourselves at the same party, should such a situation arise.]
This is a good discussion! Ben, thank you for inspiring so many of these different paths we’ve been going down. :) At some point the hydra will have to stop growing, but I do think the intuitions you’ve been sharing are widespread enough that it’s very worthwhile to have public discussion on these points.
On the contrary:
MIRI is more interested in identifying generalizations about good reasoning (“criteria of rightness”) than in fully specifying a particular algorithm.
MIRI does discuss decision algorithms in order to better understand decision-making, but this isn’t different in kind from the ordinary way decision theorists hash things out. E.g., the traditional formulation of CDT is underspecified in dilemmas like Death in Damascus. Joyce and Arntzenius’ response to this wasn’t to go “algorithms are uncouth in our field”; it was to propose step-by-step procedures that they think capture the intuitions behind CDT and give satisfying recommendations for how to act.
MIRI does discuss “what decision procedure performs best”, but this isn’t any different from traditional arguments in the field like “naive EDT is wrong because it performs poorly in the smoking lesion problem”. Compared to the average decision theorist, the average rationalist puts somewhat more weight on some considerations and less weight on others, but this isn’t different in kind from the ordinary disagreements that motivate different views within academic decision theory, and these disagreements about what weight to give categories of consideration are themselves amenable to argument.
As I noted above, MIRI is primarily interested in decision theory for the sake of better understanding the nature of intelligence, optimization, embedded agency, etc., not for the sake of picking a “decision theory we should build into future AI systems”. Again, this doesn’t seem unlike the case of philosophers who think that decision theory arguments will help them reach conclusions about the nature of rationality.
Could you give an example of what the correctness of a meta-criterion like “Don’t Make Things Worse” could in principle consist in?
I’m not looking here for a “reduction” in the sense of a full translation into other, simpler terms. I just want a way of making sense of how human brains can tell what’s “decision-theoretically normative” in cases like this.
Human brains didn’t evolve to have a primitive “normativity detector” that beeps every time a certain thing is Platonically Normative. Rather, different kinds of normativity can be understood by appeal to unmysterious matters like “things brains value as ends”, “things that are useful for various ends”, “things that accurately map states of affairs”...
When I think of other examples of normativity, my sense is that in every case there’s at least one good account of why a human might be able to distinguish “truly” normative things from non-normative ones. E.g. (considering both epistemic and non-epistemic norms):
1. If I discover two alien species who disagree about the truth-value of “carbon atoms have six protons”, I can evaluate their correctness by looking at the world and seeing whether their statement matches the world.
2. If I discover two alien species who disagree about the truth value of “pawns cannot move backwards in chess” or “there are statements in the language of Peano arithmetic that can neither be proved nor disproved in Peano arithmetic”, then I can explain the rules of ‘proving things about chess’ or ‘proving things about PA’ as a symbol game, and write down strings of symbols that collectively constitute a ‘proof’ of the statement in question.
I can then assert that if any member of any species plays the relevant ‘proof’ game using the same rules, from now until the end of time, they will never prove the negation of my result, and (paper, pen, time, and ingenuity allowing) they will always be able to re-prove my result.
(I could further argue that these symbol games are useful ones to play, because various practical tasks are easier once we’ve accumulated enough knowledge about legal proofs in certain games. This usefulness itself provides a criterion for choosing between “follow through on the proof process” and “just start doodling things or writing random letters down”.)
The above doesn’t answer questions like “do the relevant symbols have Platonic objects as truthmakers or referents?”, or “why do we live in a consistent universe?”, or the like. But the above answer seems sufficient for rejecting any claim that there’s something pointless, epistemically suspect, or unacceptably human-centric about affirming Gödel’s first incompleteness theorem. The above is minimally sufficient grounds for going ahead and continuing to treat math as something more significant than theology, regardless of whether we then go on to articulate a more satisfying explanation of why these symbol games work the way they do.
3. If I discover two alien species who disagree about the truth-value of “suffering is terminally valuable”, then I can think of at least two concrete ways to evaluate which parties are correct. First, I can look at the brains of a particular individual or group, see what that individual or group terminally values, and see whether the statement matches what’s encoded in those brains. Commonly the group I use for this purpose is human beings, such that if an alien (or a housecat, etc.) terminally values suffering, I say that this is “wrong”.
Alternatively, I can make different “wrong” predicates for each species: wrong_human, wrong_alien1, wrong_alien2, wrong_housecat, etc.
This has the disadvantage of maybe making it sound like all these values are on “equal footing” in an internally inconsistent way (“it’s wrong to put undue weight on what’s wrong_human!”, where the first “wrong” is secretly standing in for “wrong_human”), but has the advantage of making it easy to see why the aliens’ disagreement might be important and substantive, while still allowing that aliens’ normative claims can be wrong (because they can be mistaken about their own core values).
The details of how to go from a brain to an encoding of “what’s right” seem incredibly complex and open to debate, but it seems beyond reasonable dispute that if the information content of a set of terminal values is encoded anywhere in the universe, it’s going to be in brains (or constructs from brains) rather than in patterns of interstellar dust, digits of pi, physical laws, etc.
If a criterion like “Don’t Make Things Worse” deserves a lot of weight, I want to know what that weight is coming from.
If the answer is “I know it has to come from something, but I don’t know what yet”, then that seems like a perfectly fine placeholder answer to me.
If the answer is “This is like the ‘terminal values’ case, in that (I hypothesize) it’s just an ineradicable component of what humans care about”, then that also seems structurally fine, though I’m extremely skeptical of the claim that the “warm glow of feeling causally efficacious” is important enough to outweigh other things of great value in the real world.
If the answer is “I think ‘Don’t Make Things Worse’ is instrumentally useful, i.e., more useful than UDT for achieving the other things humans want in life”, then I claim this is just false. But, again, this seems like the right kind of argument to be making; if CDT is better than UDT, then that betterness ought to consist in something.
I mostly agree with this. I think the disagreement between CDT and FDT/UDT advocates is less about definitions, and more about which of these things feels more compelling:
1. On the whole, FDT/UDT ends up with more utility.
(I think this intuition tends to hold more force with people the more emotionally salient “more utility” is to you. E.g., consider a version of Newcomb’s problem where two-boxing gets you $100, while one-boxing gets you $100,000 and saves your child’s life.)
2. I’m not the slave of my decision theory, or of the predictor, or of any environmental factor; I can freely choose to do anything in any dilemma, and by choosing to not leave money on the table (e.g., in a transparent Newcomb problem with a 1% chance of predictor failure where I’ve already observed that the second box is empty), I’m “getting away with something” and getting free utility that the FDT agent would miss out on.
(I think this intuition tends to hold more force with people the more emotionally salient it is to imagine the dollars sitting right there in front of you and you knowing that it’s “too late” for one-boxing to get you any more utility in this world.)
There are other considerations too, like how much it matters to you that CDT isn’t self-endorsing. CDT prescribes self-modifying in all future dilemmas so that you behave in a more UDT-like way. It’s fine to say that you personally lack the willpower to follow through once you actually get into the dilemma and see the boxes sitting in front of you; but it’s still the case that a sufficiently disciplined and foresightful CDT agent will generally end up behaving like FDT in the very dilemmas that have been cited to argue for CDT.
If a more disciplined and well-prepared version of you would have one-boxed, then isn’t there something off about saying that two-boxing is in any sense “correct”? Even the act of praising CDT seems a bit self-destructive here, inasmuch as (a) CDT prescribes ditching CDT, and (b) realistically, praising or identifying with CDT is likely to make it harder for a human being to follow through on switching to son-of-CDT (as CDT prescribes).
Mind you, if the sentence “CDT is the most rational decision theory” is true in some substantive, non-trivial, non-circular sense, then I’m inclined to think we should acknowledge this truth, even if it makes it a bit harder to follow through on the EDT+CDT+UDT prescription to one-box in strictly-future Newcomblike problems. When the truth is inconvenient, I tend to think it’s better to accept that truth than to linguistically conceal it.
But the arguments I’ve seen for “CDT is the most rational decision theory” to date have struck me as either circular, or as reducing to “I know CDT doesn’t get me the most utility, but something about it just feels right”.
It’s fine, I think, if “it just feels right” is meant to be a promissory note for some forthcoming account — a clue that there’s some deeper reason to favor CDT, though we haven’t discovered it yet. As the FDT paper puts it:
On the other hand, if “it just feels right” is meant to be the final word on why “CDT is the most rational decision theory”, then I feel comfortable saying that “rational” is a poor choice of word here, and neither maps onto a key descriptive category nor maps onto any prescription or norm worthy of being followed.
My impression is that most CDT advocates who know about FDT think FDT is making some kind of epistemic mistake, where the most popular candidate (I think) is some version of magical thinking.
Superstitious people often believe that it’s possible to directly causally influence things across great distances of time and space. At a glance, FDT’s prescription (“one-box, even though you can’t causally affect whether the box is full”) as well as its account of how and why this works (“you can somehow ‘control’ the properties of abstract objects like ‘decision functions’”) seem weird and spooky in the manner of a superstition.
FDT’s response: if a thing seems spooky, that’s a fine first-pass reason to be suspicious of it. But at some point, the accusation of magical thinking has to cash out in some sort of practical, real-world failure—in the case of decision theory, some systematic loss of utility that isn’t balanced by an equal, symmetric loss of utility from CDT. After enough experience of seeing a tool outperforming the competition in scenario after scenario, at some point calling the use of that tool “magical thinking” starts to ring rather hollow. At that point, it’s necessary to consider the possibility that FDT is counter-intuitive but correct (like Einstein’s “spukhafte Fernwirkung”), rather than magical.
In turn, FDT advocates tend to think the following reflects an epistemic mistake by CDT advocates:
The alleged mistake here is a violation of naturalism. Humans tend to think of themselves as free Cartesian agents acting upon the world, rather than as deterministic subprocesses of a larger deterministic process. If we consistently and whole-heartedly accepted the “deterministic subprocess” view of our decision-making, we would find nothing strange about the idea that it’s sometimes right for this subprocess to do locally incorrect things for the sake of better global results.
E.g., consider the transparent Newcomb problem with a 1% chance of predictor error. If we think of the brain’s decision-making as a rule-governed system whose rules we are currently determining (via a meta-reasoning process that is itself governed by deterministic rules), then there’s nothing strange about enacting a rule that gets us $1M in 99% of outcomes and $0 in 1% of outcomes; and following through when the unlucky 1% scenario hits us is nothing to agonize over, it’s just a consequence of the rule we already decided. In that regard, steering the rule-governed system that is your brain is no different than designing a factory robot that performs well enough in 99% of cases to offset the 1% of cases where something goes wrong.
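The arithmetic in the paragraph above can be sketched as follows (assuming, for simplicity, that the predictor errs with probability 0.01 regardless of which rule you install):

```python
# Minimal sketch of the transparent-Newcomb arithmetic, under the simplifying
# assumption that the predictor errs with probability 0.01 independent of
# which rule the agent installs.

ERR = 0.01

# Rule "always one-box": the box is full 99% of the time (predictor correctly
# foresaw one-boxing) and empty 1% of the time (predictor error).
ev_one_box = (1 - ERR) * 1_000_000 + ERR * 0

# Rule "always two-box": the box is empty 99% of the time (you get $1,000),
# and full 1% of the time (you get $1,001,000).
ev_two_box = (1 - ERR) * 1_000 + ERR * 1_001_000

print(ev_one_box, ev_two_box)  # the one-boxing rule wins in expectation
```

So enacting the one-boxing rule yields $990,000 in expectation versus $11,000 for the two-boxing rule, which is the sense in which following through in the unlucky 1% case is “just a consequence of the rule we already decided.”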
(Note how a lot of these points are more intuitive in CS language. I don’t think it’s a coincidence that people coming from CS were able to improve on academic decision theory’s ideas on these points; I think it’s related to what kinds of stumbling blocks get in the way of thinking in these terms.)
Suppose you initially tell yourself:
Suppose that you then find yourself facing the 1%-likely outcome where Omega leaves the box empty regardless of your choice. You then have a change of heart and decide to two-box after all, taking the $1000.
I claim that the above description feels from the inside like your brain is escaping the iron chains of determinism (even if your scientifically literate system-2 verbal reasoning fully recognizes that you’re a deterministic process). And I claim that this feeling (plus maybe some reluctance to fully accept the problem description as accurate?) is the only thing that makes CDT’s decision seem reasonable in this case.
In reality, however, if we end up not following through on our verbal commitment and we two-box in that 1% scenario, then this would just prove that we’d been mistaken about what rule we had successfully installed in our brains. As it turns out, we were really following the lower-global-utility rule from the outset. A lack of follow-through or a failure of will is itself a part of the decision-making process that Omega is predicting; however much it feels as though a last-minute swerve is you “getting away with something”, it’s really just you deterministically following through on an algorithm that will get you less utility in 99% of scenarios (while happening to be bad at predicting your own behavior and bad at following through on verbalized plans).
I should emphasize that the above is my own attempt to characterize the intuitions behind CDT and FDT, based on the arguments I’ve seen in the wild and based on what makes me feel more compelled by CDT, or by FDT. I could easily be wrong about the crux of disagreement between some CDT and FDT advocates.
Is the following a roughly accurate re-characterization of the intuition here?
“Suppose that there’s an agent that implements P_UDT. Because it is following P_UDT, when it enters the box room it finds a ton of money in the first box and then refrains from taking the money in the second box. People who believe R_CDT claim that the agent should have also taken the money in the second box. But, given that the universe is deterministic, this doesn’t really make sense. From before the moment the agent entered the room, it was already determined that the agent would one-box. Since (in a physically deterministic sense) the P_UDT agent could not have two-boxed, there’s no relevant sense in which the agent should have two-boxed.”
If so, then I suppose my first reaction is that this seems like a general argument against normative realism rather than an argument against any specific proposed criterion of rightness. It also applies, for example, to the claim that a P_CDT agent “should have” one-boxed—since in a physically deterministic sense it could not have. Therefore, I think it’s probably better to think of this as an argument against the truth (and possibly conceptual coherence) of both R_CDT and R_UDT, rather than an argument that favors one over the other.
In general, it seems to me like all statements that evoke counterfactuals have something like this problem. For example, it is physically determined what sort of decision procedure we will build into any given AI system; only one choice of decision procedure is physically consistent with the state of the world at the time the choice is made. So—insofar as we accept this kind of objection from determinism—there seems to be something problematically non-naturalistic about discussing what “would have happened” if we built in one decision procedure or another.
No, I don’t endorse this argument. To simplify the discussion, let’s assume that the Newcomb predictor is infallible. FDT agents, CDT agents, and EDT agents each get a decision: two-box (which gets you $1000 plus an empty box), or one-box (which gets you $1,000,000 and leaves the $1000 behind). Obviously, insofar as they are in fact following the instructions of their decision theory, there’s only one possible outcome; but it would be odd to say that a decision stops being a decision just because it’s determined by something. (What’s the alternative?)
I do endorse “given the predictor’s perfect accuracy, it’s impossible for the P_UDT agent to two-box and come away with $1,001,000”. I also endorse “given the predictor’s perfect accuracy, it’s impossible for the P_CDT agent to two-box and come away with $1,001,000″. Per the problem specification, no agent can two-box and get $1,001,000 or one-box and get $0. But this doesn’t mean that no decision is made; it just means that the predictor can predict the decision early enough to fill the boxes accordingly.
(Notably, the agent following P_CDT two-boxes because $1,001,000 > $1,000,000 and $1000 > $0, even though this “dominance” argument appeals to two outcomes that are known to be impossible just from the problem statement. I certainly don’t think agents “should” try to achieve outcomes that are impossible from the problem specification itself. The reason non-CDT agents get more utility than CDT agents in Newcomb’s problem is that they take into account that the predictor is a predictor when they construct their counterfactuals.)
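One way to see the point about the dominance argument is to enumerate the four (prediction, decision) pairs and note that the problem statement rules out the two mismatched ones (payoffs are the standard illustrative ones):

```python
# Sketch: with an infallible predictor, only outcomes where the prediction
# matches the decision are consistent with the problem statement. The two
# outcomes the dominance argument appeals to are exactly the mismatched ones.

payoff = {
    # (prediction, decision) -> total payoff
    ("one-box", "one-box"): 1_000_000,
    ("one-box", "two-box"): 1_001_000,  # impossible: prediction != decision
    ("two-box", "one-box"): 0,          # impossible: prediction != decision
    ("two-box", "two-box"): 1_000,
}

# Keep only the outcomes consistent with perfect prediction.
consistent = {k: v for k, v in payoff.items() if k[0] == k[1]}
print(consistent)
# Only two outcomes survive: one-boxers get $1,000,000, two-boxers get $1,000.
```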
In the transparent version of this dilemma, the agent who sees the $1M and one-boxes also “could have two-boxed”, but if they had two-boxed, it would only have been after making a different observation. In that sense, if the agent has any lingering uncertainty about what they’ll choose, the uncertainty goes away as soon as they see whether the box is full.
No, there’s nothing non-naturalistic about this. Consider the scenario you and I are in. Simplifying somewhat, we can think of ourselves as each doing meta-reasoning to try to choose between different decision algorithms to follow going forward; where the new things we learn in this conversation are themselves a part of that meta-reasoning.
The meta-reasoning process is deterministic, just like the object-level decision algorithms are. But this doesn’t mean that we can’t choose between object-level decision algorithms. Rather, the meta-reasoning (in spite of having deterministic causes) chooses either “I think I’ll follow P_FDT from now on” or “I think I’ll follow P_CDT from now on”. Then the chosen decision algorithm (in spite of also having deterministic causes) outputs choices about subsequent actions to take. Meta-processes that select between decision algorithms (to put into an AI, or to run in your own brain, or to recommend to other humans, etc.) can make “real decisions”, for exactly the same reason (and in exactly the same sense) that the decision algorithms in question can make real decisions.
It isn’t problematic that all these processes require us to consider counterfactuals that (if we were omniscient) we would perceive as inconsistent/impossible. Deliberation, both at the object level and at the meta level, just is the process of determining the unique and only possible decision. Yet because we are uncertain about the outcome of the deliberation while deliberating, and because the details of the deliberation process do determine our decision (even as these details themselves have preceding causes), it feels from the inside of this process as though both options are “live”, are possible, until the very moment we decide.
(See also Decisions are for making bad outcomes inconsistent.)
I think you need to make a clearer distinction here between “outcomes that don’t exist in the universe’s dynamics” (like taking both boxes and receiving $1,001,000) and “outcomes that can’t exist in my branch” (like there not being a bomb in the unlucky case). Because if you’re operating just in the branch you find yourself in, many outcomes whose probability an FDT agent is trying to affect are impossible from the problem specification (once you include observations).
And, to be clear, I do think agents “should” try to achieve outcomes that are impossible from the problem specification including observations, if certain criteria are met, in a way that basically lines up with FDT, just like agents “should” try to achieve outcomes that are already known to have happened from the problem specification including observations.
As an example, if you’re in Parfit’s Hitchhiker, you should pay once you reach town, even though reaching town has probability 1 in cases where you’re deciding whether or not to pay, and the reason for this is because it was necessary for reaching town to have had probability 1.
+1, I agree with all this.
Suppose that we accept the principle that agents never “should” try to achieve outcomes that are impossible from the problem specification—with one implication being that it’s false that (as R_CDT suggests) agents that see a million dollars in the first box “should” two-box.
This seems to imply that it’s also false that (as R_UDT suggests) an agent that sees that the first box is empty “should” one-box. By the problem specification, of course, one-boxing when there is no money in the first box is also an impossible outcome. Since decisions to two-box only occur when the first box is empty, this would then imply that decisions to two-box are never irrational in the context of this problem. But I imagine you don’t want to say that.
I think I probably still don’t understand your objection here—so I’m not sure this point is actually responsive to it—but I initially have trouble seeing what potential violations of naturalism/determinism R_CDT could be committing that R_UDT would not also be committing.
(Of course, just to be clear, both R_UDT and R_CDT imply that the decision to commit yourself to a one-boxing policy at the start of the game would be rational. They only diverge in their judgments of what actual in-room boxing decision would be rational. R_UDT says that the decision to two-box is irrational and R_CDT says that the decision to one-box is irrational.)
That should be “a one-boxing policy”, right?
Yep, thanks for the catch! Edited to fix.
It seems to me like they’re coming down to saying something like: the “Guaranteed Payoffs Principle” / “Don’t Make Things Worse Principle” is more core to rational action than being self-consistent. Whereas others think self-consistency is more important.
It’s not clear to me that the justification for CDT is more circular than the justification for FDT. Doesn’t it come down to which principles you favor?
Maybe you could say FDT is more elegant. Or maybe that it satisfies more of the intuitive properties we’d hope for from a decision theory (where elegance might be one of those). But I’m not sure that would make the justification less-circular per se.
I guess one way the justification for CDT could be more circular is if the key or only principle that pushes in favor of it over FDT can really just be seen as a restatement of CDT in a way that the principles that push in favor of FDT do not. Is that what you would claim?
The main argument against CDT (in my view) is that it tends to get you less utility (regardless of whether you add self-modification so it can switch to other decision theories). Self-consistency is a secondary issue.
FDT gets you more utility than CDT. If you value literally anything in life more than you value “which ritual do I use to make my decisions?”, then you should go with FDT over CDT; that’s the core argument.
This argument for FDT would be question-begging if CDT proponents rejected utility as a desirable thing. But instead CDT proponents who are familiar with FDT agree utility is a positive, and either (a) they think there’s no meaningful sense in which FDT systematically gets more utility than CDT (which I think is adequately refuted by Abram Demski), or (b) they think that CDT has other advantages that outweigh the loss of utility (e.g., CDT feels more intuitive to them).
The latter argument for CDT isn’t circular, but as a fan of utility (i.e., of literally anything else in life), it seems very weak to me.
I do think the argument ultimately needs to come down to an intuition about self-effacingness.
The fact that agents earn less expected utility if they implement P_CDT than if they implement some other decision procedure seems to support the claim that agents should not implement P_CDT.
But there’s nothing logically inconsistent about believing both (a) that R_CDT is true and (b) that agents should not implement P_CDT. To again draw an analogy with a similar case, there’s also nothing logically inconsistent about believing both (a) that utilitarianism is true and (b) that agents should not in general make decisions by carrying out utilitarian reasoning.
So why shouldn’t I believe that R_CDT is true? The argument needs an additional step. And it seems to me like the most natural additional step here involves an intuition that the criterion of rightness would not be self-effacing.
More formally, it seems like the argument needs to be something along these lines:
1. Over their lifetimes, agents who implement P_CDT earn less expected utility than agents who implement certain other decision procedures.
2. (Assumption) Agents should implement whatever decision procedure will earn them the most expected lifetime utility.
3. Therefore, agents should not implement P_CDT.
4. (Assumption) The criterion of rightness is not self-effacing. Equivalently, if agents should not implement some decision procedure P_X, then it is not the case that R_X is true.
5. Therefore, as an implication of points (3) and (4), R_CDT is not true.
Whether you buy the “No Self-Effacement” assumption in Step 4, or instead the countervailing “Don’t Make Things Worse” assumption that supports R_CDT, seems to ultimately be a matter of intuition. At least, I don’t currently know what else people can appeal to here to resolve the disagreement.
[[SIDENOTE: Step 2 is actually a bit ambiguous, since it doesn’t specify how expected lifetime utility is being evaluated. For example, are we talking about expected lifetime utility from a causal or evidential perspective? But I don’t think this ambiguity matters much for the argument.]]
[[SECOND SIDENOTE: I’m using the phrase “self-effacing” rather than “self-contradictory” here, because I think it’s more standard and because “self-contradictory” seems to suggest logical inconsistency.]]
If the thing being argued for is “R_CDT plus P_SONOFCDT”, then that makes sense to me, but is vulnerable to all the arguments I’ve been making: Son-of-CDT is in a sense the worst of both worlds, since it gets less utility than FDT and lacks CDT’s “Don’t Make Things Worse” principle.
If the thing being argued for is “R_CDT plus P_FDT”, then I don’t understand the argument. In what sense is P_FDT compatible with, or conducive to, R_CDT? What advantage does this have over “R_FDT plus P_FDT”? (Indeed, what difference between the two views would be intended here?)
The argument against “R_CDT plus P_SONOFCDT” doesn’t require any mention of self-effacingness; it’s entirely sufficient to note that P_SONOFCDT gets less utility than P_FDT.
The argument against “R_CDT plus P_FDT” seems to demand some reference to self-effacingness or inconsistency, or triviality / lack of teeth. But I don’t understand what this view would mean or why anyone would endorse it (and I don’t take you to be endorsing it).
We want to evaluate actual average utility rather than expected utility, since the different decision theories are different theories of what “expected utility” means.
Hm, I think I may have misinterpreted your previous comment as emphasizing the point that P_CDT “gets you less utility” rather than the point that P_SONOFCDT “gets you less utility.” So my comment was aiming to explain why I don’t think the fact that P_CDT gets less utility provides a strong challenge to the claim that R_CDT is true (unless we accept the “No Self-Effacement Principle”). But it sounds like you might agree that this fact doesn’t on its own provide a strong challenge.
In response to the first argument alluded to here: “Gets the most [expected] utility” is ambiguous, as I think we’ve both agreed.
My understanding is that P_SONOFCDT is definitionally the policy that, if an agent decided to adopt it, would cause the largest increase in expected utility. So—if we evaluate the expected utility of a decision to adopt a policy from a causal perspective—it seems to me that P_SONOFCDT “gets the most expected utility.”
If we evaluate the expected utility of a policy from an evidential or subjunctive perspective, however, then another policy may “get the most utility” (because policy adoption decisions may be non-causally correlated).
Apologies if I’m off-base, but it reads to me like you might be suggesting an argument along these lines:
1. R_CDT says that it is rational to decide to follow a policy that would not maximize “expected utility” (defined in evidential/subjunctive terms).
2. (Assumption) But it is not rational to decide to follow a policy that would not maximize “expected utility” (defined in evidential/subjunctive terms).
3. Therefore, R_CDT is not true.
The natural response to this argument is that it’s not clear why we should accept the assumption in Step 2. R_CDT says that the rationality of a decision depends on its “expected utility” defined in causal terms. So someone starting from the position that R_CDT is true obviously won’t accept the assumption in Step 2. R_EDT and R_FDT say that the rationality of a decision depends on its “expected utility” defined in evidential or subjunctive terms. So we might allude to R_EDT or R_FDT to justify the assumption, but of course this would also mean arguing backwards from the conclusion that the argument is meant to reach.
Overall, at least this particular simple argument—that R_CDT is false because P_SONOFCDT gets less “expected utility” as defined in evidential/quasi-evidential terms—would seemingly fail due to circularity. But you may have in mind a different argument.
I felt confused by this comment. Doesn’t even R_FDT judge the rationality of a decision by its expected value (rather than its actual value)? And presumably you don’t want to say that someone who accepts unpromising gambles and gets lucky (ending up with high actual average utility) has made more “rational” decisions than someone who accepts promising gambles and gets unlucky (ending up with low actual average utility)?
You also correctly point out that the decision procedure that R_CDT implies agents should rationally commit to—P_SONOFCDT—sometimes outputs decisions that definitely make things worse. So “Don’t Make Things Worse” implies that some of the decisions outputted by P_SONOFCDT are irrational.
But I still don’t see what the argument is here unless we’re assuming “No Self-Effacement.” It still seems to me like we have a few initial steps and then a missing piece.
1. (Observation) R_CDT implies that it is rational to commit to following the decision procedure P_SONOFCDT.
2. (Observation) P_SONOFCDT sometimes outputs decisions that definitely make things worse.
3. (Assumption) It is irrational to take decisions that definitely make things worse. In other words, the “Don’t Make Things Worse” Principle is true.
4. Therefore, as an implication of Step 2 and Step 3, P_SONOFCDT sometimes outputs irrational decisions.
5. ???
6. Therefore, R_CDT is false.
The “No Self-Effacement” Principle is equivalent to the principle that: If a criterion of rightness implies that it is rational to commit to a decision procedure, then that decision procedure only produces rational actions. So if we were to assume “No Self-Effacement” in Step 5 then this would allow us to arrive at the conclusion that R_CDT is false. But if we’re not assuming “No Self-Effacement,” then it’s not clear to me how we get there.
Actually, in the context of this particular argument, I suppose we don’t really have the option of assuming that “No Self-Effacement” is true—because this assumption would be inconsistent with the earlier assumption that “Don’t Make Things Worse” is true. So I’m not sure it’s actually possible to make this argument schema work in any case.
There may be a pretty different argument here, which you have in mind. I at least don’t see it yet though.
Perhaps the argument is something like:
1. “Don’t make things worse” (DMTW) is one of the intuitions that leads us to favor R_CDT.
2. But the actual policy that R_CDT recommends does not in fact follow DMTW.
3. So R_CDT only gets intuitive appeal from DMTW to the extent that DMTW was about R_′s, and not about P_′s.
4. But intuitions are probably(?) not that precisely targeted, so R_CDT shouldn’t get to claim the full intuitive endorsement of DMTW. (Yes, DMTW endorses it more than it endorses R_FDT, but R_CDT is still at least somewhat counter-intuitive when judged against the DMTW intuition.)
Here are two logically inconsistent principles that could be true:
Don’t Make Things Worse: If a decision would definitely make things worse, then taking that decision is not rational.
Don’t Commit to a Policy That In the Future Will Sometimes Make Things Worse: It is not rational to commit to a policy that, in the future, will sometimes output decisions that definitely make things worse.
I have strong intuitions that the first one is true. I have much weaker (comparatively negligible) intuitions that the second one is true. Since they’re mutually inconsistent, I reject the second and accept the first. I imagine this is also true of most other people who are sympathetic to R_CDT.
One could argue that R_CDT sympathizers don’t actually have much stronger intuitions regarding the first principle than the second—i.e. that their intuitions aren’t actually very “targeted” on the first one—but I don’t think that would be right. At least, it’s not right in my case.
A more viable strategy might be to argue for something like a meta-principle:
The ‘Don’t Make Things Worse’ Meta-Principle: If you find “Don’t Make Things Worse” strongly intuitive, then you should also find “Don’t Commit to a Policy That In the Future Will Sometimes Make Things Worse” just about as intuitive.
If the meta-principle were true, then I guess this would sort of imply that people’s intuitions in favor of “Don’t Make Things Worse” should be self-neutralizing. They should come packaged with equally strong intuitions for another position that directly contradicts it.
But I don’t see why the meta-principle should be true. At least, my intuitions in favor of the meta-principle are way less strong than my intuitions in favor of “Don’t Make Things Worse” :)
Just to say slightly more on this, I think the Bomb case is again useful for illustrating my (I think not uncommon) intuitions here.
Bomb Case: Omega puts a million dollars in a transparent box if he predicts you’ll open it. He puts a bomb in the transparent box if he predicts you won’t open it. He’s only wrong about one in a trillion times.
Now suppose you enter the room and see that there’s a bomb in the box. You know that if you open the box, the bomb will explode and you will die a horrible and painful death. If you leave the room and don’t open the box, then nothing bad will happen to you. You’ll return to a grateful family and live a full and healthy life. You understand all this. You want so badly to live. You then decide to walk up to the bomb and blow yourself up.
Intuitively, this decision strikes me as deeply irrational. You’re intentionally taking an action that you know will cause a horrible outcome that you want badly to avoid. It feels very relevant that you’re flagrantly violating the “Don’t Make Things Worse” principle.
Now, let’s step back a time step. Suppose you know that you’re the sort of person who would refuse to kill yourself by detonating the bomb. You might decide that—since Omega is such an accurate predictor—it’s worth taking a pill to turn yourself into the sort of person who would open the box, to increase your odds of getting a million dollars. You recognize that this may lead you, in the future, to take an action that makes things worse in a horrifying way. But you calculate that the decision you’re making now is nonetheless making things better in expectation.
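A rough sketch of the ex-ante calculation being described here (the one-in-a-trillion error rate is from the problem statement; the dollar-equivalent disvalue assigned to death is a purely illustrative assumption):

```python
# Sketch of the Bomb-case arithmetic: expected value of committing, in
# advance, to each policy. The predictor errs with probability 1e-12; the
# (negative) dollar value assigned to dying is a hypothetical placeholder.

ERR = 1e-12
DEATH = -100_000_000  # illustrative dollar-equivalent disvalue of death

# Policy "always open": almost surely predicted, so the box holds $1,000,000;
# on a predictor error the box holds a bomb and opening it kills you.
ev_open = (1 - ERR) * 1_000_000 + ERR * DEATH

# Policy "never open": Omega puts a bomb in the box, you walk away with nothing.
ev_never_open = 0

print(ev_open > ev_never_open)  # True: the ex-ante commitment looks rational
```

The commitment looks good ex ante for any disvalue of death smaller in magnitude than about a quintillion dollars, which is why the time-zero decision feels rational even though it can later lead to the horrifying in-the-room action.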
This decision strikes me as pretty intuitively rational. You’re violating the second principle—the “Don’t Commit to a Policy...” Principle—but this violation just doesn’t seem that intuitively relevant or remarkable to me. I personally feel like there is nothing too odd about the idea that it can be rational to commit to violating principles of rationality in the future.
(This is obviously just a description of my own intuitions, as they stand, though.)
By triggering the bomb, you’re making things worse from your current perspective, but making things better from the perspective of earlier you. Doesn’t that seem strange and deserving of an explanation? The explanation from a UDT perspective is that by updating upon observing the bomb, you actually changed your utility function. You used to care about both the possible worlds where you end up seeing a bomb in the box, and the worlds where you don’t. After updating, you think you’re either a simulation within Omega’s prediction so your action has no effect on yourself or you’re in the world with a real bomb, and you no longer care about the version of you in the world with a million dollars in the box, and this accounts for the conflict/inconsistency.
Given the human tendency to change our (UDT-)utility functions by updating, it’s not clear what to do (or what is right), and I think this reduces UDT’s intuitive appeal and makes it less of a slam-dunk over CDT/EDT. But it seems to me that it takes switching to the UDT perspective to even understand the nature of the problem. (Quite possibly this isn’t adequately explained in MIRI’s decision theory papers.)
I would agree that, with these two principles as written, more people would agree with the first. (And certainly believe you that that’s right in your case.)
But I feel like the second doesn’t quite capture what I had in mind regarding the DMTW intuition applied to P_′s.
Consider an alternate version:
Or alternatively:
It seems to me that these two claims are naively intuitive on their face, in roughly the same way that the ”… then taking that decision is not rational.” version is. And it’s only after you’ve considered prisoners’ dilemmas or Newcomb’s paradox, etc. that you realize that good policy (or being a rational agent) actually diverges from what’s rational in the moment.
(But maybe others would disagree on how intuitive these versions are.)
EDIT: And to spell out my argument a bit more: if several alternate formulations of a principle are each intuitively appealing, and it turns out that whether some claim (e.g. R_CDT is true) is consistent with the principle comes down to the precise formulation used, then it’s not quite fair to say that the principle fully endorses the claim and that the claim is not counter-intuitive from the perspective of the original intuition.
Of course, this argument is moot if it’s true that the original DMTW intuition was always about rational in-the-moment action, and never about policies or actors. And maybe that’s the case? But I think it’s a little more ambiguous with the ”… is not good policy” or “a rational person would not...” versions than with the “Don’t commit to a policy...” version.
EDIT2: Does what I’m trying to say make sense? (I felt like I was struggling a bit to express myself in this comment.)
Just as a quick sidenote:
I’ve been thinking of P_SONOFCDT as, by definition, the decision procedure that R_CDT implies that it is rational to commit to implementing.
If we define P_SONOFCDT this way, then anyone who believes that R_CDT is true must also believe that it is rational to implement P_SONOFCDT.
The belief that R_CDT is true and the belief that it is rational to implement P_FDT would then be consistent only if P_SONOFCDT is equivalent to P_FDT (which of course it isn’t). So I would be inclined to say that no one should believe in both the correctness of R_CDT and the rationality of implementing P_FDT.
[[EDIT: Actually, I need to distinguish between the decision procedure that it would be rational to commit to yourself and the decision procedure that it would be rational to build into an agent. These can sometimes be different. For example, suppose that R_CDT is true and that you’re building twin AI systems and you would like them both to succeed. Then it would be rational for you to give them decision procedures that will cause them to cooperate if they face each other in a prisoner’s dilemma (e.g. some version of P_FDT). But if R_CDT is true and you’ve just been born into the world as one of the twins, it would be rational for you to commit to a decision procedure that would cause you to defect if you face the other AI system in a prisoner’s dilemma (i.e. P_SONOFCDT). I slightly edited the above comment to reflect this. My tentative view—which I’ve alluded to above—is that the various proposed criteria of rightness don’t in practice actually diverge all that much when it comes to the question of what sorts of decision procedures we should build into AI systems. Although I also understand that MIRI is not mainly interested in the question of what sorts of decision procedures we should build into AI systems.]]
Do you mean
It seems to better fit the pattern of the example just prior.
This is similar to how you described it here:
This seems like it should instead be a 2x2 grid: something can be either normative or non-normative, and if it’s normative, it can be either an algorithm/procedure that’s being recommended, or a criterion of rightness like “a decision is rational iff taking it would cause the largest expected increase in value” (which we can perhaps think of as generalizing over a set of algorithms, and saying all the algorithms in a certain set are “normative” or “endorsed”).
Some of your discussion above seems to be focusing on the “algorithmic?” dimension, while other parts seem focused on “normative?”. I’ll say more about “normative?” here.
The reason I proposed the three distinctions in my last comment and organized my discussion around them is that I think they’re pretty concrete and crisply defined. It’s harder for me to accidentally switch topics or bundle two different concepts together when talking about “trying to optimize vs. optimizing as a side-effect”, “directly optimizing vs. optimizing via heuristics”, “initially optimizing vs. self-modifying to optimize”, or “function vs. algorithm”.
In contrast, I think “normative” and “rational” can mean pretty different things in different contexts, it’s easy to accidentally slide between different meanings of them, and their abstractness makes it easy to lose track of what’s at stake in the discussion.
E.g., “normative” is often used in the context of human terminal values, and it’s in this context that statements like this ring obviously true:
If we’re treating decision-theoretic norms as being like moral norms, then sure. I think there are basically three options:
Decision theory isn’t normative.
Decision theory is normative in the way that “murder is bad” or “improving aggregate welfare is good” is normative, i.e., it expresses an arbitrary terminal value of human beings.
Decision theory is normative in the way that game theory, probability theory, Boolean logic, the scientific method, etc. are normative (at least for beings that want accurate beliefs); or in the way that the rules and strategies of chess are normative (at least for beings that want to win at chess); or in the way that medical recommendations are normative (at least for beings that want to stay healthy).
Probability theory has obvious normative force in the context of reasoning and decision-making, but it’s not therefore arbitrary or irrelevant to understanding human cognition, AI, etc.
A lot of the examples you’ve cited are theories from moral philosophy about what’s terminally valuable. But decision theory is generally thought of as the study of how to make the right decisions, given a set of terminal preferences; it’s not generally thought of as the study of which decision-making methods humans happen to terminally prefer to employ. So I would put it in category 1 or 3.
You could indeed define an agent that terminally values making CDT-style decisions, but I don’t think most proponents of CDT or EDT would claim that their disagreement with UDT/FDT comes down to a values disagreement like that. Rather, they’d claim that rival decision theorists are making some variety of epistemic mistake. (And I would agree that the disagreement comes down to one party or the other making an epistemic mistake, though I obviously disagree about who’s mistaken.)
In the twin prisoner’s dilemma with son-of-CDT, both agents are following son-of-CDT and neither is following CDT (regardless of whether the fork happened before or after the switchover to son-of-CDT).
I think you can model the voting dilemma the same way, just with noise added because the level of correlation is imperfect and/or uncertain. Ten agents following the same decision procedure are trying to decide whether to stay home and watch a movie (which gives a small guaranteed benefit) or go to the polls (which costs them the utility of the movie, but gains them a larger utility iff the other nine agents go to the polls too). Ten FDT agents will vote in this case, if they know that the other agents will vote under similar conditions.
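To make the divergence concrete, here’s a minimal sketch of that voting dilemma in Python. The payoff numbers and the 5% probability are my own illustrative assumptions, not anything from the comment above; the point is only that a CDT-style calculation can favor staying home while an FDT-style calculation favors voting.

```python
# Sketch of the ten-voter dilemma. All payoff numbers are illustrative assumptions.

MOVIE = 1.0   # small guaranteed utility of staying home
WIN = 10.0    # larger utility each agent gets iff all ten vote

def ev_cdt(p_other_nine_vote):
    """CDT-style: my choice doesn't change the other nine agents' behavior."""
    ev_vote = p_other_nine_vote * WIN  # forgo the movie, hope the others show up
    ev_stay = MOVIE                    # the election benefit requires my vote too
    return ("vote" if ev_vote > ev_stay else "stay", ev_vote, ev_stay)

def ev_fdt(correlation=1.0):
    """FDT-style: the other nine run (nearly) the same decision procedure,
    so "I vote" and "they vote" are treated as one logical choice."""
    ev_vote = correlation * WIN  # if I vote, so do they (up to the correlation)
    ev_stay = MOVIE
    return ("vote" if ev_vote > ev_stay else "stay", ev_vote, ev_stay)

print(ev_cdt(0.05))  # CDT stays home when it thinks the others probably won't all vote
print(ev_fdt())      # FDT votes, since the ten decisions stand or fall together
```

With an assumed 5% chance that the other nine all vote regardless, the CDT-style expected value of voting (0.5) loses to the movie (1.0); under full correlation the FDT-style calculation compares 10 to 1 and votes.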
[[Disclaimer: I’m not sure this will be useful, since it seems like most discussions that verge on meta-ethics end up with neither side properly understanding the other.]]
I think the kind of decision theory that philosophers tend to work on is typically explicitly described as “normative.” (For example, the SEP article on decision theory is about “normative decision theory.”) So when I’m talking about “academic decision theories” or “proposed criteria of rightness” I’m talking about normative theories. When I use the word “rational” I’m also referring to a normative property.
I don’t think there’s any very standard definition of what it means for something to be normative, maybe because it’s often treated as something pretty close to a primitive concept, but a partial account is that a “normative theory” is a claim about what someone should do. At least this is what I have in mind. This is different from the second option you list (and I think the third one).
Some normative theories concern “ends.” These are basically claims about what people should do, if they can freely choose outcomes. For example: A subjectivist theory might say that people should maximize the fulfillment of their own personal preferences (whatever they are). Whereas a hedonistic utilitarian theory might say that people should maximize total happiness. I’m not sure what the best terminology is, and think this choice is probably relatively non-standard, but let’s label these “moral theories.”
Some normative theories, including “decision theories,” concern “means.” These theories put aside the question of which ends people should pursue and instead focus on how people should respond to uncertainty about the results/implications of their actions. For example: Expected utility theory says that people should take whatever actions maximize expected fulfillment of the relevant ends. Risk-weighted expected utility theory (and other alternative theories) say different things. Typical versions of CDT and EDT flesh out expected utility theory in different ways to specify what the relevant measure of “expected fulfillment” is.
Moral theory and normative decision theory seem to me to have pretty much the same status. They are both bodies of theory that bear on what people should do. On some views, the division between them is more a matter of analytic convenience than anything else. For example, David Enoch, a prominent meta-ethicist, writes: “In fact, I think that for most purposes [the line between the moral and the non-moral] is not a line worth worrying about. The distinction within the normative between the moral and the non-moral seems to me to be shallow compared to the distinction between the normative and the non-normative” (Taking Morality Seriously, 86).
One way to think of moral theories and normative decision theories is as two components that fit together to form more fully specified theories about what people should do. Moral theories describe the ends people should pursue; given these ends, decision theories then describe what actions people should take when in states of uncertainty. To illustrate, two examples of more complete normative theories that combine moral and decision-theoretic components would be: “You should take whatever action would in expectation cause the largest increase in the fulfillment of your preferences” and “You should take whatever action would, if you took it, lead you to anticipate the largest expected amount of future happiness in the world.” The first is subjectivism combined with CDT, while the second is total view hedonistic utilitarianism combined with EDT.
(On this conception, a moral theory is not a description of “an arbitrary terminal value of human beings.” Decision theory here also is not “the study of which decision-making methods humans happen to terminally prefer to employ.” These are both theories about what people should do, rather than theories about what people’s preferences are.)
Normativity is obviously pretty often regarded as a spooky or insufficiently explained thing. So a plausible position is normative anti-realism: It might be the case that no normative claims are true, either because they’re all false or because they’re not even well-formed enough to take on truth values. If normative anti-realism is true, then one thing this means is that the philosophical decision theory community is mostly focused on a question that doesn’t really have an answer.
If I’m someone with a twin and I’m implementing P_CDT, I still don’t think I will choose to modify myself to cooperate in twin prisoner’s dilemmas. The reason is that modifying myself won’t cause my twin to cooperate; it will only cause me to cooperate, lowering the utility I receive.
(The fact that P_CDT agents won’t modify themselves to cooperate with their twins could of course be interpreted as a mark against R_CDT.)
I appreciate you taking the time to lay out these background points, and it does help me better understand your position, Ben; thanks!
Some ancient Greeks thought that the planets were intelligent beings; yet many of the Greeks’ astronomical observations, and some of their theories and predictive tools, were still true and useful.
I think that terms like “normative” and “rational” are underdefined, so the question of realism about them is underdefined (cf. Luke Muehlhauser’s pluralistic moral reductionism).
I would say that (1) some philosophers use “rational” in a very human-centric way, which is fine as long as it’s done consistently; (2) others have a much more thin conception of “rational”, such as ‘tending to maximize utility’; and (3) still others want to have their cake and eat it too, building in a lot of human-value-specific content to their notion of “rationality”, but then treating this conception as though it had the same level of simplicity, naturalness, and objectivity as 2.
I think that type-1, type-2, and type-3 decision theorists have all contributed valuable AI-relevant conceptual progress in the past (most obviously, by formulating Newcomb’s problem, EDT, and CDT), and I think all three could do more of the same in the future. I think the type-3 decision theorists are making a mistake, but often more in the fashion of an ancient astronomer who’s accumulating useful and real knowledge but happens to have some false side-beliefs about the object of study, not in the fashion of a theologian whose entire object of study is illusory. (And not in the fashion of a developmental psychologist or historian whose field of study is too human-centric to directly bear on game theory, AI, etc.)
I’d expect type-2 decision theorists to tend to be interested in more AI-relevant things than type-1 decision theorists, but on the whole I think the flavor of decision theory as a field has ended up being more type-2/3 than type-1. (And in this case, even type-1 analyses of “rationality” can be helpful for bringing various widespread background assumptions to light.)
This is true if your twin was copied from you in the past. If your twin will be copied from you in the future, however, then you can indeed cause your twin to cooperate, assuming you have the ability to modify your own future decision-making so as to follow son-of-CDT’s prescriptions from now on.
Making the commitment to always follow son-of-CDT is an action you can take; the mechanistic causal consequence of this action is that your future brain and any physical systems that are made into copies of your brain in the future will behave in certain systematic ways. So from your present perspective (as a CDT agent), you can causally control future copies of yourself, as long as the act of copying hasn’t happened yet.
(And yes, by the time you actually end up in the prisoner’s dilemma, your future self will no longer be able to causally affect your copy. But this is irrelevant from the perspective of present-you; to follow CDT’s prescriptions, present-you just needs to pick the action that you currently judge will have the best consequences, even if that means binding your future self to take actions contrary to CDT’s future prescriptions.)
(If it helps, don’t think of the copy of you as “you”: just think of it as another environmental process you can influence. CDT prescribes taking actions that change the behavior of future copies of yourself in useful ways, for the same reason CDT prescribes actions that change the future course of other physical processes.)
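A small sketch of that causal logic, under assumed (standard) prisoner’s-dilemma payoffs: a CDT agent only counts the consequences its commitment causally produces, so the answer flips depending on whether the copy is made before or after the commitment.

```python
# Standard prisoner's-dilemma payoffs (my payoff, given my play and the twin's play).
# The numbers are conventional illustrative values, not from the discussion above.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def cdt_commit_decision(copy_made_before_commitment):
    """Should a CDT agent commit now to cooperating with its copies?

    CDT counts only what the commitment causally changes. A pre-existing
    twin is causally untouched by my commitment; a copy made from my
    future (already-committed) self inherits the commitment.
    """
    my_play_if_commit = "C"
    if copy_made_before_commitment:
        twin_play = "D"  # pre-existing twin is unaffected; it defects as before
    else:
        twin_play = my_play_if_commit  # future copy inherits the commitment
    payoff_commit = PAYOFF[(my_play_if_commit, twin_play)]
    payoff_dont = PAYOFF[("D", "D")]  # without the commitment, both CDT agents defect
    return "commit" if payoff_commit > payoff_dont else "don't commit"

print(cdt_commit_decision(copy_made_before_commitment=True))   # don't commit
print(cdt_commit_decision(copy_made_before_commitment=False))  # commit
```

With a pre-existing twin, committing buys mutual-cooperation for the copy but only unilateral cooperation for me (payoff 0 vs. 1), so CDT declines; with a future copy, committing causally produces mutual cooperation (payoff 3 vs. 1), so CDT commits.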
Thank you for taking the time to respond as well! :)
I’m not positive I understand what (1) and (3) are referring to here, but I would say that there’s also at least a fourth way that philosophers often use the word “rational” (which is also the main way I use the word “rational.”) This is to refer to an irreducibly normative concept.
The basic thought here is that not every concept can be usefully described in terms of more primitive concepts (i.e. “reduced”). As a close analogy, a dictionary cannot give useful non-circular definitions of every possible word—it requires the reader to have a pre-existing understanding of some foundational set of words. As a wonkier analogy, if we think of the space of possible concepts as a sort of vector space, then we sort of require an initial “basis” of primitive concepts that we use to describe the rest of the concepts.
Some examples of concepts that are arguably irreducible are “truth,” “set,” “property,” “physical,” “existence,” and “point.” Insofar as we can describe these concepts in terms of slightly more primitive ones, the descriptions will typically fail to be very useful or informative and we will typically struggle to break the slightly more primitive ones down any further.
To focus on the example of “truth,” some people have tried to reduce the concept substantially. Some people have argued, for example, that when someone says that “X is true” what they really mean or should mean is “I personally believe X” or “believing X is good for you.” But I think these suggested reductions pretty obviously don’t entirely capture what people mean when they say “X is true.” The phrase “X is true” also has an important meaning that is not amenable to this sort of reduction.
[[EDIT: “Truth” may be a bad example, since it’s relatively controversial and since I’m pretty much totally unfamiliar with work on the philosophy of truth. But insofar as any concepts seem irreducible to you in this sense, or you buy the more general argument that some concepts will necessarily be irreducible, the particular choice of example used here isn’t essential to the overall point.]]
Some philosophers also employ normative concepts that they say cannot be reduced in terms of non-normative (e.g. psychological) properties. These concepts are said to be irreducibly normative.
For example, here is Parfit on the concept of a normative reason (OWM, p. 1):
When someone says that a concept they are using is irreducible, this is obviously some reason for suspicion. A natural suspicion is that the real explanation for why they can’t give a useful description is that the concept is seriously muddled or fails to grip onto anything in the real world. For example, whether this is fair or not, I have this sort of suspicion about the concept of “dao” in daoist philosophy.
But, again, it will necessarily be the case that some useful and valid concepts are irreducible. So we should sometimes take evocations of irreducible concepts seriously. A concept that is mostly undefined is not always problematically “underdefined.”
When I talk about “normative anti-realism,” I mostly have in mind the position that claims evoking irreducibly normative concepts are never true (either because these claims are all false or because they don’t even have truth values). For example: Insofar as the word “should” is being used in an irreducibly normative sense, there is nothing that anyone “should” do.
[[Worth noting, though: The term “normative realism” is sometimes given a broader definition than the one I’ve sketched here. In particular, it often also includes a position known as “analytic naturalist realism” that denies the relevance of irreducibly normative concepts. I personally feel I understand this position less well and I think sometimes waffle between using the broader and narrower definition of “normative realism.” I also more generally want to stress that not everyone who makes claims about “criterion of rightness” or employs other seemingly normative language is actually a normative realist in the narrow or even broad sense; what I’m doing here is just sketching one common especially salient perspective.]]
One motivation for evoking irreducibly normative concepts is the observation that—in the context of certain discussions—it’s not obvious that there’s any close-to-sensible way to reduce the seemingly normative concepts that are being used.
For example, suppose we follow a suggestion once made by Eliezer to reduce the concept of “a rational choice” to the concept of “a winning choice” (or, in line with the type-2 conception you mention, a “utility-maximizing choice”). It seems difficult to make sense of a lot of basic claims about rationality if we use this reduction—and other obvious alternative reductions don’t seem to fare much better. To mostly quote from a comment I made elsewhere:
FN15 in my post on normative realism elaborates on this point.
At the same time, though, I do think there are also really good and hard-to-counter epistemological objections to the existence of irreducibly normative properties (e.g. the objection described in this paper). You might also find the difficulty of reducing normative concepts a lot less obvious-seeming or problematic than I do. You might think, for example, that the difficulty of reducing “rationality” is less like the difficulty of reducing “truth” (which IMO mainly reflects the fact that truth is an important primitive concept) and more like the difficulty of defining the word “soup” in a way that perfectly matches our intuitive judgments about what counts as “soup” (which IMO mainly reflects the fact that “soup” is a high-dimensional concept). So I definitely don’t want to say normative realism is obviously or even probably right.
I mainly just want to communicate the sort of thing that I think a decent chunk of philosophers have in mind when they talk about a “rational decision” or a “criterion of rightness.” Although, of course, philosophy being philosophy, plenty of people do of course have in mind plenty of different things.
So, as an experiment, I’m going to be a very obstinate reductionist in this comment. I’ll insist that a lot of these hard-seeming concepts aren’t so hard.
Many of them are complicated, in the fashion of “knowledge”—they admit an endless variety of edge cases and exceptions—but these complications are quirks of human cognition and language rather than deep insights into ultimate metaphysical reality. And where there’s a simple core we can point to, that core generally isn’t mysterious.
It may be inconvenient to paraphrase the term away (e.g., because it packages together several distinct things in a nice concise way, or has important emotional connotations, or does important speech-act work like encouraging a behavior). But when I say it “isn’t mysterious”, I mean it’s pretty easy to see how the concept can crop up in human thought even if it doesn’t belong on the short list of deep fundamental cosmic structure terms.
Why is this a fourth way? My natural response is to say that normativity itself is either a messy, parochial human concept (like “love,” “knowledge,” “France”), or it’s not (in which case it goes in bucket 2).
Picking on the concept here that seems like the odd one out to me: I feel confident that there isn’t a cosmic law (of nature, or of metaphysics, etc.) that includes “truth” as a primitive (unless the list of primitives is incomprehensibly long). I could see an argument for concepts like “intentionality/reference”, “assertion”, or “state of affairs”, though the former two strike me as easy to explain in simple physical terms.
Mundane empirical “truth” seems completely straightforward. Then there’s the truth of sentences like “Frodo is a hobbit”, “2+2=4”, “I could have been the president”, “Hamburgers are more delicious than battery acid”… Some of these are easier or harder to make sense of in the naive correspondence model, but regardless, it seems clear that our colloquial use of the word “true” to refer to all these different statements is pre-philosophical, and doesn’t reflect anything deeper than that “each of these sentences at least superficially looks like it’s asserting some state of affairs, and each sentence satisfies the conventional assertion-conditions of our linguistic community”.
I think that philosophers are really good at drilling down on a lot of interesting details and creative models for how we can try to tie these disparate speech-acts together. But I think there’s also a common failure mode in philosophy of treating these questions as deeper, more mysterious, or more joint-carving than the facts warrant. Just because you can argue about the truthmakers of “Frodo is a hobbit” doesn’t mean you’re learning something deep about the universe (or even something particularly deep about human cognition) in the process.
Suppose I build a robot that updates hypotheses based on observations, then selects actions that its hypotheses suggest will help it best achieve some goal. When the robot is deciding which hypotheses to put more confidence in based on an observation, we can imagine it thinking, “To what extent is observation o a [WORD] to believe hypothesis h?” When the robot is deciding whether it assigns enough probability to h to choose an action a, we can imagine it thinking, “To what extent is P(h)=0.7 a [WORD] to choose action a?” As a shorthand, when observation o updates a hypothesis h that favors an action a, the robot can also ask to what extent o itself is a [WORD] to choose a.
When two robots meet, we can moreover add that they negotiate a joint “compromise” goal that allows them to work together rather than fight each other for resources. In communicating with each other, they then start also using “[WORD]” where an action is being evaluated relative to the joint goal, not just the robot’s original goal.
Thus when Robot A tells Robot B “I assign probability 90% to ‘it’s noon’, which is [WORD] to have lunch”, A may be trying to communicate that A wants to eat, or that A thinks eating will serve A and B’s joint goal. (This gets even messier if the robots have an incentive to obfuscate which actions and action-recommendations are motivated by the personal goal vs. the joint goal.)
If you decide to relabel “[WORD]” as “reason”, I claim that this captures a decent chunk of how people use the phrase “a reason”. “Reason” is a suitcase word, but that doesn’t mean there are no similarities between e.g. “data my goals endorse using to adjust the probability of a given hypothesis” and “probabilities-of-hypotheses my goals endorse using to select an action”, or that the similarity is mysterious and ineffable.
(I recognize that the above story leaves out a lot of important and interesting stuff. Though past a certain point, I think the details will start to become Gettier-case nitpicks, as with most concepts.)
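As a toy illustration of the robot story, here is one way the two uses of “[WORD]” might be cashed out numerically. The priors, likelihoods, and utilities in the lunch example are made up for illustration; the sketch only shows how one quantity can function as a “[WORD]” to believe a hypothesis and another as a “[WORD]” to choose an action.

```python
# A minimal version of the robot sketched above. All numbers are illustrative.

def posterior(prior, likelihood_if_h, likelihood_if_not_h):
    """Bayes update: to what extent is observation o a [WORD] to believe h?"""
    numerator = prior * likelihood_if_h
    return numerator / (numerator + (1 - prior) * likelihood_if_not_h)

def endorses_action(p_h, utility_if_h, utility_if_not_h, utility_skip=0.0):
    """To what extent is P(h) = p_h a [WORD] to choose the action,
    relative to the robot's goal? Here: does acting beat skipping in EV?"""
    ev_act = p_h * utility_if_h + (1 - p_h) * utility_if_not_h
    return ev_act > utility_skip

# Robot A: "I assign probability 90% to 'it's noon', which is [WORD] to have lunch."
p_noon = posterior(prior=0.5, likelihood_if_h=0.9, likelihood_if_not_h=0.1)
print(round(p_noon, 2))                                                   # 0.9
print(endorses_action(p_noon, utility_if_h=5.0, utility_if_not_h=-1.0))   # True
```

The “joint goal” variant from the two-robot story would just swap in the negotiated utility function when evaluating `endorses_action`.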
That essay isn’t trying to “reduce” the term “rationality” in the sense of taking a pre-existing word and unpacking or translating it. The essay is saying that what matters is utility, and if a human being gets too invested in verbal definitions of “what the right thing to do is”, they risk losing sight of the thing they actually care about and were originally in the game to try to achieve (i.e., their utility).
Therefore: if you’re going to use words like “rationality”, make sure that the words in question won’t cause you to shoot yourself in the foot and take actions that will end up costing you utility (e.g., costing human lives, costing years of averted suffering, costing money, costing anything or everything). And if you aren’t using “rationality” in a safe “nailed-to-utility” way, make sure that you’re willing to turn on a dime and stop being “rational” the second your conception of rationality starts telling you to throw away value.
“Rationality” is a suitcase word. It refers to lots of different things. On LessWrong, examples include not just “(systematized) winning” but (as noted in the essay) “Bayesian reasoning”, or in Rationality: Appreciating Cognitive Algorithms, “cognitive algorithms or mental processes that systematically produce belief-accuracy or goal-achievement”. In philosophy, the list is a lot longer.
The common denominator seems to largely be “something something reasoning / deliberation” plus (as you note) “something something normativity / desirability / recommendedness / requiredness”.
The idea of “normativity” doesn’t currently seem that mysterious to me either, though you’re welcome to provide perplexing examples. My initial take is that it seems to be a suitcase word containing a bunch of ideas tied to:
Goals/preferences/values, especially overridingly strong ones.
Encouraged, endorsed, mandated, or praised conduct.
Encouraging, endorsing, mandating, and praising are speech-acts that seem very central to how humans perceive and intervene on social situations; and social situations seem pretty central to human cognition overall. So I don’t think it’s particularly surprising if words associated with such loaded ideas would have fairly distinctive connotations and seem to resist reduction, especially reduction that neglects the pragmatic dimensions of human communication and only considers the semantic dimension.
I may write up more object-level thoughts here, because this is interesting, but I just wanted to quickly emphasize the upshot that initially motivated me to write up this explanation.
(I don’t really want to argue here that non-naturalist or non-analytic naturalist normative realism of the sort I’ve just described is actually a correct view; I mainly wanted to give a rough sense of what the view consists of and what leads people to it. It may well be the case that the view is wrong, because all true normative-seeming claims are in principle reducible to claims about things like preferences. I think the comments you’ve just made cover some reasons to suspect this.)
The key point is just that when these philosophers say that “Action X is rational,” they are explicitly reporting that they do not mean “Action X suits my terminal preferences” or “Action X would be taken by an agent following a policy that maximizes lifetime utility” or any other such reduction.
I think that when people are very insistent that they don’t mean something by their statements, it makes sense to believe them. This implies that the question they are discussing—“What are the necessary and sufficient conditions that make a decision rational?”—is distinct from questions like “What decision would an agent that tends to win take?” or “What decision procedure suits my terminal preferences?”
It may be the case that the question they are asking is confused or insensible—because any sensible question would be reducible—but it’s in any case different. So I think it’s a mistake to interpret at least these philosophers’ discussions of “decisions theories” or “criteria of rightness” as though they were discussions of things like terminal preferences or winning strategies. And it doesn’t seem to me like the answer to the question they’re asking (if it has an answer) would likely imply anything much about things like terminal preferences or winning strategies.
[[NOTE: Plenty of decision theorists are not non-naturalist or non-analytic naturalist realists, though. It’s less clear to me how related or unrelated the thing they’re talking about is to issues of interest to MIRI. I think that the conception of rationality I’m discussing here mainly just presents an especially clear case.]]
Just on this point: I think you’re right I may be slightly glossing over certain distinctions, but I might still draw them slightly differently (rather than doing a 2x2 grid). Some different things one might talk about in this context:
Decisions
Decision procedures
The decision procedure that is optimal with regard to some given metric (e.g. the decision procedure that maximizes expected lifetime utility for some particular way of calculating expected utility)
The set of properties that makes a decision rational (“criterion of rightness”)
A claim about what the criterion of rightness is (“normative decision theory”)
The decision procedure that it would be rational to decide to build into an agent (as implied by the criterion of rightness)
(4), (5), and (6) have to do with normative issues, while (1), (2), and (3) can be discussed without getting into normativity.
My current-although-not-firmly-held view is also that (6) probably isn’t very sensitive to what the criterion of rightness is, so in practice can be reasoned about without going too deep into the weeds thinking about competing normative decision theories.
Just want to note that I found the R_ vs P_ distinction to be helpful.
I think using those terms might be useful for getting at the core of the disagreement.