Hilary Greaves: Then as many of your listeners will know, in the space of AI research, people have been throwing around terms like "functional decision theory" and "timeless decision theory" and "updateless decision theory". I think it's a lot less clear exactly what these putative alternatives are supposed to be. The literature on those kinds of decision theories hasn't been written up with the level of precision and rigor that characterizes the discussion of causal and evidential decision theory. So it's a little bit unclear, at least to my likes, whether there's genuinely a competitor to decision theory on the table there, or just some intriguing ideas that might one day in the future lead to a rigorous alternative.
I understand from that that there is little engagement between MIRI and academia. What is more troubling for me is that it seems that the cases for the major decision theories are looked upon with skepticism by academic experts.
Do you think that is really the case?
How do you respond to that?
I would personally feel much better if I knew that there were some academic decision theorists who are excited about your research, or if there were a compelling explanation of a systemic failure that accounts for this and applies specifically to MIRI's work.
[The transition to non-disclosed research happened after the interview]
I'm not really sure what's going on here. When I read critiques of MIRI-style decision theories (eg from Will or from Wolfgang Schwartz), I feel very unpersuaded by them. This leaves me in a situation where my inside views disagree with the views of the most obvious class of experts, which is always tricky.
When I read those criticisms by Will MacAskill and Wolfgang Schwartz, I feel like I understand their criticisms and find them unpersuasive, as opposed to not understanding their criticisms. Also, I feel like they don't understand some of the arguments and motivations for FDT. I feel a lot better disagreeing with experts when I think I understand their arguments and when I think I can see particular mistakes that they're making. (It's not obvious that this is the right epistemic strategy, for reasons well articulated by Gregory Lewis here.)
Paul's comments on this resolved some of my concerns here. He thinks that the disagreement is mostly about what questions decision theory should be answering. He thinks that the updateless decision theories are obviously more suitable to building AI than eg CDT or EDT.
I think it's plausible that Paul is being overly charitable to decision theorists; I'd love to hear whether skeptics of updateless decision theories actually agree that you shouldn't build a CDT agent. (Also, when you ask a CDT agent what kind of decision theory it wants to program into an AI, you get a class of decision theory called "Son of CDT", which isn't UDT.)
I think there's a systematic pattern where philosophers end up being pretty ineffective at answering the philosophy questions that I care about (based eg on my experience seeing the EA community punch so far above its weight thinking about ethics), and so I'm not very surprised if it turns out that in this specific case, the philosophy community has priorities that don't match mine.
I think there's also a pattern where philosophers have some basic disagreements with me, eg about functionalism and how much math intuitions should feed into our philosophical intuitions. This decision theory disagreement reminds me of that disagreement.
Schwartz has a couple of complaints that the FDT paper doesn't engage properly with the mainstream philosophy literature (eg the Justin Fisher and the David Gauthier papers). My guess is that these complaints are completely legitimate.
On his blog, Scott Aaronson does a good job of describing what I think might be a key difference here:
But the basic split between Many-Worlds and Copenhagen (or better: between Many-Worlds and "shut-up-and-calculate" / "QM needs no interpretation" / etc.), I regard as coming from two fundamentally different conceptions of what a scientific theory is supposed to do for you. Is it supposed to posit an objective state for the universe, or be only a tool that you use to organize your experiences?
Also, are the ultimate equations that govern the universe "real," while tables and chairs are "unreal" (in the sense of being no more than fuzzy approximate descriptions of certain solutions to the equations)? Or are the tables and chairs "real," while the equations are "unreal" (in the sense of being tools invented by humans to predict the behavior of tables and chairs and whatever else, while extraterrestrials might use other tools)? Which level of reality do you care about / want to load with positive affect, and which level do you want to denigrate?
My guess is that the factor which explains academic unenthusiasm for our work is that decision theorists are more of the "tables and chairs are real" school than the "equations are real" school -- they aren't as oriented by the question of "how do I write down a decision theory which would have good outcomes if I created an intelligent agent which used it", and they don't have as much of an intuition as I do that that kind of question is fundamentally simple and should have a lot of weight in your choices about how to think about reality.
---
I am really very curious to hear what people (eg edoarad) think of this answer.
I think it's plausible that Paul is being overly charitable to decision theorists; I'd love to hear whether skeptics of updateless decision theories actually agree that you shouldn't build a CDT agent.
FWIW, I could probably be described as a "skeptic" of updateless decision theories; I'm pretty sympathetic to CDT. But I also don't think we should build AI systems that consistently take the actions recommended by CDT. I know at least a few other people who favor CDT, but again (although small sample size) I don't think any of them advocate for designing AI systems that consistently act in accordance with CDT.
I think the main thing that's going on here is that academic decision theorists are primarily interested in normative principles. They're mostly asking the question: "What criterion determines whether or not a decision is 'rational'?" For example, standard CDT claims that an action is rational only if it's the action that can be expected to cause the largest increase in value.
On the other hand, AI safety researchers seem to be mainly interested in a different question: "What sort of algorithm would it be rational for us to build into an AI system?" The first question doesn't seem very relevant to the second one, since the different criteria of rationality proposed by academic decision theorists converge in most cases. For example: No matter whether CDT, EDT, or UDT is correct, it will not typically be rational to build a two-boxing AI system. It seems to me, then, that it's probably not very pressing for the AI safety community to think about the first question or engage with the academic decision theory literature.
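(To make the convergence point concrete, here's a minimal sketch of the designer's calculation in a standard Newcomb setup. The payoff numbers and predictor accuracy are assumed purely for illustration; the point is just that the design choice is made before the prediction, so even a designer reasoning purely causally prefers the one-boxing agent.)

```python
# Assumed Newcomb payoffs: $1M in the opaque box if the predictor expects
# one-boxing, $1K always in the transparent box.
OPAQUE_IF_PREDICTED_ONE_BOX = 1_000_000
TRANSPARENT = 1_000

def expected_payoff_of_design(build_one_boxer: bool, predictor_accuracy: float = 0.99) -> float:
    """Expected winnings of an agent we design *before* the predictor scans it.

    The design choice causally influences the prediction, so even a purely
    causal evaluation of the design decision favors building a one-boxer.
    """
    if build_one_boxer:
        # Predictor almost surely foresees one-boxing, so the opaque box is full.
        return predictor_accuracy * OPAQUE_IF_PREDICTED_ONE_BOX
    # Predictor almost surely foresees two-boxing, so the opaque box is empty.
    return (predictor_accuracy * TRANSPARENT
            + (1 - predictor_accuracy) * (OPAQUE_IF_PREDICTED_ONE_BOX + TRANSPARENT))

print(expected_payoff_of_design(True))   # 990000.0
print(expected_payoff_of_design(False))  # 11000.0
```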
At the same time, though, AI safety writing on decision theory sometimes seems to ignore (or implicitly deny?) the distinction between these two questions. For example: The FDT paper seems to be pitched at philosophers and has an abstract that frames the paper as an exploration of "normative principles." I think this understandably leads philosophers to interpret FDT as an attempt to answer the first question and to criticize it on those grounds.
they aren't as oriented by the question of "how do I write down a decision theory which would have good outcomes if I created an intelligent agent which used it"
I would go further and say that (so far as I understand the field) most academic decision theorists aren't at all oriented by this question. I think the question they're asking is again mostly independent. I'm also not sure it would even make sense to talk about "using" a "decision theory" in this context, insofar as we're conceptualizing decision theories the way most academic decision theorists do (as normative principles). Talking about "using" CDT in this context is sort of like talking about "using" deontology.
[[EDIT: See also this short post for a better description of the distinction between a "criterion of rightness" and a "decision procedure." Another way to express my impression of what's going on is that academic decision theorists are typically talking about criteria of rightness and AI safety decision theorists are typically (but not always) talking about decision procedures.]]
The comments here have been very ecumenical, but I'd like to propose a different account of the philosophy/AI divide on decision theory:
1. "What makes a decision 'good' if the decision happens inside an AI?" and "What makes a decision 'good' if the decision happens inside a brain?" aren't orthogonal questions, or even all that different; they're two different ways of posing the same question.
MIRI's AI work is properly thought of as part of the "success-first decision theory" approach in academic decision theory, described by Greene (2018) (who also cites past proponents of this way of doing decision theory):
[...] Consider a theory that allows the agents who employ it to end up rich in worlds containing both classic and transparent Newcomb Problems. This type of theory is motivated by the desire to draw a tighter connection between rationality and success, rather than to support any particular account of expected utility. We might refer to this type of theory as a "success-first" decision theory.
[...] The desire to create a closer connection between rationality and success than that offered by standard decision theory has inspired several success-first decision theories over the past three decades, including those of Gauthier (1986), McClennen (1990), and Meacham (2010), as well as an influential account of the rationality of intention formation and retention in the work of Bratman (1999). McClennen (1990: 118) writes: "This is a brief for rationality as a positive capacity, not a liability -- as it must be on the standard account." Meacham (2010: 56) offers the plausible principle, "If we expect the agents who employ one decision making theory to generally be richer than the agents who employ some other decision making theory, this seems to be a prima facie reason to favor the first theory over the second." And Gauthier (1986: 182-3) proposes that "a [decision-making] disposition is rational if and only if an actor holding it can expect his choices to yield no less utility than the choices he would make were he to hold any alternative disposition." In slogan form, Gauthier (1986: 187) calls the idea "utility-maximization at the level of dispositions," Meacham (2010: 68-9) a "cohesive" decision theory, McClennen (1990: 6-13) a form of "pragmatism," and Bratman (1999: 66) a "broadly consequentialist justification" of rational norms.
[...] Accordingly, the decision theorist's job is like that of an engineer in inventing decision theories, and like that of a scientist in testing their efficacy. A decision theorist attempts to discover decision theories (or decision "rules," "algorithms," or "processes") and determine their efficacy, under certain idealizing conditions, in bringing about what is of ultimate value.
Someone who holds this view might be called a methodological hypernaturalist, who recommends an experimental approach to decision theory. On this view, the decision theorist is a scientist of a special sort, but their goal should be broadly continuous with that of scientific research. The goal of determining efficacy in bringing about value, for example, is like that of a pharmaceutical scientist attempting to discover the efficacy of medications in treating disease.
For game theory, Thomas Schelling (1960) was a proponent of this view. The experimental approach is similar to what Schelling meant when he called for "a reorientation of game theory" in Part 2 of The Strategy of Conflict. Schelling argues that a tendency to focus on first principles, rather than upshots, makes game-theoretic theorizing shockingly blind to rational strategies in coordination problems.
The FDT paper does a poor job of contextualizing itself because it was written by AI researchers who are less well-versed in the philosophical literature.
MIRI's work is both advocating a particular solution to the question "what kind of decision theory satisfies the 'success' criterion?", and lending some additional support to the claim that "success-first" is a coherent and reasonable criterion for decision theorists to orient towards. (In a world without ideas like UDT, it was harder to argue that we should try to reduce decision theory to "what decision-making approach yields the best utility?", since neither CDT nor EDT strictly outperforms the other; whereas there's a strong case that UDT does strictly outperform both CDT and EDT, to the extent it's possible for any decision theory to strictly outperform another; though there may be even-better approaches.)
You can go with Paul and say that a lot of these distinctions are semantic rather than substantive -- that there isn't a true, ultimate, objective answer to the question of whether we should evaluate decision theories by whether they're successful, vs. some other criterion. But dissolving contentious arguments and showing why they're merely verbal is itself a hallmark of analytic philosophy, so this doesn't do anything to make me think that these issues aren't the proper province of academic decision theory.
2. Rather than operating in separate magisteria, people like Wei Dai are making contrary claims about how humans should make decisions. This is easiest to see in contexts where a future technology comes along: if whole-brain emulation were developed tomorrow and it was suddenly trivial to put CDT proponents in literal twin prisoner's dilemmas, the CDT recommendation to defect (two-box, etc.) suddenly makes a very obvious and real difference.
I claim (as someone who thinks UDT/FDT is correct) that the reason it tends to be helpful to think about advanced technologies is that it draws out the violations of naturalism that are often implicit in how we talk about human reasoning. Our native way of thinking about concepts like "control," "choice," and "counterfactual" tends to be confused, and bringing in things like predictors and copies of our reasoning draws out those confusions in much the same way that sci-fi thought experiments and the development of new technologies have repeatedly helped clarify confused thinking in philosophy of consciousness, philosophy of personal identity, philosophy of computation, etc.
3. Quoting Paul:
Most causal decision theorists would agree that if they had the power to stop doing the right thing, they should stop taking actions which are right. They should instead be the kind of person that you want to be.
And so there, again, I agree it has implications, but I don't think it's a question of disagreement about truth. It's more a question of, like: you're actually making some cognitive decisions. How do you reason? How do you conceptualize what you're doing?
I would argue that most philosophers who feel "trapped by rationality" or "unable to stop doing what's 'right,' even though they know they 'should,'" could in fact escape the trap if they saw the flaws in whatever reasoning process led them to their current idea of "rationality" in the first place. I think a lot of people are reasoning their way into making worse decisions (at least in the future/hypothetical scenarios noted above, though I would be very surprised if correct decision-theoretic views had literally no implications for everyday life today) due to object-level misconceptions about the prescriptions and flaws of different decision theories.
And all of this strikes me as very much the bread and butter of analytic philosophy. Philosophers unpack and critique the implicit assumptions in different ways of modeling the world (e.g., "of course I can 'control' physical outcomes but can't 'control' mathematical facts", or "of course I can just immediately tell that I'm in the 'real world'; a simulation of me isn't me, or wouldn't be conscious, etc."). I think MIRI just isn't very good at dialoguing with philosophers, and has had too many competing priorities to put the amount of effort into a scholarly dialogue that I wish were being made.
4. There will obviously be innumerable practical differences between the first AGI systems and human decision-makers. However, putting a huge amount of philosophical weight on this distinction will tend to violate naturalism: ceteris paribus, changing whether you run a cognitive process in carbon or in silicon doesn't change whether the process is doing the right thing or working correctly.
E.g., the rules of arithmetic are the same for humans and calculators, even though we don't use identical algorithms to answer particular questions. Humans tend to correctly treat calculators naturalistically: we often think of them as an extension of our own brains and reasoning, we freely switch back and forth between running a needed computation in our own brain vs. in a machine, etc. Running a decision-making algorithm in your brain vs. in an AI shouldn't be fundamentally different, I claim.
5. For similar reasons, a naturalistic way of thinking about the task "delegating a decision-making process to a reasoner outside your own brain" will itself not draw a deep philosophical distinction between "a human building an AI to solve a problem" and "an AI building a second AI to solve a problem" or for that matter "an agent learning over time and refining its own reasoning process so it can 'delegate' to its future self".
There will obviously be practical differences, but there will also be practical differences between two different AI designs. We don't assume that switching to a different design within AI means that the background rules of decision theory (or arithmetic, etc.) go out the window.
(Another way of thinking about this is that the distinction between "natural" and "artificial" intelligence is primarily a practical and historical one, not one that rests on a deep truth of computer science or rational agency; a more naturalistic approach would think of humans more as a weird special case of the extremely heterogeneous space of "(A)I" designs.)
"What makes a decision 'good' if the decision happens inside an AI?" and "What makes a decision 'good' if the decision happens inside a brain?" aren't orthogonal questions, or even all that different; they're two different ways of posing the same question.
I actually agree with you about this. I have in mind a different distinction, although I might not be explaining it well.
Here's another go:
Let's suppose that some decisions are rational and others aren't. We can then ask: What is it that makes a decision rational? What are the necessary and/or sufficient conditions? I think that this is the question that philosophers are typically trying to answer. The phrase "decision theory" in this context typically refers to a claim about necessary and/or sufficient conditions for a decision being rational. To use different jargon, in this context a "decision theory" refers to a proposed "criterion of rightness."
When philosophers talk about "CDT," for example, they are typically talking about a proposed criterion of rightness. Specifically, in this context, "CDT" is the claim that a decision is rational only if taking it would cause the largest expected increase in value. To avoid any ambiguity, let's label this claim R_CDT.
We can also talk about "decision procedures." A decision procedure is just a process or algorithm that an agent follows when making decisions.
For each proposed criterion of rightness, it's possible to define a decision procedure that only outputs decisions that fulfill the criterion. For example, we can define P_CDT as a decision procedure that involves only taking actions that R_CDT claims are rational.
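(A toy sketch of the distinction in code. The function names and the table of causal expected values are assumptions made purely for illustration, not anything from the literature: the criterion delivers a verdict on a given action, while the procedure is an algorithm that outputs one.)

```python
# R_CDT as a criterion of rightness: a yes/no verdict on a given action.
def satisfies_r_cdt(action, causal_ev):
    """True iff the action maximizes causal expected value, where 'causal_ev'
    maps each available action to its causal expected value."""
    return causal_ev[action] == max(causal_ev.values())

# P_CDT as a decision procedure: an algorithm that outputs an action.
def p_cdt(causal_ev):
    """Output an action that R_CDT counts as rational."""
    return max(causal_ev, key=causal_ev.get)

# Example: Newcomb's problem, assuming the agent believes the opaque box is
# full; the +$1K causal edge for two-boxing is what drives CDT either way.
causal_ev = {"one-box": 1_000_000, "two-box": 1_001_000}
print(p_cdt(causal_ev))                        # two-box
print(satisfies_r_cdt("one-box", causal_ev))   # False
```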
My understanding is that when philosophers talk about "CDT," they primarily have in mind R_CDT. Meanwhile, it seems like members of the rationalist or AI safety communities primarily have in mind P_CDT.
The difference matters, because people who believe that R_CDT is true don't generally believe that we should build agents that implement P_CDT or that we should commit to following P_CDT ourselves. R_CDT claims that we should do whatever will have the best effects -- and, in many cases, building agents that follow a decision procedure other than P_CDT is likely to have the best effects. More generally: Most proposed criteria of rightness imply that it can be rational to build agents that sometimes behave irrationally.
MIRI's AI work is properly thought of as part of the "success-first decision theory" approach in academic decision theory.
One possible criterion of rightness, which I'll call R_UDT, is something like this: An action is rational only if it would have been chosen by whatever decision procedure would have produced the most expected value if consistently followed over an agent's lifetime. For example, this criterion of rightness says that it is rational to one-box in the transparent Newcomb scenario because agents who consistently follow one-boxing policies tend to do better over their lifetimes.
I could be wrong, but I associate the "success-first approach" with something like the claim that R_UDT is true. This would definitely constitute a really interesting and significant divergence from mainstream opinion within academic decision theory. Academic decision theorists should care a lot about whether or not it's true.
But I'm also not sure if it matters very much, practically, whether R_UDT or R_CDT is true. It's not obvious to me that they recommend building different kinds of decision procedures into AI systems. For example, both seem to recommend building AI systems that would one-box in the transparent Newcomb scenario.
You can go with Paul and say that a lot of these distinctions are semantic rather than substantive -- that there isn't a true, ultimate, objective answer to the question of whether we should evaluate decision theories by whether they're successful, vs. some other criterion.
I disagree that any of the distinctions here are purely semantic. But one could argue that normative anti-realism is true. In this case, there wouldn't really be any such thing as the criterion of rightness for decisions. Neither R_CDT nor R_UDT nor any other proposed criterion would be "correct."
In this case, though, I think there would be even less reason to engage with academic decision theory literature. The literature would be focused on a question that has no real answer.
[[EDIT: Note that Will also emphasizes the importance of the criterion-of-rightness vs. decision-procedure distinction in his critique of the FDT paper: "[T]hey're [most often] asking what the best decision procedure is, rather than what the best criterion of rightness is... But, if that's what's going on, there are a whole bunch of issues to dissect. First, it means that FDT is not playing the same game as CDT or EDT, which are proposed as criteria of rightness, directly assessing acts. So it's odd to have a whole paper comparing them side-by-side as if they are rivals."]]
I agree that these three distinctions are important:
"Picking policies based on whether they satisfy a criterion X" vs. "Picking policies that happen to satisfy a criterion X". (E.g., trying to pick a utilitarian policy vs. unintentionally behaving utilitarianly while trying to do something else.)
"Trying to follow a decision rule Y 'directly' or 'on the object level'" vs. "Trying to follow a decision rule Y by following some other decision rule Z that you think satisfies Y". (E.g., trying to naïvely follow utilitarianism without any assistance from sub-rules, heuristics, or self-modifications, vs. trying to follow utilitarianism by following other rules or mental habits you've come up with that you expected to make you better at selecting utilitarianism-endorsed actions.)
"A decision rule that prescribes outputting some action or policy and doesn't care how you do it" vs. "A decision rule that prescribes following a particular set of cognitive steps that will then output some action or policy". (E.g., a rule that says "maximize the aggregate welfare of moral patients" vs. a specific mental algorithm intended to achieve that end.)
The first distinction above seems less relevant here, since we're mostly discussing AI systems and humans that are self-aware about their decision criteria and explicitly "trying to do what's right".
As a side-note, I do want to emphasize that from the MIRI cluster's perspective, it's fine for correct reasoning in AGI to arise incidentally or implicitly, as long as it happens somehow (and as long as the system's alignment-relevant properties aren't obscured and the system ends up safe and reliable).
The main reason to work on decision theory in AI alignment has never been "What if people don't make AI 'decision-theoretic' enough?" or "What if people mistakenly think CDT is correct and so build CDT into their AI system?" The main reason is that the many forms of weird, inconsistent, and poorly-generalizing behavior prescribed by CDT and EDT suggest that there are big holes in our current understanding of how decision-making works, holes deep enough that we've even been misunderstanding basic things at the level of "decision-theoretic criterion of rightness".
It's not that I want decision theorists to try to build AI systems (even notional ones). It's that there are things that currently seem fundamentally confusing about the nature of decision-making, and resolving those confusions seems like it would help clarify a lot of questions about how optimization works. That's part of why these issues strike me as natural for academic philosophers to take a swing at (while also being continuous with theoretical computer science, game theory, etc.).
The second distinction ("following a rule 'directly' vs. following it by adopting a sub-rule or via self-modification") seems more relevant. You write:
My understanding is that when philosophers talk about "CDT," they primarily have in mind R_CDT. Meanwhile, it seems like members of the rationalist or AI safety communities primarily have in mind P_CDT.
The difference matters, because people who believe that R_CDT is true don't generally believe that we should build agents that implement P_CDT or that we should commit to following P_CDT ourselves.
Far from being a distinction proponents of UDT/FDT neglect, this is one of the main grounds on which UDT/FDT proponents criticize CDT (from within the "success-first" tradition). This is because agents that are reflectively inconsistent in the manner of CDT -- ones that take actions they know they'll regret taking, wish they were following a different decision rule, etc. -- can be money-pumped and can otherwise lose arbitrary amounts of value.
A human following CDT should endorse "stop following CDT," since CDT isn't self-endorsing. It's not even that they should endorse "keep following CDT, but adopt a heuristic or sub-rule that helps us better achieve CDT ends"; they need to completely abandon CDT even at the meta-level of "what sort of decision rule should I follow?" and modify themselves into purely following an entirely new decision rule, or else they'll continue to perform poorly by CDT's lights.
The decision rule that CDT does endorse loses a lot of the apparent elegance and naturalness of CDT. This rule, "son-of-CDT", is roughly:
Have whatever disposition-to-act gets the most utility, unless I'm in future situations like "a twin prisoner's dilemma against a perfect copy of my future self where the copy was forked from me before I started following this rule", in which case ignore my correlation with that particular copy and make decisions as though our behavior is independent (while continuing to take into account my correlation with any copies of myself I end up in prisoner's dilemmas with that were copied from my brain after I started following this rule).
The fact that CDT doesn't endorse itself (while other theories do), the fact that it needs self-modification abilities in order to perform well by its own lights (and other theories don't), and the fact that the theory it endorses is a strange Frankenstein theory (while there are simpler, cleaner theories available) would all be strikes against CDT on their own.
But this decision rule CDT endorses also still performs suboptimally (from the perspective of success-first decision theory). See the discussion of the Retro Blackmail Problem in "Toward Idealized Decision Theory", where "CDT and any decision procedure to which CDT would self-modify see losing money to the blackmailer as the best available action."
In the kind of voting dilemma where a coalition of UDT agents will coordinate to achieve higher-utility outcomes, an agent who became a son-of-CDT agent at age 20 will coordinate with the group insofar as she expects her decision to be correlated with other agents' due to events that happened after she turned 20 (such as "the summer after my 20th birthday, we hung out together and converged a lot in how we think about voting theory"). But she'll refuse to coordinate for reasons like "we hung out a lot the summer before my 20th birthday", "we spent our whole childhoods and teen years living together and learning from the same teachers", and "we all have similar decision-making faculties due to being members of the same species". There's no principled reason to draw this temporal distinction; it's just an artifact of the fact that we started from CDT, and CDT is a flawed decision theory.
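(A minimal sketch of that temporal cutoff, using the twin prisoner's dilemma framing from the son-of-CDT rule above. The payoff numbers and the reduction of "when the correlation arose" to a single timestamp are simplifying assumptions of mine, purely to make the artifact visible:)

```python
# Assumed twin prisoner's dilemma payoffs to "me": both cooperate 3,
# both defect 1, I defect against a cooperator 5, I get exploited 0.
MUTUAL_COOP, MUTUAL_DEFECT, TEMPTATION, SUCKER = 3, 1, 5, 0

def son_of_cdt_cooperates(correlation_formed_at: int, self_modified_at: int) -> bool:
    """Toy rendering of the cutoff: the agent treats the twin's choice as
    correlated with its own only if the correlation arose after the agent
    self-modified away from CDT."""
    if correlation_formed_at >= self_modified_at:
        # Correlation recognized: the twin mirrors me, so I'm effectively
        # choosing between mutual cooperation and mutual defection.
        return MUTUAL_COOP > MUTUAL_DEFECT
    # Correlation ignored: whatever the twin does, defection pays more
    # (TEMPTATION > MUTUAL_COOP and MUTUAL_DEFECT > SUCKER), so defect.
    return False

# A copy forked at age 25 gets cooperation; a twin raised with me since birth
# gets defection -- same correlation, different timestamp.
print(son_of_cdt_cooperates(correlation_formed_at=25, self_modified_at=20))  # True
print(son_of_cdt_cooperates(correlation_formed_at=0, self_modified_at=20))   # False
```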
Regarding the third distinction ("prescribing a certain kind of output vs. prescribing a step-by-step mental procedure for achieving that kind of output"), I'd say that it's primarily the criterion of rightness that MIRI-cluster researchers care about. This is part of why the paper is called "Functional Decision Theory" and not (e.g.) "Algorithmic Decision Theory": the focus is explicitly on "what outcomes do you produce?", not on how you produce them.
(Thus, an FDT agent can cooperate with another agent whenever the latter agent's input-output relations match FDT's prescription in the relevant dilemmas, regardless of what computations they do to produce those outputs.)
The main reasons I think academic decision theory should spend more time coming up with algorithms that satisfy their decision rules are that (a) this has a track record of clarifying what various decision rules actually prescribe in different dilemmas, and (b) this has a track record of helping clarify other issues in the "understand what good reasoning is" project (e.g., logical uncertainty) and how they relate to decision theory.
I agree that these three distinctions are important
"Picking policies based on whether they satisfy a criterion X" vs. "Picking policies that happen to satisfy a criterion X". (E.g., trying to pick a utilitarian policy vs. unintentionally behaving utilitarianly while trying to do something else.)
"Trying to follow a decision rule Y 'directly' or 'on the object level'" vs. "Trying to follow a decision rule Y by following some other decision rule Z that you think satisfies Y". (E.g., trying to naïvely follow utilitarianism without any assistance from sub-rules, heuristics, or self-modifications, vs. trying to follow utilitarianism by following other rules or mental habits you've come up with that you expected to make you better at selecting utilitarianism-endorsed actions.)
"A decision rule that prescribes outputting some action or policy and doesn't care how you do it" vs. "A decision rule that prescribes following a particular set of cognitive steps that will then output some action or policy". (E.g., a rule that says "maximize the aggregate welfare of moral patients" vs. a specific mental algorithm intended to achieve that end.)
The second distinction here is most closely related to the one I have in mind, although I wouldn't say it's the same. Another way to express the distinction I have in mind is that it's between (a) a normative claim and (b) a process of making decisions.
"Hedonistic utilitarianism is correct" would be a non-decision-theoretic example of (a). "Making decisions on the basis of coinflips" would be an example of (b).
In the context of decision theory, of course, I am thinking of R_CDT as an example of (a) and P_CDT as an example of (b).
I now have the sense I'm probably not doing a good job of communicating what I have in mind, though.
The main reason is that the many forms of weird, inconsistent, and poorly-generalizing behavior prescribed by CDT and EDT suggest that there are big holes in our current understanding of how decision-making works, holes deep enough that we've even been misunderstanding basic things at the level of "decision-theoretic criterion of rightness".
I guess my view here is that exploring normative claims will probably only be pretty indirectly useful for understanding "how decision-making works," since normative claims don't typically seem to have any empirical/mathematical/etc. implications. For example, to again use a non-decision-theoretic example, I don't think that learning that hedonistic utilitarianism is true would give us much insight into the computer science or cognitive science of decision-making. Although we might have different intuitions here.
It's that there are things that currently seem fundamentally confusing about the nature of decision-making, and resolving those confusions seems like it would help clarify a lot of questions about how optimization works. That's part of why these issues strike me as natural for academic philosophers to take a swing at (while also being continuous with theoretical computer science, game theory, etc.).
I agree that this is a worthwhile goal and that philosophers can probably contribute to it. I guess I'm just not sure that the question that most academic decision theorists are trying to answer -- and the literature they've produced on it -- will ultimately be very relevant.
The fact that CDT doesn't endorse itself (while other theories do), the fact that it needs self-modification abilities in order to perform well by its own lights (and other theories don't), and the fact that the theory it endorses is a strange Frankenstein theory (while there are simpler, cleaner theories available) would all be strikes against CDT on their own.
The fact that R_CDT is "self-effacing" -- i.e. the fact that it doesn't always recommend following P_CDT -- definitely does seem like a point of intuitive evidence against R_CDT.
But I think R_UDT also has an important point in its disfavor. It fails to satisfy what might be called the "Don't Make Things Worse Principle," which says that it's not rational to take decisions that will definitely make things worse. Will's Bomb case is an example of a case where R_UDT violates this principle, which is very similar to his "Guaranteed Payoffs Principle."
There's then a question of which of these considerations is more relevant, when judging which of the two normative theories is more likely to be correct. The failure of R_UDT to satisfy the "Don't Make Things Worse Principle" seems more important to me, but I don't really know how to argue for this point beyond saying that this is just my intuition. I think that the failure of R_UDT to satisfy this principle -- or something like it -- is also probably the main reason why many philosophers find it intuitively implausible.
(IIRC the first part of Reasons and Persons is mostly a defense of the view that the correct theory of rationality may be self-effacing. But I'm not really familiar with the state of arguments here.)
In the kind of voting dilemma where a coalition of UDT agents will coordinate to achieve higher-utility outcomes, an agent who became a son-of-CDT agent at age 20 will coordinate with the group insofar as she expects her decision to be correlated with other agents' due to events that happened after she turned 20 (such as "the summer after my 20th birthday, we hung out together and converged a lot in how we think about voting theory"). But she'll refuse to coordinate for reasons like "we hung out a lot the summer before my 20th birthday", "we spent our whole childhoods and teen years living together and learning from the same teachers", and "we all have similar decision-making faculties due to being members of the same species". There's no principled reason to draw this temporal distinction; it's just an artifact of the fact that we started from CDT, and CDT is a flawed decision theory.
I actually don't think the son-of-CDT agent, in this scenario, will take these sorts of non-causal correlations into account at all. (Modifying just yourself to take non-causal correlations into account won't cause you to achieve better outcomes here.) So I don't think there should be any weird "Frankenstein" decision procedure thing going on.
... Thinking more about it, though, I'm now less sure how much the different normative decision theories should converge in their recommendations about AI design. I think they all agree that we should build systems that one-box in Newcomb-style scenarios. I think they also agree that, if we're building twins, then we should design these twins to cooperate in twin prisoner's dilemmas. But there may be some other contexts where acausal cooperation considerations do lead to genuine divergences. I don't have very clear/settled thoughts about this, though.
But I think R_UDT also has an important point in its disfavor. It fails to satisfy what might be called the "Don't Make Things Worse Principle," which says that it's not rational to take decisions that will definitely make things worse. Will's Bomb case is an example of a case where R_UDT violates this principle, which is very similar to his "Guaranteed Payoffs Principle."
I think "Don't Make Things Worse" is a plausible principle at first glance.
One argument against this principle is that CDT endorses following it if you must, but would prefer to self-modify to stop following it (since doing so has higher expected causal utility). The general policy of following the "Don't Make Things Worse Principle" makes things worse.
Once you've already adopted son-of-CDT, which says something like "act like UDT in future dilemmas insofar as the correlations were produced after I adopted this rule, but act like CDT in those dilemmas insofar as the correlations were produced before I adopted this rule", it's not clear to me why you wouldn't just go: "Oh. CDT has lost the thing I thought made it appealing in the first place, this 'Don't Make Things Worse' feature. If we're going to end up stuck with UDT plus extra theoretical ugliness and loss-of-utility tacked on top, then why not just switch to UDT full stop?"
A more general argument against the Bomb intuition pump is that it involves trading away larger amounts of utility in most possible world-states, in order to get a smaller amount of utility in the Bomb world-state. From Abram Demski's comments:
[...] In Bomb, the problem clearly stipulates that an agent who follows the FDT recommendation has a trillion trillion to one odds of doing better than an agent who follows the CDT/EDT recommendation. Complaining about the one-in-a-trillion-trillion chance that you get the bomb while being the sort of agent who takes the bomb is, to an FDT-theorist, like a gambler who has just lost a trillion-trillion-to-one bet complaining that the bet doesn't look so rational now that the outcome is known with certainty to be the one-in-a-trillion-trillion case where the bet didn't pay well.
[...] One way of thinking about this is to say that the FDT notion of "decision problem" is different from the CDT or EDT notion, in that FDT considers the prior to be of primary importance, whereas CDT and EDT consider it to be of no importance. If you had instead specified "bomb" with just the certain information that "left" is (causally and evidentially) very bad and "right" is much less bad, then CDT and EDT would regard it as precisely the same decision problem, whereas FDT would consider it to be a radically different decision problem.
Another way to think about this is to say that FDT "rejects" decision problems which are improbable according to their own specification. In cases like Bomb where the situation as described is by its own description a one in a trillion trillion chance of occurring, FDT gives the outcome only one-trillion-trillion-th consideration in the expected utility calculation, when deciding on a strategy.
[...] This also hopefully clarifies the sense in which I don't think the decisions pointed out in (III) are bizarre. The decisions are optimal according to the very probability distribution used to define the decision problem.
There's a subtle point here, though, since Will describes the decision problem from an updated perspective -- you already know the bomb is in front of you. So UDT "changes the problem" by evaluating "according to the prior". From my perspective, because the very statement of the Bomb problem suggests that there were also other possible outcomes, we can rightly insist on evaluating expected utility in terms of those chances.
Perhaps this sounds like an unprincipled rejection of the Bomb problem as you state it. My principle is as follows: you should not state a decision problem without having in mind a well-specified way to predictably put agents into that scenario. Let's call the way-you-put-agents-into-the-scenario the "construction". We then evaluate agents on how well they deal with the construction.
For examples like Bomb, the construction gives us the overall probability distribution -- this is then used for the expected value which UDT's optimality notion is stated in terms of.
For other examples, as discussed in Decisions are for making bad outcomes inconsistent, the construction simply breaks when you try to put certain decision theories into it. This can also be a good thing; it means the decision theory makes certain scenarios altogether impossible.
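(A minimal numeric sketch of the comparison being drawn here. The one-in-a-trillion-trillion error rate and the $100 fee come from Will's case; the finite dollar-equivalent disutility for burning is an assumption of mine purely for illustration, and the qualitative conclusion doesn't depend on its exact value.)

```python
EPSILON = 1e-24          # predictor's error rate: one in a trillion trillion
COST_OF_RIGHT = 100      # the $100 fee for taking Right
BURN_DISUTILITY = 1e6    # assumed finite dollar-equivalent for burning to death

# Evaluating policies "according to the prior", as the FDT/UDT side advocates
# (the predictor puts a bomb in Left only if it predicts you'll take Right):
ev_policy_left = -EPSILON * BURN_DISUTILITY   # bomb in Left only if the predictor errs
ev_policy_right = -COST_OF_RIGHT              # always pay the $100
print(ev_policy_left)    # -1e-18
print(ev_policy_right)   # -100

# Evaluating from the updated perspective, after seeing the bomb in Left,
# as the CDT/EDT side does:
ev_take_left_given_bomb = -BURN_DISUTILITY    # -1000000.0
ev_take_right_given_bomb = -COST_OF_RIGHT     # -100
```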
One argument against this principle is that CDT endorses following it if you must, but would prefer to self-modify to stop following it (since doing so has higher expected causal utility).
A more general argument against the Bomb intuition pump is that it involves trading away larger amounts of utility in most possible world-states, in order to get a smaller amount of utility in the Bomb world-state.
This just seems to be the point that R_CDT is self-effacing: It says that people should not follow P_CDT, because following other decision procedures will produce better outcomes in expectation.
I definitely agree that R_CDT is self-effacing in this way (at least in certain scenarios). The question is just whether self-effacingness or failure to satisfy "Don't Make Things Worse" is more relevant when trying to judge the likelihood of a criterion of rightness being correct. I'm not sure whether it's possible to do much here other than present personal intuitions.
The point that R_UDT violates the "Don't Make Things Worse" principle only infrequently seems relevant, but I'm still not sure this changes my intuitions very much.
If we're going to end up stuck with UDT plus extra theoretical ugliness and loss-of-utility tacked on top, then why not just switch to UDT full stop?
I may just be missing something, but I don't see what this theoretical ugliness is. And I don't intuitively find the ugliness/elegance of the decision procedure recommended by a criterion of rightness to be very relevant when trying to judge whether the criterion is correct.
[[EDIT: Just an extra thought on the fact that R_CDT is self-effacing. My impression is that self-effacingness is typically regarded as a relatively weak reason to reject a moral theory. For example, a lot of people regard utilitarianism as self-effacing both because it's costly to directly evaluate the utility produced by actions and because others often react poorly to people who engage in utilitarian-style reasoning -- but this typically isn't regarded as a slam-dunk reason to believe that utilitarianism is false. I think the SEP article on consequentialism is expressing a pretty mainstream position when it says: "[T]here is nothing incoherent about proposing a decision procedure that is separate from one's criterion of the right.... Criteria can, thus, be self-effacing without being self-refuting." Insofar as people don't tend to buy self-effacingness as a slam-dunk argument against the truth of moral theories, it's not clear why they should buy it as a slam-dunk argument against the truth of normative decision theories.]]
is more relevant when trying to judge the likelihood of a criterion of rightness being correct
Sorry to drop in in the middle of this back and forth, but I am curious -- do you think it's quite likely that there is a single criterion of rightness that is objectively "correct"?
It seems to me that we have a number of intuitive properties (meta criteria of rightness?) that we would like a criterion of rightness to satisfy (e.g. "don't make things worse", or "don't be self-effacing"). And so far there doesn't seem to be any single criterion that satisfies all of them.
So why not just conclude that, similar to the case with voting and Arrow's theorem, perhaps there's just no single perfect criterion of rightness.
In other words, once we agree that CDT doesn't make things worse, but that UDT is better as a general policy, is there anything left to argue about regarding which is "correct"?
EDIT: Decided I had better go and read your Realism and Rationality post, and ended up leaving a lengthy comment there.
Sorry to drop in in the middle of this back and forth, but I am curious -- do you think it's quite likely that there is a single criterion of rightness that is objectively "correct"?
It seems to me that we have a number of intuitive properties (meta criteria of rightness?) that we would like a criterion of rightness to satisfy (e.g. "don't make things worse", or "don't be self-effacing"). And so far there doesn't seem to be any single criterion that satisfies all of them.
So why not just conclude that, similar to the case with voting and Arrow's theorem, perhaps there's just no single perfect criterion of rightness.
Happy to be dropped in on :)
I think it's totally conceivable that no criterion of rightness is correct (e.g. because the concept of a "criterion of rightness" turns out to be some spooky bit of nonsense that doesn't really map onto anything in the real world.)
I suppose the main things I'm arguing are just that:
When a philosopher expresses support for a "decision theory," they are typically saying that they believe some claim about what the correct criterion of rightness is.
Claims about the correct criterion of rightness are distinct from decision procedures.
Therefore, when a member of the rationalist community uses the word "decision theory" to refer to a decision procedure, they are talking about something that's pretty conceptually distinct from what philosophers typically have in mind. Discussions about what decision procedure performs best or about what decision procedure we should build into future AI systems [[EDIT: or what decision procedure most closely matches our preferences about decision procedures]] don't directly speak to the questions that most academic "decision theorists" are actually debating with one another.
I also think that, conditional on there being a correct criterion of rightness, R_CDT is more plausible than R_UDT. But this is a relatively tentative view. I'm definitely not a super hardcore R_CDT believer.
It seems to me that we have a number of intuitive properties (meta criteria of rightness?) that we would like a criterion of rightness to satisfy (e.g. "don't make things worse", or "don't be self-effacing"). And so far there doesn't seem to be any single criterion that satisfies all of them.
So why not just conclude that, similar to the case with voting and Arrow's theorem, perhaps there's just no single perfect criterion of rightness.
I guess here -- in almost definitely too many words -- is how I think about the issue. (Hopefully these comments are at least somewhat responsive to your question.)
It seems like the following general situation is pretty common: Someone is initially inclined to think that anything with property P will also have properties Q1 and Q2. But then they realize that properties Q1 and Q2 are inconsistent with one another.
One possible reaction to this situation is to conclude that nothing actually has property P. Maybe the idea of property P isn't even conceptually coherent and we should stop talking about it (while continuing to independently discuss properties Q1 and Q2). Often the more natural reaction, though, is to continue to believe that some things have property P -- but just drop the assumption that these things will also have both property Q1 and property Q2.
This is obviously a pretty abstract description, so I'll give a few examples. (No need to read the examples if the point seems obvious.)
Ethics: I might initially be inclined to think that it's always ethical (property P) to maximize happiness and that it's always unethical to torture people. But then I may realize that there's an inconsistency here: in at least rare circumstances, such as ticking time-bomb scenarios where torture can extract crucial information, there may be no decision that is both happiness-maximizing (Q1) and torture-avoiding (Q2). It seems like a natural reaction here is just to drop either the belief that maximizing happiness is always ethical or that torture is always unethical. It doesn't seem like I need to abandon my belief that some actions have the property of being ethical.
Theology: I might initially be inclined to think that God is all-knowing, all-powerful, and all-good. But then I might come to believe (whether rightly or not) that, given the existence of evil, these three properties are inconsistent. I might then continue to believe that God exists, but just drop my belief that God is all-good. (To very awkwardly re-express this in the language of properties: This would mean dropping my belief that any entity that has the property of being God also has the property of being all-good).
Politician-bashing: I might initially be inclined to characterize some politician both as an incompetent leader and as someone who's successfully carrying out an evil long-term plan to transform the country. Then I might realize that these two characterizations are in tension with one another. A pretty natural reaction, then, might be to continue to believe the politician exists -- but just drop my belief that they're incompetent.
To turn to the case of the decision-theoretic criterion of rightness, I might initially be inclined to think that the correct criterion of rightness will satisfy both "Don't Make Things Worse" and "No Self-Effacement." It's now become clear, though, that no criterion of rightness can satisfy both of these principles. I think it's pretty reasonable, then, to continue to believe that there's a correct criterion of rightness -- but just drop the belief that the correct criterion of rightness will also satisfy "No Self-Effacement."
It seems like the following general situation is pretty common: Someone is initially inclined to think that anything with property P will also have properties Q1 and Q2. But then they realize that properties Q1 and Q2 are inconsistent with one another.
One possible reaction to this situation is to conclude that nothing actually has property P. Maybe the idea of property P isn't even conceptually coherent and we should stop talking about it (while continuing to independently discuss properties Q1 and Q2). Often the more natural reaction, though, is to continue to believe that some things have property P -- but just drop the assumption that these things will also have both property Q1 and property Q2.
I think I disagree with the claim (or implication) that keeping P is more often more natural. Well, you're just saying it's "often" natural, and I suppose it's natural in some cases and not others. But I think we may disagree on how often it's natural, though hard to say at this very abstract level. (Did you see my comment in response to your Realism and Rationality post?)
In particular, I'm curious what makes you optimistic about finding a "correct" criterion of rightness. In the case of the politician, it seems clear that learning they don't have some of the properties you thought they had shouldn't call into question whether they exist at all.
But for the case of a criterion of rightness, my intuition (informed by the style of thinking in my comment) is that there's no particular reason to think there should be one criterion that obviously fits the bill. Your intuition seems to be the opposite, and I'm not sure I understand why.
My best guess, particularly informed by reading through footnote 15 on your Realism and Rationality post, is that when faced with ethical dilemmas (like your torture vs lollipop examples), it seems like there is a correct answer. Does that seem right?
(I realize at this point we're talking about intuitions and priors on a pretty abstract level, so it may be hard to give a good answer.)
I think I disagree with the claim (or implication) that keeping P is more often more natural. Well, you're just saying it's "often" natural, and I suppose it's natural in some cases and not others. But I think we may disagree on how often it's natural, though hard to say at this very abstract level. (Did you see my comment in response to your Realism and Rationality post?)
In particular, I'm curious what makes you optimistic about finding a "correct" criterion of rightness. In the case of the politician, it seems clear that learning they don't have some of the properties you thought they had shouldn't call into question whether they exist at all.
But for the case of a criterion of rightness, my intuition (informed by the style of thinking in my comment) is that there's no particular reason to think there should be one criterion that obviously fits the bill. Your intuition seems to be the opposite, and I'm not sure I understand why.
Hey again!
I appreciated your comment on the LW post. I started writing up a response to this comment and your LW one, back when the thread was still active, and then stopped because it had become obscenely long. Then I ended up badly needing to procrastinate doing something else today. So here's an over-long document I probably shouldn't have written, which you are under no social obligation to read.
I think there's a key piece of your thinking that I don't quite understand / disagree with, and it's the idea that normativity is irreducible.
I think I follow you that if normativity were irreducible, then it wouldn't be a good candidate for abandonment or revision. But that seems almost like begging the question. I don't understand why it's irreducible.
Suppose normativity is not actually one thing, but is a jumble of 15 overlapping things that sometimes come apart. This doesn't seem like it poses any challenge to your intuitions from footnote 6 in the document (starting with "I personally care a lot about the question: 'Is there anything I should do, and, if so, what?'"). And at the same time it explains why there are weird edge cases where the concept seems to break down.
So few things in life seem to be irreducible. (E.g. neither Eric nor Ben is irreducible!) So why would normativity be?
[You also should feel under no social obligation to respond, though it would be fun to discuss this the next time we find ourselves at the same party, should such a situation arise.]
This is a good discussion! Ben, thank you for inspiring so many of these different paths we've been going down. :) At some point the hydra will have to stop growing, but I do think the intuitions you've been sharing are widespread enough that it's very worthwhile to have public discussion on these points.
Therefore, when a member of the rationalist community uses the word "decision theory" to refer to a decision procedure, they are talking about something that's pretty conceptually distinct from what philosophers typically have in mind. Discussions about what decision procedure performs best or about what decision procedure we should build into future AI systems don't directly speak to the questions that most academic "decision theorists" are actually debating with one another.
On the contrary:
MIRI is more interested in identifying generalizations about good reasoning ("criteria of rightness") than in fully specifying a particular algorithm.
MIRI does discuss decision algorithms in order to better understand decision-making, but this isn't different in kind from the ordinary way decision theorists hash things out. E.g., the traditional formulation of CDT is underspecified in dilemmas like Death in Damascus. Joyce and Arntzenius' response to this wasn't to go "algorithms are uncouth in our field"; it was to propose step-by-step procedures that they think capture the intuitions behind CDT and give satisfying recommendations for how to act.
MIRI does discuss "what decision procedure performs best", but this isn't any different from traditional arguments in the field like "naive EDT is wrong because it performs poorly in the smoking lesion problem". Compared to the average decision theorist, the average rationalist puts somewhat more weight on some considerations and less weight on others, but this isn't different in kind from the ordinary disagreements that motivate different views within academic decision theory, and these disagreements about what weight to give categories of consideration are themselves amenable to argument.
As I noted above, MIRI is primarily interested in decision theory for the sake of better understanding the nature of intelligence, optimization, embedded agency, etc., not for the sake of picking a "decision theory we should build into future AI systems". Again, this doesn't seem unlike the case of philosophers who think that decision theory arguments will help them reach conclusions about the nature of rationality.
I think it's totally conceivable that no criterion of rightness is correct (e.g. because the concept of a "criterion of rightness" turns out to be some spooky bit of nonsense that doesn't really map onto anything in the real world).
Could you give an example of what the correctness of a meta-criterion like "Don't Make Things Worse" could in principle consist in?
I'm not looking here for a "reduction" in the sense of a full translation into other, simpler terms. I just want a way of making sense of how human brains can tell what's "decision-theoretically normative" in cases like this.
Human brains didn't evolve to have a primitive "normativity detector" that beeps every time a certain thing is Platonically Normative. Rather, different kinds of normativity can be understood by appeal to unmysterious matters like "things brains value as ends", "things that are useful for various ends", "things that accurately map states of affairs"...
When I think of other examples of normativity, my sense is that in every case there's at least one good account of why a human might be able to distinguish "truly" normative things from non-normative ones. E.g. (considering both epistemic and non-epistemic norms):
1. If I discover two alien species who disagree about the truth-value of "carbon atoms have six protons", I can evaluate their correctness by looking at the world and seeing whether their statement matches the world.
2. If I discover two alien species who disagree about the truth-value of "pawns cannot move backwards in chess" or "there are statements in the language of Peano arithmetic that can neither be proved nor disproved in Peano arithmetic", then I can explain the rules of "proving things about chess" or "proving things about PA" as a symbol game, and write down strings of symbols that collectively constitute a "proof" of the statement in question.
I can then assert that if any member of any species plays the relevant "proof" game using the same rules, from now until the end of time, they will never prove the negation of my result, and (paper, pen, time, and ingenuity allowing) they will always be able to re-prove my result.
(I could further argue that these symbol games are useful ones to play, because various practical tasks are easier once we've accumulated enough knowledge about legal proofs in certain games. This usefulness itself provides a criterion for choosing between "follow through on the proof process" and "just start doodling things or writing random letters down".)
The above doesn't answer questions like "do the relevant symbols have Platonic objects as truthmakers or referents?", or "why do we live in a consistent universe?", or the like. But the above answer seems sufficient for rejecting any claim that there's something pointless, epistemically suspect, or unacceptably human-centric about affirming Gödel's first incompleteness theorem. The above is minimally sufficient grounds for going ahead and continuing to treat math as something more significant than theology, regardless of whether we then go on to articulate a more satisfying explanation of why these symbol games work the way they do.
3. If I discover two alien species who disagree about the truth-value of "suffering is terminally valuable", then I can think of at least two concrete ways to evaluate which parties are correct. First, I can look at the brains of a particular individual or group, see what that individual or group terminally values, and see whether the statement matches what's encoded in those brains. Commonly the group I use for this purpose is human beings, such that if an alien (or a housecat, etc.) terminally values suffering, I say that this is "wrong".
Alternatively, I can make different "wrong" predicates for each species: wrong_human, wrong_alien1, wrong_alien2, wrong_housecat, etc.
This has the disadvantage of maybe making it sound like all these values are on "equal footing" in an internally inconsistent way ("it's wrong to put undue weight on what's wrong_human!", where the first "wrong" is secretly standing in for "wrong_human"), but has the advantage of making it easy to see why the aliens' disagreement might be important and substantive, while still allowing that aliens' normative claims can be wrong (because they can be mistaken about their own core values).
The details of how to go from a brain to an encoding of "what's right" seem incredibly complex and open to debate, but it seems beyond reasonable dispute that if the information content of a set of terminal values is encoded anywhere in the universe, it's going to be in brains (or constructs from brains) rather than in patterns of interstellar dust, digits of pi, physical laws, etc.
If a criterion like "Don't Make Things Worse" deserves a lot of weight, I want to know what that weight is coming from.
If the answer is "I know it has to come from something, but I don't know what yet", then that seems like a perfectly fine placeholder answer to me.
If the answer is "This is like the 'terminal values' case, in that (I hypothesize) it's just an ineradicable component of what humans care about", then that also seems structurally fine, though I'm extremely skeptical of the claim that the "warm glow of feeling causally efficacious" is important enough to outweigh other things of great value in the real world.
If the answer is "I think 'Don't Make Things Worse' is instrumentally useful, i.e., more useful than UDT for achieving the other things humans want in life", then I claim this is just false. But, again, this seems like the right kind of argument to be making; if CDT is better than UDT, then that betterness ought to consist in something.
I mostly agree with this. I think the disagreement between CDT and FDT/UDT advocates is less about definitions, and more about which of these things feels more compelling:
1. On the whole, FDT/UDT ends up with more utility.
(I think this intuition tends to hold more force with people the more emotionally salient "more utility" is to you. E.g., consider a version of Newcomb's problem where two-boxing gets you $100, while one-boxing gets you $100,000 and saves your child's life.)
2. I'm not the slave of my decision theory, or of the predictor, or of any environmental factor; I can freely choose to do anything in any dilemma, and by choosing to not leave money on the table (e.g., in a transparent Newcomb problem with a 1% chance of predictor failure where I've already observed that the second box is empty), I'm "getting away with something" and getting free utility that the FDT agent would miss out on.
(I think this intuition tends to hold more force with people the more emotionally salient it is to imagine the dollars sitting right there in front of you and you knowing that it's "too late" for one-boxing to get you any more utility in this world.)
There are other considerations too, like how much it matters to you that CDT isn't self-endorsing. CDT prescribes self-modifying in all future dilemmas so that you behave in a more UDT-like way. It's fine to say that you personally lack the willpower to follow through once you actually get into the dilemma and see the boxes sitting in front of you; but it's still the case that a sufficiently disciplined and foresightful CDT agent will generally end up behaving like FDT in the very dilemmas that have been cited to argue for CDT.
If a more disciplined and well-prepared version of you would have one-boxed, then isn't there something off about saying that two-boxing is in any sense "correct"? Even the act of praising CDT seems a bit self-destructive here, inasmuch as (a) CDT prescribes ditching CDT, and (b) realistically, praising or identifying with CDT is likely to make it harder for a human being to follow through on switching to son-of-CDT (as CDT prescribes).
Mind you, if the sentence "CDT is the most rational decision theory" is true in some substantive, non-trivial, non-circular sense, then I'm inclined to think we should acknowledge this truth, even if it makes it a bit harder to follow through on the EDT+CDT+UDT prescription to one-box in strictly-future Newcomblike problems. When the truth is inconvenient, I tend to think it's better to accept that truth than to linguistically conceal it.
But the arguments I've seen for "CDT is the most rational decision theory" to date have struck me as either circular, or as reducing to "I know CDT doesn't get me the most utility, but something about it just feels right".
It's fine, I think, if "it just feels right" is meant to be a promissory note for some forthcoming account -- a clue that there's some deeper reason to favor CDT, though we haven't discovered it yet. As the FDT paper puts it:
These are odd conclusions. It might even be argued that sufficiently odd behavior provides evidence that what FDT agents see as "rational" diverges from what humans see as "rational." And given enough divergence of that sort, we might be justified in predicting that FDT will systematically fail to get the most utility in some as-yet-unknown fair test.
On the other hand, if "it just feels right" is meant to be the final word on why "CDT is the most rational decision theory", then I feel comfortable saying that "rational" is a poor choice of word here, and neither maps onto a key descriptive category nor maps onto any prescription or norm worthy of being followed.
My impression is that most CDT advocates who know about FDT think FDT is making some kind of epistemic mistake, where the most popular candidate (I think) is some version of magical thinking.
Superstitious people often believe that it's possible to directly causally influence things across great distances of time and space. At a glance, FDT's prescription ("one-box, even though you can't causally affect whether the box is full") as well as its account of how and why this works ("you can somehow 'control' the properties of abstract objects like 'decision functions'") seem weird and spooky in the manner of a superstition.
FDT's response: if a thing seems spooky, that's a fine first-pass reason to be suspicious of it. But at some point, the accusation of magical thinking has to cash out in some sort of practical, real-world failure -- in the case of decision theory, some systematic loss of utility that isn't balanced by an equal, symmetric loss of utility from CDT. After enough experience of seeing a tool outperforming the competition in scenario after scenario, at some point calling the use of that tool "magical thinking" starts to ring rather hollow. At that point, it's necessary to consider the possibility that FDT is counter-intuitive but correct (like Einstein's "spukhafte Fernwirkung"), rather than magical.
In turn, FDT advocates tend to think the following reflects an epistemic mistake by CDT advocates:
2. I'm not the slave of my decision theory, or of the predictor, or of any environmental factor; I can freely choose to do anything in any dilemma, and by choosing to not leave money on the table (e.g., in a transparent Newcomb problem with a 1% chance of predictor failure where I've already observed that the second box is empty), I'm "getting away with something" and getting free utility that the FDT agent would miss out on.
The alleged mistake here is a violation of naturalism. Humans tend to think of themselves as free Cartesian agents acting upon the world, rather than as deterministic subprocesses of a larger deterministic process. If we consistently and whole-heartedly accepted the "deterministic subprocess" view of our decision-making, we would find nothing strange about the idea that it's sometimes right for this subprocess to do locally incorrect things for the sake of better global results.
E.g., consider the transparent Newcomb problem with a 1% chance of predictor error. If we think of the brain's decision-making as a rule-governed system whose rules we are currently determining (via a meta-reasoning process that is itself governed by deterministic rules), then there's nothing strange about enacting a rule that gets us $1M in 99% of outcomes and $0 in 1% of outcomes; and following through when the unlucky 1% scenario hits us is nothing to agonize over, it's just a consequence of the rule we already decided. In that regard, steering the rule-governed system that is your brain is no different than designing a factory robot that performs well enough in 99% of cases to offset the 1% of cases where something goes wrong.
(Note how a lot of these points are more intuitive in CS language. I don't think it's a coincidence that people coming from CS were able to improve on academic decision theory's ideas on these points; I think it's related to what kinds of stumbling blocks get in the way of thinking in these terms.)
Suppose you initially tell yourself:
"I'm going to one-box in all strictly-future transparent Newcomb problems, since this produces more expected causal (and evidential, and functional) utility. One-boxing and receiving $1M in 99% of future states is worth the $1000 cost of one-boxing in the other 1% of future states."
Suppose that you then find yourself facing the 1%-likely outcome where Omega leaves the box empty regardless of your choice. You then have a change of heart and decide to two-box after all, taking the $1000.
I claim that the above description feels from the inside like your brain is escaping the iron chains of determinism (even if your scientifically literate system-2 verbal reasoning fully recognizes that you're a deterministic process). And I claim that this feeling (plus maybe some reluctance to fully accept the problem description as accurate?) is the only thing that makes CDT's decision seem reasonable in this case.
In reality, however, if we end up not following through on our verbal commitment and we two-box in that 1% scenario, then this would just prove that we'd been mistaken about what rule we had successfully installed in our brains. As it turns out, we were really following the lower-global-utility rule from the outset. A lack of follow-through or a failure of will is itself a part of the decision-making process that Omega is predicting; however much it feels as though a last-minute swerve is you "getting away with something", it's really just you deterministically following through on an algorithm that will get you less utility in 99% of scenarios (while happening to be bad at predicting your own behavior and bad at following through on verbalized plans).
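To make the expected-utility comparison in the passage above concrete, here is a minimal sketch (in Python). It assumes the payoffs as the dilemma is usually stated -- $1,000,000 in the big box, an extra $1,000 from two-boxing, and a 1% predictor error rate -- and simply compares the ex-ante value of the two rules one could install:

```python
# Minimal sketch: ex-ante value of the two rules in the transparent
# Newcomb problem with a 1% chance of predictor error.
# Assumed payoffs: the big box holds $1,000,000 iff the predictor
# expects the agent to one-box; two-boxing always adds $1,000.

ERROR_RATE = 0.01
BIG, SMALL = 1_000_000, 1_000

def expected_value(rule_one_boxes: bool) -> float:
    """Expected payoff of installing a rule before the prediction is made."""
    p_correct = 1 - ERROR_RATE
    if rule_one_boxes:
        # Predictor correct: big box is full, agent takes only it.
        # Predictor errs: big box is empty, agent still one-boxes and gets $0.
        return p_correct * BIG + ERROR_RATE * 0
    # Predictor correct: big box is empty, agent two-boxes for $1,000.
    # Predictor errs: big box is full, agent two-boxes for $1,001,000.
    return p_correct * SMALL + ERROR_RATE * (BIG + SMALL)

print(round(expected_value(True)))   # 990000
print(round(expected_value(False)))  # 11000
```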
I should emphasize that the above is my own attempt to characterize the intuitions behind CDT and FDT, based on the arguments I've seen in the wild and based on what makes me feel more compelled by CDT, or by FDT. I could easily be wrong about the crux of disagreement between some CDT and FDT advocates.
In turn, FDT advocates tend to think the following reflects an epistemic mistake by CDT advocates:
I'm not the slave of my decision theory, or of the predictor, or of any environmental factor; I can freely choose to do anything in any dilemma, and by choosing to not leave money on the table (e.g., in a transparent Newcomb problem with a 1% chance of predictor failure where I've already observed that the second box is empty), I'm "getting away with something" and getting free utility that the FDT agent would miss out on.
The alleged mistake here is a violation of naturalism. Humans tend to think of themselves as free Cartesian agents acting upon the world, rather than as deterministic subprocesses of a larger deterministic process. If we consistently and whole-heartedly accepted the "deterministic subprocess" view of our decision-making, we would find nothing strange about the idea that it's sometimes right for this subprocess to do locally incorrect things for the sake of better global results.
Is the following a roughly accurate re-characterization of the intuition here?
"Suppose that there's an agent that implements P_UDT. Because it is following P_UDT, when it enters the box room it finds a ton of money in the first box and then refrains from taking the money in the second box. People who believe R_CDT claim that the agent should have also taken the money in the second box. But, given that the universe is deterministic, this doesn't really make sense. From before the moment the agent entered the room, it was already determined that the agent would one-box. Since (in a physically deterministic sense) the P_UDT agent could not have two-boxed, there's no relevant sense in which the agent should have two-boxed."
If so, then I suppose my first reaction is that this seems like a general argument against normative realism rather than an argument against any specific proposed criterion of rightness. It also applies, for example, to the claim that a P_CDT agent "should have" one-boxed -- since in a physically deterministic sense it could not have. Therefore, I think it's probably better to think of this as an argument against the truth (and possibly conceptual coherence) of both R_CDT and R_UDT, rather than an argument that favors one over the other.
In general, it seems to me like all statements that evoke counterfactuals have something like this problem. For example, it is physically determined what sort of decision procedure we will build into any given AI system; only one choice of decision procedure is physically consistent with the state of the world at the time the choice is made. So -- insofar as we accept this kind of objection from determinism -- there seems to be something problematically non-naturalistic about discussing what "would have happened" if we built in one decision procedure or another.
Since (in a physically deterministic sense) the P_UDT agent could not have two-boxed, there's no relevant sense in which the agent should have two-boxed."
No, I don't endorse this argument. To simplify the discussion, let's assume that the Newcomb predictor is infallible. FDT agents, CDT agents, and EDT agents each get a decision: two-box (which gets you $1000 plus an empty box), or one-box (which gets you $1,000,000 and leaves the $1000 behind). Obviously, insofar as they are in fact following the instructions of their decision theory, there's only one possible outcome; but it would be odd to say that a decision stops being a decision just because it's determined by something. (What's the alternative?)
I do endorse "given the predictor's perfect accuracy, it's impossible for the P_UDT agent to two-box and come away with $1,001,000". I also endorse "given the predictor's perfect accuracy, it's impossible for the P_CDT agent to two-box and come away with $1,001,000". Per the problem specification, no agent can two-box and get $1,001,000 or one-box and get $0. But this doesn't mean that no decision is made; it just means that the predictor can predict the decision early enough to fill the boxes accordingly.
(Notably, the agent following P_CDT two-boxes because $1,001,000 > $1,000,000 and $1000 > $0, even though this "dominance" argument appeals to two outcomes that are known to be impossible just from the problem statement. I certainly don't think agents "should" try to achieve outcomes that are impossible from the problem specification itself. The reason non-CDT agents get more utility than CDT agents in Newcomb's problem is that they take into account that the predictor is a predictor when they construct their counterfactuals.)
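One way to see what the parenthetical above is pointing at: with an infallible predictor, only two of the four (action, box-contents) combinations are consistent with the problem statement, and the dominance argument leans on the other two. A minimal sketch, assuming the standard $1,000,000 / $1,000 payoffs:

```python
# Which (action, box contents) pairs are consistent with Newcomb's problem
# when the predictor is infallible? Assumed payoffs: $1,000,000 / $1,000.

from itertools import product

BIG, SMALL = 1_000_000, 1_000

for action, box_full in product(["one-box", "two-box"], [True, False]):
    consistent = box_full == (action == "one-box")  # perfect prediction
    payoff = (BIG if box_full else 0) + (SMALL if action == "two-box" else 0)
    status = "possible" if consistent else "impossible"
    print(f"{action}, big box full: {box_full} -> ${payoff:,} ({status})")
```

The two rows marked impossible ($1,001,000 and $0) are exactly the outcomes the dominance comparison appeals to.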
In the transparent version of this dilemma, the agent who sees the $1M and one-boxes also "could have two-boxed", but if they had two-boxed, it would only have been after making a different observation. In that sense, if the agent has any lingering uncertainty about what they'll choose, the uncertainty goes away as soon as they see whether the box is full.
In general, it seems to me like all statements that evoke counterfactuals have something like this problem. For example, it is physically determined what sort of decision procedure we will build into any given AI system; only one choice of decision procedure is physically consistent with the state of the world at the time the choice is made. So -- insofar as we accept this kind of objection from determinism -- there seems to be something problematically non-naturalistic about discussing what "would have happened" if we built in one decision procedure or another.
No, there's nothing non-naturalistic about this. Consider the scenario you and I are in. Simplifying somewhat, we can think of ourselves as each doing meta-reasoning to try to choose between different decision algorithms to follow going forward; where the new things we learn in this conversation are themselves a part of that meta-reasoning.
The meta-reasoning process is deterministic, just like the object-level decision algorithms are. But this doesn't mean that we can't choose between object-level decision algorithms. Rather, the meta-reasoning (in spite of having deterministic causes) chooses either "I think I'll follow P_FDT from now on" or "I think I'll follow P_CDT from now on". Then the chosen decision algorithm (in spite of also having deterministic causes) outputs choices about subsequent actions to take. Meta-processes that select between decision algorithms (to put into an AI, or to run in your own brain, or to recommend to other humans, etc.) can make "real decisions", for exactly the same reason (and in exactly the same sense) that the decision algorithms in question can make real decisions.
It isn't problematic that all these processes require us to consider counterfactuals that (if we were omniscient) we would perceive as inconsistent/impossible. Deliberation, both at the object level and at the meta level, just is the process of determining the unique and only possible decision. Yet because we are uncertain about the outcome of the deliberation while deliberating, and because the details of the deliberation process do determine our decision (even as these details themselves have preceding causes), it feels from the inside of this process as though both options are "live", are possible, until the very moment we decide.
I certainly don't think agents "should" try to achieve outcomes that are impossible from the problem specification itself.
I think you need to make a clearer distinction here between "outcomes that don't exist in the universe's dynamics" (like taking both boxes and receiving $1,001,000) and "outcomes that can't exist in my branch" (like there not being a bomb in the unlucky case). Because if you're operating just in the branch you find yourself in, many outcomes whose probability an FDT agent is trying to affect are impossible from the problem specification (once you include observations).
And, to be clear, I do think agents "should" try to achieve outcomes that are impossible from the problem specification including observations, if certain criteria are met, in a way that basically lines up with FDT, just like agents "should" try to achieve outcomes that are already known to have happened from the problem specification including observations.
As an example, if you're in Parfit's Hitchhiker, you should pay once you reach town, even though reaching town has probability 1 in cases where you're deciding whether or not to pay, and the reason for this is that it was necessary for reaching town to have had probability 1.
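For readers who don't have the dilemma cached, here is a minimal sketch of the Parfit's Hitchhiker point, with made-up numbers (a $100 fee and an arbitrary dollar value on being rescued) and a perfect predictor for simplicity:

```python
# Parfit's Hitchhiker, minimal sketch with assumed numbers.
# The driver rescues you from the desert iff he predicts that you will
# pay him $100 once you reach town.

RESCUE_VALUE = 1_000_000  # assumed dollar-equivalent value of being rescued
FEE = 100

def value_of_disposition(would_pay_in_town: bool) -> int:
    rescued = would_pay_in_town  # perfect predictor, for simplicity
    return (RESCUE_VALUE - FEE) if rescued else 0

print(value_of_disposition(True))   # 999900
print(value_of_disposition(False))  # 0
```

The point in the comment above is that once you are standing in town, "reaching town" has probability 1 precisely because you are the kind of agent who pays; refusing at that point only looks like free money under a counterfactual that ignores how you got there.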
Notably, the agent following P_CDT two-boxes because $1,001,000 > $1,000,000 and $1000 > $0, even though this "dominance" argument appeals to two outcomes that are known to be impossible just from the problem statement. I certainly don't think agents "should" try to achieve outcomes that are impossible from the problem specification itself.
Suppose that we accept the principle that agents never "should" try to achieve outcomes that are impossible from the problem specification -- with one implication being that it's false that (as R_CDT suggests) agents that see a million dollars in the first box "should" two-box.
This seems to imply that it's also false that (as R_UDT suggests) an agent that sees that the first box is empty "should" one-box. By the problem specification, of course, one-boxing when there is no money in the first box is also an impossible outcome. Since decisions to two-box only occur when the first box is empty, this would then imply that decisions to two-box are never irrational in the context of this problem. But I imagine you don't want to say that.
I think I probably still don't understand your objection here -- so I'm not sure this point is actually responsive to it -- but I initially have trouble seeing what potential violations of naturalism/determinism R_CDT could be committing that R_UDT would not also be committing.
(Of course, just to be clear, both R_UDT and R_CDT imply that the decision to commit yourself to a one-boxing policy at the start of the game would be rational. They only diverge in their judgments of what actual in-room boxing decision would be rational. R_UDT says that the decision to two-box is irrational and R_CDT says that the decision to one-box is irrational.)
But the arguments I've seen for "CDT is the most rational decision theory" to date have struck me as either circular, or as reducing to "I know CDT doesn't get me the most utility, but something about it just feels right".
It seems to me like they're coming down to saying something like: the "Guaranteed Payoffs Principle" / "Don't Make Things Worse Principle" is more core to rational action than being self-consistent. Whereas others think self-consistency is more important.
Mind you, if the sentence "CDT is the most rational decision theory" is true in some substantive, non-trivial, non-circular sense
It's not clear to me that the justification for CDT is more circular than the justification for FDT. Doesn't it come down to which principles you favor?
Maybe you could say FDT is more elegant. Or maybe that it satisfies more of the intuitive properties we'd hope for from a decision theory (where elegance might be one of those). But I'm not sure that would make the justification less-circular per se.
I guess one way the justification for CDT could be more circular is if the key or only principle that pushes in favor of it over FDT can really just be seen as a restatement of CDT in a way that the principles that push in favor of FDT do not. Is that what you would claim?
Whereas others think self-consistency is more important.
The main argument against CDT (in my view) is that it tends to get you less utility (regardless of whether you add self-modification so it can switch to other decision theories). Self-consistency is a secondary issue.
It's not clear to me that the justification for CDT is more circular than the justification for FDT. Doesn't it come down to which principles you favor?
FDT gets you more utility than CDT. If you value literally anything in life more than you value "which ritual do I use to make my decisions?", then you should go with FDT over CDT; that's the core argument.
This argument for FDT would be question-begging if CDT proponents rejected utility as a desirable thing. But instead CDT proponents who are familiar with FDT agree utility is a positive, and either (a) they think there's no meaningful sense in which FDT systematically gets more utility than CDT (which I think is adequately refuted by Abram Demski), or (b) they think that CDT has other advantages that outweigh the loss of utility (e.g., CDT feels more intuitive to them).
The latter argument for CDT isn't circular, but as a fan of utility (i.e., of literally anything else in life), it seems very weak to me.
The main argument against CDT (in my view) is that it tends to get you less utility (regardless of whether you add self-modification so it can switch to other decision theories). Self-consistency is a secondary issue.
I do think the argument ultimately needs to come down to an intuition about self-effacingness.
The fact that agents earn less expected utility if they implement P_CDT than if they implement some other decision procedure seems to support the claim that agents should not implement P_CDT.
But there's nothing logically inconsistent about believing both (a) that R_CDT is true and (b) that agents should not implement P_CDT. To again draw an analogy with a similar case, there's also nothing logically inconsistent about believing both (a) that utilitarianism is true and (b) that agents should not in general make decisions by carrying out utilitarian reasoning.
So why shouldn't I believe that R_CDT is true? The argument needs an additional step. And it seems to me like the most natural additional step here involves an intuition that the criterion of rightness would not be self-effacing.
More formally, it seems like the argument needs to be something along these lines:
1. Over their lifetimes, agents who implement P_CDT earn less expected utility than agents who implement certain other decision procedures.
2. (Assumption) Agents should implement whatever decision procedure will earn them the most expected lifetime utility.
3. Therefore, agents should not implement P_CDT.
4. (Assumption) The criterion of rightness is not self-effacing. Equivalently, if agents should not implement some decision procedure P_X, then it is not the case that R_X is true.
5. Therefore -- as an implication of points (3) and (4) -- R_CDT is not true.
Whether you buy the "No Self-Effacement" assumption in Step 4 -- or, alternatively, the countervailing "Don't Make Things Worse" assumption that supports R_CDT -- seems to ultimately be a matter of intuition. At least, I don't currently know what else people can appeal to here to resolve the disagreement.
[[SIDENOTE: Step 2 is actually a bit ambiguous, since it doesn't specify how expected lifetime utility is being evaluated. For example, are we talking about expected lifetime utility from a causal or evidential perspective? But I don't think this ambiguity matters much for the argument.]]
[[SECOND SIDENOTE: I'm using the phrase "self-effacing" rather than "self-contradictory" here, because I think it's more standard and because "self-contradictory" seems to suggest logical inconsistency.]]
But there's nothing logically inconsistent about believing both (a) that R_CDT is true and (b) that agents should not implement P_CDT.
If the thing being argued for is "R_CDT plus P_SONOFCDT", then that makes sense to me, but is vulnerable to all the arguments I've been making: Son-of-CDT is in a sense the worst of both worlds, since it gets less utility than FDT and lacks CDT's "Don't Make Things Worse" principle.
If the thing being argued for is "R_CDT plus P_FDT", then I don't understand the argument. In what sense is P_FDT compatible with, or conducive to, R_CDT? What advantage does this have over "R_FDT plus P_FDT"? (Indeed, what difference between the two views would be intended here?)
So why shouldn't I believe that R_CDT is true? The argument needs an additional step. And it seems to me like the most natural additional step here involves an intuition that the criterion of rightness would not be self-effacing.
The argument against "R_CDT plus P_SONOFCDT" doesn't require any mention of self-effacingness; it's entirely sufficient to note that P_SONOFCDT gets less utility than P_FDT.
The argument against "R_CDT plus P_FDT" seems to demand some reference to self-effacingness or inconsistency, or triviality / lack of teeth. But I don't understand what this view would mean or why anyone would endorse it (and I don't take you to be endorsing it).
For example, are we talking about expected lifetime utility from a causal or evidential perspective? But I don't think this ambiguity matters much for the argument.
We want to evaluate actual average utility rather than expected utility, since the different decision theories are different theories of what "expected utility" means.
Hm, I think I may have misinterpreted your previous comment as emphasizing the point that P_CDT "gets you less utility" rather than the point that P_SONOFCDT "gets you less utility." So my comment was aiming to explain why I don't think the fact that P_CDT gets less utility provides a strong challenge to the claim that R_CDT is true (unless we accept the "No Self-Effacement Principle"). But it sounds like you might agree that this fact doesn't on its own provide a strong challenge.
If the thing being argued for is "R_CDT plus P_SONOFCDT", then that makes sense to me, but is vulnerable to all the arguments I've been making: Son-of-CDT is in a sense the worst of both worlds, since it gets less utility than FDT and lacks CDT's "Don't Make Things Worse" principle.
In response to the first argument alluded to here: "Gets the most [expected] utility" is ambiguous, as I think we've both agreed.
My understanding is that P_SONOFCDT is definitionally the policy that, if an agent decided to adopt it, would cause the largest increase in expected utility. So -- if we evaluate the expected utility of a decision to adopt a policy from a causal perspective -- it seems to me that P_SONOFCDT "gets the most expected utility."
If we evaluate the expected utility of a policy from an evidential or subjunctive perspective, however, then another policy may "get the most utility" (because policy adoption decisions may be non-causally correlated).
Apologies if I'm off-base, but it reads to me like you might be suggesting an argument along these lines:
1. R_CDT says that it is rational to decide to follow a policy that would not maximize "expected utility" (defined in evidential/subjunctive terms).
2. (Assumption) But it is not rational to decide to follow a policy that would not maximize "expected utility" (defined in evidential/subjunctive terms).
3. Therefore R_CDT is not true.
The natural response to this argument is that it's not clear why we should accept the assumption in Step 2. R_CDT says that the rationality of a decision depends on its "expected utility" defined in causal terms. So someone starting from the position that R_CDT is true obviously won't accept the assumption in Step 2. R_EDT and R_FDT say that the rationality of a decision depends on its "expected utility" defined in evidential or subjunctive terms. So we might allude to R_EDT or R_FDT to justify the assumption, but of course this would also mean arguing backwards from the conclusion that the argument is meant to reach.
Overall at least this particular simple argument -- that R_CDT is false because P_SONOFCDT gets less "expected utility" as defined in evidential/quasi-evidential terms -- would seemingly fail due to circularity. But you may have in mind a different argument.
We want to evaluate actual average utility rather than expected utility, since the different decision theories are different theories of what "expected utility" means.
I felt confused by this comment. Doesn't even R_FDT judge the rationality of a decision by its expected value (rather than its actual value)? And presumably you don't want to say that someone who accepts unpromising gambles and gets lucky (ending up with high actual average utility) has made more "rational" decisions than someone who accepts promising gambles and gets unlucky (ending up with low actual average utility)?
You also correctly point out that the decision procedure that R_CDT implies agents should rationally commit to -- P_SONOFCDT -- sometimes outputs decisions that definitely make things worse. So "Don't Make Things Worse" implies that some of the decisions outputted by P_SONOFCDT are irrational.
But I still don't see what the argument is here unless we're assuming "No Self-Effacement." It still seems to me like we have a few initial steps and then a missing piece.
1. (Observation) R_CDT implies that it is rational to commit to following the decision procedure P_SONOFCDT.
2. (Observation) P_SONOFCDT sometimes outputs decisions that definitely make things worse.
3. (Assumption) It is irrational to take decisions that definitely make things worse. In other words, the "Don't Make Things Worse" Principle is true.
4. Therefore, as an implication of Step 2 and Step 3, P_SONOFCDT sometimes outputs irrational decisions.
5. ???
6. Therefore, R_CDT is false.
The "No Self-Effacement" Principle is equivalent to the principle that: If a criterion of rightness implies that it is rational to commit to a decision procedure, then that decision procedure only produces rational actions. So if we were to assume "No Self-Effacement" in Step 5 then this would allow us to arrive at the conclusion that R_CDT is false. But if we're not assuming "No Self-Effacement," then it's not clear to me how we get there.
Actually, in the context of this particular argument, I suppose we don't really have the option of assuming that "No Self-Effacement" is true -- because this assumption would be inconsistent with the earlier assumption that "Don't Make Things Worse" is true. So I'm not sure it's actually possible to make this argument schema work in any case.
There may be a pretty different argument here, which you have in mind. I at least don't see it yet though.
There may be a pretty different argument here, which you have in mind. I at least don't see it yet though.
Perhaps the argument is something like:
"Don't make things worse" (DMTW) is one of the intuitions that leads us to favoring R_CDT
But the actual policy that R_CDT recommends does not in fact follow DMTW
So R_CDT only gets intuitive appeal from DMTW to the extent that DMTW was about R's, and not about P's
But intuitions are probably(?) not that precisely targeted, so R_CDT shouldn't get to claim the full intuitive endorsement of DMTW. (Yes, DMTW endorses it more than it endorses R_FDT, but R_CDT is still at least somewhat counter-intuitive when judged against the DMTW intuition.)
So R_CDT only gets intuitive appeal from DMTW to the extent that DMTW was about R's, and not about P's
But intuitions are probably(?) not that precisely targeted, so R_CDT shouldn't get to claim the full intuitive endorsement of DMTW. (Yes, DMTW endorses it more than it endorses R_FDT, but R_CDT is still at least somewhat counter-intuitive when judged against the DMTW intuition.)
Here are two logically inconsistent principles that could be true:
Don't Make Things Worse: If a decision would definitely make things worse, then taking that decision is not rational.
Don't Commit to a Policy That In the Future Will Sometimes Make Things Worse: It is not rational to commit to a policy that, in the future, will sometimes output decisions that definitely make things worse.
I have strong intuitions that the first one is true. I have much weaker (comparatively negligible) intuitions that the second one is true. Since they're mutually inconsistent, I reject the second and accept the first. I imagine this is also true of most other people who are sympathetic to R_CDT.
One could argue that R_CDT sympathizers don't actually have much stronger intuitions regarding the first principle than the second -- i.e. that their intuitions aren't actually very "targeted" on the first one -- but I don't think that would be right. At least, it's not right in my case.
A more viable strategy might be to argue for something like a meta-principle:
The "Don't Make Things Worse" Meta-Principle: If you find "Don't Make Things Worse" strongly intuitive, then you should also find "Don't Commit to a Policy That In the Future Will Sometimes Make Things Worse" just about as intuitive.
If the meta-principle were true, then I guess this would sort of imply that people's intuitions in favor of "Don't Make Things Worse" should be self-neutralizing. They should come packaged with equally strong intuitions for another position that directly contradicts it.
But I don't see why the meta-principle should be true. At least, my intuitions in favor of the meta-principle are way less strong than my intuitions in favor of "Don't Make Things Worse" :)
Just to say slightly more on this, I think the Bomb case is again useful for illustrating my (I think not uncommon) intuitions here.
Bomb Case: Omega puts a million dollars in a transparent box if he predicts you'll open it. He puts a bomb in the transparent box if he predicts you won't open it. He's only wrong about one in a trillion times.
Now suppose you enter the room and see that there's a bomb in the box. You know that if you open the box, the bomb will explode and you will die a horrible and painful death. If you leave the room and don't open the box, then nothing bad will happen to you. You'll return to a grateful family and live a full and healthy life. You understand all this. You want so badly to live. You then decide to walk up to the bomb and blow yourself up.
Intuitively, this decision strikes me as deeply irrational. You're intentionally taking an action that you know will cause a horrible outcome that you want badly to avoid. It feels very relevant that you're flagrantly violating the "Don't Make Things Worse" principle.
Now, let's step back a time step. Suppose you know that you're the sort of person who would refuse to kill yourself by detonating the bomb. You might decide that -- since Omega is such an accurate predictor -- it's worth taking a pill to turn yourself into the sort of person who would open the box no matter what, to increase your odds of getting a million dollars. You recognize that this may lead you, in the future, to take an action that makes things worse in a horrifying way. But you calculate that the decision you're making now is nonetheless making things better in expectation.
This decision strikes me as pretty intuitively rational. You're violating the second principle -- the "Don't Commit to a Policy..." Principle -- but this violation just doesn't seem that intuitively relevant or remarkable to me. I personally feel like there is nothing too odd about the idea that it can be rational to commit to violating principles of rationality in the future.
(This is obviously just a description of my own intuitions, as they stand, though.)
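For what it's worth, the ex-ante calculation gestured at in the last two paragraphs can be made explicit. The numbers below are assumptions for illustration only (in particular the dollar-equivalent disutility assigned to a horrible death), and the case is modeled as: Omega fills the box with $1,000,000 iff he predicts you would open it even if it contained a bomb, and he errs with probability 1e-12:

```python
# Bomb case, minimal ex-ante sketch with assumed (illustrative) numbers.

ERROR_RATE = 1e-12           # Omega is wrong about one in a trillion times
MILLION = 1_000_000
DEATH_DISUTILITY = -10**10   # assumed dollar-equivalent of a horrible death

# Disposition A: you would open the box no matter what it contains.
# Omega almost always predicts this and fills the box with money;
# in the rare error case you face a bomb and still open it.
ev_opener = (1 - ERROR_RATE) * MILLION + ERROR_RATE * DEATH_DISUTILITY

# Disposition B: you would refuse to open a box containing a bomb.
# Omega predicts this and leaves a bomb, which you never touch, so you get $0.
ev_refuser = 0.0

print(ev_opener)   # just under 1,000,000
print(ev_refuser)  # 0.0
```

Whether this ex-ante gain makes the in-the-room detonation "rational" is exactly what the two principles above disagree about.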
It feels very relevant that you're flagrantly violating the "Don't Make Things Worse" principle.
By triggering the bomb, you're making things worse from your current perspective, but making things better from the perspective of earlier you. Doesn't that seem strange and deserving of an explanation? The explanation from a UDT perspective is that by updating upon observing the bomb, you actually changed your utility function. You used to care about both the possible worlds where you end up seeing a bomb in the box, and the worlds where you don't. After updating, you think you're either a simulation within Omega's prediction (so your action has no effect on yourself) or you're in the world with a real bomb, and you no longer care about the version of you in the world with a million dollars in the box, and this accounts for the conflict/inconsistency.
Given the human tendency to change our (UDT-)utility functions by updating, it's not clear what to do (or what is right), and I think this reduces UDT's intuitive appeal and makes it less of a slam-dunk over CDT/EDT. But it seems to me that it takes switching to the UDT perspective to even understand the nature of the problem. (Quite possibly this isn't adequately explained in MIRI's decision theory papers.)
Don't Make Things Worse: If a decision would definitely make things worse, then taking that decision is not rational.
Don't Commit to a Policy That In the Future Will Sometimes Make Things Worse: It is not rational to commit to a policy that, in the future, will sometimes output decisions that definitely make things worse.
...
One could argue that R_CDT sympathizers don't actually have much stronger intuitions regarding the first principle than the second -- i.e. that their intuitions aren't actually very "targeted" on the first one -- but I don't think that would be right. At least, it's not right in my case.
I would agree that, with these two principles as written, more people would agree with the first. (And certainly believe you that that's right in your case.)
But I feel like the second doesn't quite capture what I had in mind regarding the DMTW intuition applied to P's.
Consider an alternate version:
If a decision would definitely make things worse, then taking that decision is not good policy.
Or alternatively:
If a decision would definitely make things worse, a rational person would not take that decision.
It seems to me that these two claims are naively intuitive on their face, in roughly the same way that the "... then taking that decision is not rational." version is. And it's only after you've considered prisoners' dilemmas or Newcomb's paradox, etc. that you realize that good policy (or being a rational agent) actually diverges from what's rational in the moment.
(But maybe others would disagree on how intuitive these versions are.)
EDIT: And to spell out my argument a bit more: if several alternate formulations of a principle are each intuitively appealing, and it turns out that whether some claim (e.g. R_CDT is true) is consistent with the principle comes down to the precise formulation used, then it's not quite fair to say that the principle fully endorses the claim and that the claim is not counter-intuitive from the perspective of the original intuition.
Of course, this argument is moot if it's true that the original DMTW intuition was always about rational in-the-moment action, and never about policies or actors. And maybe that's the case? But I think it's a little more ambiguous with the "... is not good policy" or "a rational person would not..." versions than with the "Don't commit to a policy..." version.
EDIT2: Does what I'm trying to say make sense? (I felt like I was struggling a bit to express myself in this comment.)
If the thing being argued for is "R_CDT plus P_SONOFCDT" ... If the thing being argued for is "R_CDT plus P_FDT...
Just as a quick sidenote:
I've been thinking of P_SONOFCDT as, by definition, the decision procedure that R_CDT implies that it is rational to commit to implementing.
If we define P_SONOFCDT this way, then anyone who believes that R_CDT is true must also believe that it is rational to implement P_SONOFCDT.
The belief that R_CDT is true and the belief that it is rational to implement P_FDT would only then be consistent if P_SONOFCDT is equivalent to P_FDT (which of course they aren't). So I would be inclined to say that no one should believe in both the correctness of R_CDT and the rationality of implementing P_FDT.
[[EDIT: Actually, I need to distinguish between the decision procedure that it would be rational to commit to yourself and the decision procedure that it would be rational to build into other agents. These can sometimes be different. For example, suppose that R_CDT is true and that you're building twin AI systems and you would like them both to succeed. Then it would be rational for you to give them decision procedures that will cause them to cooperate if they face each other in a prisoner's dilemma (e.g. some version of P_FDT). But if R_CDT is true and you've just been born into the world as one of the twins, it would be rational for you to commit to a decision procedure that would cause you to defect if you face the other AI system in a prisoner's dilemma (i.e. P_SONOFCDT). I slightly edited the above comment to reflect this. My tentative view -- which I've alluded to above -- is that the various proposed criteria of rightness don't in practice actually diverge all that much when it comes to the question of what sorts of decision procedures we should build into AI systems. Although I also understand that MIRI is not mainly interested in the question of what sorts of decision procedures we should build into AI systems.]]
Another way to express the distinction I have in mind is that it's between (a) a normative claim and (b) a process of making decisions.
This is similar to how you described it here:
Let's suppose that some decisions are rational and others aren't. We can then ask: What is it that makes a decision rational? What are the necessary and/or sufficient conditions? I think that this is the question that philosophers are typically trying to answer. [...]
When philosophers talk about "CDT," for example, they are typically talking about a proposed criterion of rightness. Specifically, in this context, "CDT" is the claim that a decision is rational iff taking it would cause the largest expected increase in value. To avoid any ambiguity, let's label this claim R_CDT.
We can also talk about "decision procedures." A decision procedure is just a process or algorithm that an agent follows when making decisions.
This seems like it should instead be a 2x2 grid: something can be either normative or non-normative, and if it's normative, it can be either an algorithm/procedure that's being recommended, or a criterion of rightness like "a decision is rational iff taking it would cause the largest expected increase in value" (which we can perhaps think of as generalizing over a set of algorithms, and saying all the algorithms in a certain set are "normative" or "endorsed").
Some of your discussion above seems to be focusing on the "algorithmic?" dimension, while other parts seem focused on "normative?". I'll say more about "normative?" here.
The reason I proposed the three distinctions in my last comment and organized my discussion around them is that I think they're pretty concrete and crisply defined. It's harder for me to accidentally switch topics or bundle two different concepts together when talking about "trying to optimize vs. optimizing as a side-effect", "directly optimizing vs. optimizing via heuristics", "initially optimizing vs. self-modifying to optimize", or "function vs. algorithm".
In contrast, I think "normative" and "rational" can mean pretty different things in different contexts, it's easy to accidentally slide between different meanings of them, and their abstractness makes it easy to lose track of what's at stake in the discussion.
E.g., "normative" is often used in the context of human terminal values, and it's in this context that statements like this ring obviously true:
I guess my view here is that exploring normative claims will probably only be pretty indirectly useful for understanding "how decision-making works," since normative claims don't typically seem to have any empirical/mathematical/etc. implications. For example, to again use a non-decision-theoretic example, I don't think that learning that hedonistic utilitarianism is true would give us much insight into the computer science or cognitive science of decision-making.
If we're treating decision-theoretic norms as being like moral norms, then sure. I think there are basically three options:
1. Decision theory isn't normative.
2. Decision theory is normative in the way that "murder is bad" or "improving aggregate welfare is good" is normative, i.e., it expresses an arbitrary terminal value of human beings.
3. Decision theory is normative in the way that game theory, probability theory, Boolean logic, the scientific method, etc. are normative (at least for beings that want accurate beliefs); or in the way that the rules and strategies of chess are normative (at least for beings that want to win at chess); or in the way that medical recommendations are normative (at least for beings that want to stay healthy).
Probability theory has obvious normative force in the context of reasoning and decision-making, but it's not therefore arbitrary or irrelevant to understanding human cognition, AI, etc.
A lot of the examples you've cited are theories from moral philosophy about what's terminally valuable. But decision theory is generally thought of as the study of how to make the right decisions, given a set of terminal preferences; it's not generally thought of as the study of which decision-making methods humans happen to terminally prefer to employ. So I would put it in category 1 or 3.
You could indeed define an agent that terminally values making CDT-style decisions, but I don't think most proponents of CDT or EDT would claim that their disagreement with UDT/FDT comes down to a values disagreement like that. Rather, they'd claim that rival decision theorists are making some variety of epistemic mistake. (And I would agree that the disagreement comes down to one party or the other making an epistemic mistake, though I obviously disagree about who's mistaken.)
I actually don't think the son-of-CDT agent, in this scenario, will take these sorts of non-causal correlations into account at all. (Modifying just yourself to take non-causal correlations into account won't cause you to achieve better outcomes here.)
In the twin prisoner's dilemma with son-of-CDT, both agents are following son-of-CDT and neither is following CDT (regardless of whether the fork happened before or after the switchover to son-of-CDT).
I think you can model the voting dilemma the same way, just with noise added because the level of correlation is imperfect and/or uncertain. Ten agents following the same decision procedure are trying to decide whether to stay home and watch a movie (which gives a small guaranteed benefit) or go to the polls (which costs them the utility of the movie, but gains them a larger utility iff the other nine agents go to the polls too). Ten FDT agents will vote in this case, if they know that the other agents will vote under similar conditions.
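A minimal sketch of the voting-dilemma framing in the previous paragraph, with made-up payoffs and perfect correlation between the ten agents assumed (the imperfect-correlation version just adds noise to these numbers):

```python
# Ten-agent voting dilemma, minimal sketch with assumed payoffs.
# Staying home is worth a small private benefit (the movie); if all
# ten agents go to the polls, each gets a much larger benefit.

MOVIE = 10    # assumed value of staying home and watching the movie
WIN = 1_000   # assumed per-agent value if all ten agents vote

def fdt_style_value(i_vote: bool) -> int:
    # The other nine agents run the same decision procedure, so under
    # perfect correlation they do whatever I do.
    all_vote = i_vote
    return (WIN if all_vote else 0) + (0 if i_vote else MOVIE)

def cdt_style_value(i_vote: bool, others_vote: bool) -> int:
    # CDT-style counterfactual: the other nine agents' choices are held fixed.
    all_vote = i_vote and others_vote
    return (WIN if all_vote else 0) + (0 if i_vote else MOVIE)

print(fdt_style_value(True), fdt_style_value(False))  # 1000 10
print(cdt_style_value(True, others_vote=False),
      cdt_style_value(False, others_vote=False))      # 0 10
```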
Decision theory is normative in the way that "murder is bad" or "improving aggregate welfare is good" is normative, i.e., it expresses an arbitrary terminal value of human beings.
Decision theory is normative in the way that game theory, probability theory, Boolean logic, the scientific method, etc. are normative (at least for beings that want accurate beliefs); or in the way that the rules and strategies of chess are normative (at least for beings that want to win at chess); or in the way that medical recommendations are normative (at least for beings that want to stay healthy).
[[Disclaimer: I'm not sure this will be useful, since it seems like most discussions that verge on meta-ethics end up with neither side properly understanding the other.]]
I think the kind of decision theory that philosophers tend to work on is typically explicitly described as "normative." (For example, the SEP article on decision theory is about "normative decision theory.") So when I'm talking about "academic decision theories" or "proposed criteria of rightness" I'm talking about normative theories. When I use the word "rational" I'm also referring to a normative property.
I don't think there's any very standard definition of what it means for something to be normative, maybe because it's often treated as something pretty close to a primitive concept, but a partial account is that a "normative theory" is a claim about what someone should do. At least this is what I have in mind. This is different from the second option you list (and I think the third one).
Some normative theories concern "ends." These are basically claims about what people should do, if they can freely choose outcomes. For example: A subjectivist theory might say that people should maximize the fulfillment of their own personal preferences (whatever they are). Whereas a hedonistic utilitarian theory might say that people should maximize total happiness. I'm not sure what the best terminology is, and think this choice is probably relatively non-standard, but let's label these "moral theories."
Some normative theories, including "decision theories," concern "means." These theories put aside the question of which ends people should pursue and instead focus on how people should respond to uncertainty about the results/implications of their actions. For example: Expected utility theory says that people should take whatever actions maximize expected fulfillment of the relevant ends. Risk-weighted expected utility theory (and other alternative theories) say different things. Typical versions of CDT and EDT flesh out expected utility theory in different ways to specify what the relevant measure of "expected fulfillment" is.
Moral theory and normative decision theory seem to me to have pretty much the same status. They are both bodies of theory that bear on what people should do. On some views, the division between them is more a matter of analytic convenience than anything else. For example, David Enoch, a prominent meta-ethicist, writes: "In fact, I think that for most purposes [the line between the moral and the non-moral] is not a line worth worrying about. The distinction within the normative between the moral and the non-moral seems to me to be shallow compared to the distinction between the normative and the non-normative" (Taking Morality Seriously, 86).
One way to think of moral theories and normative decision theories is as two components that fit together to form more fully specified theories about what people should do. Moral theories describe the ends people should pursue; given these ends, decision theories then describe what actions people should take when in states of uncertainty. To illustrate, two examples of more complete normative theories that combine moral and decision-theoretic components would be: "You should take whatever action would in expectation cause the largest increase in the fulfillment of your preferences" and "You should take whatever action would, if you took it, lead you to anticipate the largest expected amount of future happiness in the world." The first is subjectivism combined with CDT, while the second is total view hedonistic utilitarianism combined with EDT.
(On this conception, a moral theory is not a description of "an arbitrary terminal value of human beings." Decision theory here also is not "the study of which decision-making methods humans happen to terminally prefer to employ." These are both theories about what people should do, rather than theories about what people's preferences are.)
Normativity is obviously pretty often regarded as a spooky or insufficiently explained thing. So a plausible position is normative anti-realism: It might be the case that no normative claims are true, either because theyâre all false or because theyâre not even well-formed enough to take on truth values. If normative anti-realism is true, then one thing this means is that the philosophical decision theory community is mostly focused on a question that doesnât really have an answer.
In the twin prisonerâs dilemma with son-of-CDT, both agents are following son-of-CDT and neither is following CDT (regardless of whether the fork happened before or after the switchover to son-of-CDT).
If Iâm someone with a twin and Iâm implementing P_CDT, I still donât think I will choose to modify myself to cooperate in twin prisonerâs dilemmas. The reason is that modifying myself wonât cause my twin to cooperate; it will only cause me to cooperate, lowering the utility I receive.
(The fact that P_CDT agents won't modify themselves to cooperate with their twins could of course be interpreted as a mark against R_CDT.)
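For concreteness, a toy payoff table makes the point (the numbers are standard illustrative prisoner's-dilemma payoffs of my own, not anything from this thread): holding the twin's move fixed, switching my own move to cooperation only lowers my payoff, even though two correlated cooperators would each do better than two defectors.
```python
# Standard illustrative twin prisoner's dilemma payoffs (assumed numbers).
PAYOFF = {  # (my_move, twin_move) -> my utility
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

# P_CDT-style reasoning: treat the twin's move as fixed, independent of mine.
for twin in ("C", "D"):
    print(f"twin plays {twin}: cooperate -> {PAYOFF[('C', twin)]}, defect -> {PAYOFF[('D', twin)]}")
# Defecting is better against either fixed move, so modifying only myself to
# cooperate just lowers my payoff.

# But if both twins' moves change together (the correlation P_CDT ignores):
print("both cooperate:", PAYOFF[("C", "C")], "vs both defect:", PAYOFF[("D", "D")])  # 3 vs 1
```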
I appreciate you taking the time to lay out these background points, and it does help me better understand your position, Ben; thanks!
If normative anti-realism is true, then one thing this means is that the philosophical decision theory community is mostly focused on a question that doesnât really have an answer.
Some ancient Greeks thought that the planets were intelligent beings; yet many of the Greeksâ astronomical observations, and some of their theories and predictive tools, were still true and useful.
I think that terms like ânormativeâ and ârationalâ are underdefined, so the question of realism about them is underdefined (cf. Luke Muehlhauserâs pluralistic moral reductionism).
I would say that (1) some philosophers use ârationalâ in a very human-centric way, which is fine as long as itâs done consistently; (2) others have a much more thin conception of ârationalâ, such as âtending to maximize utilityâ; and (3) still others want to have their cake and eat it too, building in a lot of human-value-specific content to their notion of ârationalityâ, but then treating this conception as though it had the same level of simplicity, naturalness, and objectivity as 2.
I think that type-1, type-2, and type-3 decision theorists have all contributed valuable AI-relevant conceptual progress in the past (most obviously, by formulating Newcomb's problem, EDT, and CDT), and I think all three could do more of the same in the future. I think the type-3 decision theorists are making a mistake, but often more in the fashion of an ancient astronomer who's accumulating useful and real knowledge but happens to have some false side-beliefs about the object of study, not in the fashion of a theologian whose entire object of study is illusory. (And not in the fashion of a developmental psychologist or historian whose subject of study is too human-centric to directly bear on game theory, AI, etc.)
I'd expect type-2 decision theorists to tend to be interested in more AI-relevant things than type-1 decision theorists, but on the whole I think the flavor of decision theory as a field has ended up being more type-2/3 than type-1. (And in this case, even type-1 analyses of "rationality" can be helpful for bringing various widespread background assumptions to light.)
If Iâm someone with a twin and Iâm implementing P_CDT, I still donât think I will choose to modify myself to cooperate in twin prisonerâs dilemmas. The reason is that modifying myself wonât cause my twin to cooperate; it will only cause me to cooperate, lowering the utility I receive.
This is true if your twin was copied from you in the past. If your twin will be copied from you in the future, however, then you can indeed cause your twin to cooperate, assuming you have the ability to modify your own future decision-making so as to follow son-of-CDTâs prescriptions from now on.
Making the commitment to always follow son-of-CDT is an action you can take; the mechanistic causal consequence of this action is that your future brain and any physical systems that are made into copies of your brain in the future will behave in certain systematic ways. So from your present perspective (as a CDT agent), you can causally control future copies of yourself, as long as the act of copying hasnât happened yet.
(And yes, by the time you actually end up in the prisonerâs dilemma, your future self will no longer be able to causally affect your copy. But this is irrelevant from the perspective of present-you; to follow CDTâs prescriptions, present-you just needs to pick the action that you currently judge will have the best consequences, even if that means binding your future self to take actions contrary to CDTâs future prescriptions.)
(If it helps, donât think of the copy of you as âyouâ: just think of it as another environmental process you can influence. CDT prescribes taking actions that change the behavior of future copies of yourself in useful ways, for the same reason CDT prescribes actions that change the future course of other physical processes.)
I appreciate you taking the time to lay out these background points, and it does help me better understand your position, Ben; thanks!
Thank you for taking the time to respond as well! :)
I think that terms like ânormativeâ and ârationalâ are underdefined, so the question of realism about them is underdefined (cf. Luke Muehlhauserâs pluralistic moral reductionism).
I would say that (1) some philosophers use ârationalâ in a very human-centric way, which is fine as long as itâs done consistently; (2) others have a much more thin conception of ârationalâ, such as âtending to maximize utilityâ; and (3) still others want to have their cake and eat it too, building in a lot of human-value-specific content to their notion of ârationalityâ, but then treating this conception as though it had the same level of simplicity, naturalness, and objectivity as 2.
Iâm not positive I understand what (1) and (3) are referring to here, but I would say that thereâs also at least a fourth way that philosophers often use the word ârationalâ (which is also the main way I use the word ârational.â) This is to refer to an irreducibly normative concept.
The basic thought here is that not every concept can be usefully described in terms of more primitive concepts (i.e. âreducedâ). As a close analogy, a dictionary cannot give useful non-circular definitions of every possible wordâit requires the reader to have a pre-existing understanding of some foundational set of words. As a wonkier analogy, if we think of the space of possible concepts as a sort of vector space, then we sort of require an initial âbasisâ of primitive concepts that we use to describe the rest of the concepts.
Some examples of concepts that are arguably irreducible are "truth," "set," "property," "physical," "existence," and "point." Insofar as we can describe these concepts in terms of slightly more primitive ones, the descriptions will typically fail to be very useful or informative and we will typically struggle to break the slightly more primitive ones down any further.
To focus on the example of âtruth,â some people have tried to reduce the concept substantially. Some people have argued, for example, that when someone says that âX is trueâ what they really mean or should mean is âI personally believe Xâ or âbelieving X is good for you.â But I think these suggested reductions pretty obviously donât entirely capture what people mean when they say âX is true.â The phrase âX is trueâ also has an important meaning that is not amenable to this sort of reduction.
[[EDIT: "Truth" may be a bad example, since it's relatively controversial and since I'm pretty much totally unfamiliar with work on the philosophy of truth. But insofar as any concepts seem irreducible to you in this sense, or you buy the more general argument that some concepts will necessarily be irreducible, the particular choice of example used here isn't essential to the overall point.]]
Some philosophers also employ normative concepts that they say cannot be reduced in terms of non-normative (e.g. psychological) properties. These concepts are said to be irreducibly normative.
For example, here is Parfit on the concept of a normative reason (OWM, p. 1):
We can have reasons to believe something, to do something, to have some desire or aim, and to have many other attitudes and emotions, such as fear, regret, and hope. Reasons are given by facts, such as the fact that someoneâs finger-prints are on some gun, or that calling an ambulance would save someoneâs life.
It is hard to explain the concept of a reason, or what the phrase âa reasonâ means. Facts give us reasons, we might say, when they count in favour of our having some attitude, or our acting in some way. But âcounts in favour ofâ means roughly âgives a reason forâ. Like some other fundamental concepts, such as those involved in our thoughts about time, consciousness, and possibility, the concept of a reason is indefinable in the sense that it cannot be helpfully explained merely by using words. We must explain such concepts in a different way, by getting people to think thoughts that use these concepts. One example is the thought that we always have a reason to want to avoid being in agony.
When someone says that a concept they are using is irreducible, this is obviously some reason for suspicion. A natural suspicion is that the real explanation for why they canât give a useful description is that the concept is seriously muddled or fails to grip onto anything in the real world. For example, whether this is fair or not, I have this sort of suspicion about the concept of âdaoâ in daoist philosophy.
But, again, it will necessarily be the case that some useful and valid concepts are irreducible. So we should sometimes take evocations of irreducible concepts seriously. A concept that is mostly undefined is not always problematically âunderdefined.â
When I talk about "normative anti-realism," I mostly have in mind the position that claims evoking irreducibly normative concepts are never true (either because these claims are all false or because they don't even have truth values). For example: Insofar as the word "should" is being used in an irreducibly normative sense, there is nothing that anyone "should" do.
[[Worth noting, though: The term ânormative realismâ is sometimes given a broader definition than the one Iâve sketched here. In particular, it often also includes a position known as âanalytic naturalist realismâ that denies the relevance of irreducibly normative concepts. I personally feel I understand this position less well and I think sometimes waffle between using the broader and narrower definition of ânormative realism.â I also more generally want to stress that not everyone who makes claims about âcriterion of rightnessâ or employs other seemingly normative language is actually a normative realist in the narrow or even broad sense; what Iâm doing here is just sketching one common especially salient perspective.]]
One motivation for evoking irreducibly normative concepts is the observation that, in the context of certain discussions, it's not obvious that there's any close-to-sensible way to reduce the seemingly normative concepts that are being used.
For example, suppose we follow a suggestion once made by Eliezer to reduce the concept of "a rational choice" to the concept of "a winning choice" (or, in line with the type-2 conception you mention, a "utility-maximizing choice"). It seems difficult to make sense of a lot of basic claims about rationality if we use this reduction, and other obvious alternative reductions don't seem to fare much better. To mostly quote from a comment I made elsewhere:
Suppose we want to claim that it is rational to try to maximize the expected winning (i.e. the expected fulfillment of your preferences). Due to randomness/uncertainty, though, an agent that tries to maximize expected "winning" won't necessarily win compared to an agent that does something else. If I spend a dollar on a lottery ticket with a one-in-a-billion chance of netting me a billion-and-one "win points," then I'm taking the choice that maximizes expected winning but I'm also almost certain to lose. So we can't treat "the rational action" as synonymous with "the action taken by an agent that wins."
We can try to patch up the issue here by reducing "the rational action" to "the action that is consistent with the VNM axioms," but in fact either action in this case is consistent with the VNM axioms. The VNM axioms don't imply that an agent must maximize the expected desirability of outcomes. They just imply that an agent must maximize the expected value of some function. It is totally consistent with the axioms, for example, to be effectively risk averse and instead maximize the expected square root of desirability. If we try to define "the action I should take" in this way, then the claim "it is rational to act consistently with the VNM axioms" also becomes an empty tautology.
We could of course instead reduce "the rational action" to "the action that maximizes expected winning." But now, of course, the claim "it is rational to maximize expected winning" no longer has any substantive content. When we make this claim, do we really mean to be stating an empty tautology? And do we really consider it trivially incoherent to wonder -- e.g. in a Pascal's mugging scenario -- whether it might be "rational" to take an action other than the one that maximizes expected winning? If not, then this reduction is a very poor fit too.
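Putting rough numbers on the lottery and VNM points above (the figures are just the ones implied by the example, plus an arbitrary square-root utility function of my own for illustration): an expected-winning maximizer and a VNM-consistent risk-averse agent can choose differently, and both are maximizing the expectation of some function.
```python
import math

# Lottery-ticket example from above: one-in-a-billion chance of a
# billion-and-one "win points", at a cost of one "win point".
p_win, prize, cost = 1e-9, 1e9 + 1, 1.0

# Agent A maximizes expected winning (utility = win points).
ev_buy = p_win * prize          # ~1.000000001
ev_keep = cost                  # 1.0
print("A buys the ticket:", ev_buy > ev_keep)   # True, yet A almost surely loses

# Agent B maximizes the expected *square root* of win points -- an arbitrary
# risk-averse utility function, still fully consistent with the VNM axioms.
eu_buy = p_win * math.sqrt(prize)   # ~3.2e-5
eu_keep = math.sqrt(cost)           # 1.0
print("B buys the ticket:", eu_buy > eu_keep)   # False
```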
It ultimately seems hard, at least to me, to make non-vacuous true claims about what it's "rational" to do without evoking a non-reducible notion of "rationality." If we are evoking a non-reducible notion of rationality, then it makes sense that we can't provide a satisfying reduction.
At the same time, though, I do think there are also really good and hard-to-counter epistemological objections to the existence of irreducibly normative properties (e.g. the objection described in this paper). You might also find the difficulty of reducing normative concepts a lot less obvious-seeming or problematic than I do. You might think, for example, that the difficulty of reducing "rationality" is less like the difficulty of reducing "truth" (which IMO mainly reflects the fact that truth is an important primitive concept) and more like the difficulty of defining the word "soup" in a way that perfectly matches our intuitive judgments about what counts as "soup" (which IMO mainly reflects the fact that "soup" is a high-dimensional concept). So I definitely don't want to say normative realism is obviously or even probably right.
I mainly just want to communicate the sort of thing that I think a decent chunk of philosophers have in mind when they talk about a "rational decision" or a "criterion of rightness." Although, philosophy being philosophy, plenty of people of course have plenty of different things in mind.
So, as an experiment, Iâm going to be a very obstinate reductionist in this comment. Iâll insist that a lot of these hard-seeming concepts arenât so hard.
Many of them are complicated, in the fashion of "knowledge" -- they admit an endless variety of edge cases and exceptions -- but these complications are quirks of human cognition and language rather than deep insights into ultimate metaphysical reality. And where there's a simple core we can point to, that core generally isn't mysterious.
It may be inconvenient to paraphrase the term away (e.g., because it packages together several distinct things in a nice concise way, or has important emotional connotations, or does important speech-act work like encouraging a behavior). But when I say it âisnât mysteriousâ, I mean itâs pretty easy to see how the concept can crop up in human thought even if it doesnât belong on the short list of deep fundamental cosmic structure terms.
I would say that thereâs also at least a fourth way that philosophers often use the word ârational,â which is also the main way I use the word ârational.â This is to refer to an irreducibly normative concept.
Why is this a fourth way? My natural response is to say that normativity itself is either a messy, parochial human concept (like "love," "knowledge," "France"), or it's not (in which case it goes in bucket 2).
Some examples of concepts that are arguably irreducible are "truth," "set," "property," "physical," "existence," and "point."
Picking on the concept here that seems like the odd one out to me: I feel confident that there isnât a cosmic law (of nature, or of metaphysics, etc.) that includes âtruthâ as a primitive (unless the list of primitives is incomprehensibly long). I could see an argument for concepts like âintentionality/âreferenceâ, âassertionâ, or âstate of affairsâ, though the former two strike me as easy to explain in simple physical terms.
Mundane empirical "truth" seems completely straightforward. Then there's the truth of sentences like "Frodo is a hobbit", "2+2=4", "I could have been the president", "Hamburgers are more delicious than battery acid"... Some of these are easier or harder to make sense of in the naive correspondence model, but regardless, it seems clear that our colloquial use of the word "true" to refer to all these different statements is pre-philosophical, and doesn't reflect anything deeper than that "each of these sentences at least superficially looks like it's asserting some state of affairs, and each sentence satisfies the conventional assertion-conditions of our linguistic community".
I think that philosophers are really good at drilling down on a lot of interesting details and creative models for how we can try to tie these disparate speech-acts together. But I think thereâs also a common failure mode in philosophy of treating these questions as deeper, more mysterious, or more joint-carving than the facts warrant. Just because you can argue about the truthmakers of âFrodo is a hobbitâ doesnât mean youâre learning something deep about the universe (or even something particularly deep about human cognition) in the process.
[Parfit:] It is hard to explain the concept of a reason, or what the phrase âa reasonâ means. Facts give us reasons, we might say, when they count in favour of our having some attitude, or our acting in some way. But âcounts in favour ofâ means roughly âgives a reason forâ. Like some other fundamental concepts, such as those involved in our thoughts about time, consciousness, and possibility, the concept of a reason is indefinable in the sense that it cannot be helpfully explained merely by using words.
Suppose I build a robot that updates hypotheses based on observations, then selects actions that its hypotheses suggest will help it best achieve some goal. When the robot is deciding which hypotheses to put more confidence in based on an observation, we can imagine it thinking, "To what extent is observation o a [WORD] to believe hypothesis h?" When the robot is deciding whether it assigns enough probability to h to choose an action a, we can imagine it thinking, "To what extent is P(h)=0.7 a [WORD] to choose action a?" As a shorthand, when observation o updates a hypothesis h that favors an action a, the robot can also ask to what extent o itself is a [WORD] to choose a.
When two robots meet, we can moreover add that they negotiate a joint "compromise" goal that allows them to work together rather than fight each other for resources. In communicating with each other, they then start also using "[WORD]" where an action is being evaluated relative to the joint goal, not just the robot's original goal.
Thus when Robot A tells Robot B "I assign probability 90% to 'it's noon', which is [WORD] to have lunch", A may be trying to communicate that A wants to eat, or that A thinks eating will serve A and B's joint goal. (This gets even messier if the robots have an incentive to obfuscate which actions and action-recommendations are motivated by the personal goal vs. the joint goal.)
If you decide to relabel "[WORD]" as "reason", I claim that this captures a decent chunk of how people use the phrase "a reason". "Reason" is a suitcase word, but that doesn't mean there are no similarities between e.g. "data my goals endorse using to adjust the probability of a given hypothesis" and "probabilities-of-hypotheses my goals endorse using to select an action", or that the similarity is mysterious and ineffable.
(I recognize that the above story leaves out a lot of important and interesting stuff. Though past a certain point, I think the details will start to become Gettier-case nitpicks, as with most concepts.)
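Here is one deliberately bare-bones way to cash out the single-robot part of this sketch in code; every name and number here (Robot, word_strength, the lunch example) is an illustrative placeholder of mine, with word_strength standing in for "[WORD]":
```python
# A minimal sketch of the robot described above. All names and numbers are
# illustrative placeholders; "word_strength" stands in for "[WORD]".
class Robot:
    def __init__(self, priors, goal_value):
        self.p = dict(priors)            # P(h) for each hypothesis h
        self.goal_value = goal_value     # goal_value[(h, a)] = value of action a if h holds

    def update(self, likelihood):
        # Bayes update: how much an observation is a [WORD] to believe each h
        # is captured by its likelihood under h.
        total = sum(self.p[h] * likelihood[h] for h in self.p)
        for h in self.p:
            self.p[h] = self.p[h] * likelihood[h] / total

    def word_strength(self, action):
        # How much the robot's current credences are a [WORD] to choose `action`:
        # the action's expected value relative to the robot's goal.
        return sum(self.p[h] * self.goal_value[(h, action)] for h in self.p)

    def choose(self, actions):
        return max(actions, key=self.word_strength)

robot = Robot(
    priors={"noon": 0.5, "not_noon": 0.5},
    goal_value={("noon", "lunch"): 1, ("noon", "wait"): 0,
                ("not_noon", "lunch"): -1, ("not_noon", "wait"): 0},
)
robot.update(likelihood={"noon": 0.9, "not_noon": 0.1})  # e.g. the sun looks high
print(robot.choose(["lunch", "wait"]))                   # -> "lunch"
```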
For example, suppose we follow a suggestion once made by Eliezer to reduce the concept of âa rational choiceâ to the concept of âa winning choiceâ (or, in line with the type-2 conception you mention, a âutility-maximizing choiceâ).
That essay isnât trying to âreduceâ the term ârationalityâ in the sense of taking a pre-existing word and unpacking or translating it. The essay is saying that what matters is utility, and if a human being gets too invested in verbal definitions of âwhat the right thing to do isâ, they risk losing sight of the thing they actually care about and were originally in the game to try to achieve (i.e., their utility).
Therefore: if you're going to use words like "rationality", make sure that the words in question won't cause you to shoot yourself in the foot and take actions that will end up costing you utility (e.g., costing human lives, costing years of averted suffering, costing money, costing anything or everything). And if you aren't using "rationality" in a safe "nailed-to-utility" way, make sure that you're willing to turn on a dime and stop being "rational" the second your conception of rationality starts telling you to throw away value.
It ultimately seems hard, at least to me, to make non-vacuous true claims about what it's "rational" to do without evoking a non-reducible notion of "rationality."
âRationalityâ is a suitcase word. It refers to lots of different things. On LessWrong, examples include not just â(systematized) winningâ but (as noted in the essay) âBayesian reasoningâ, or in Rationality: Appreciating Cognitive Algorithms, âcognitive algorithms or mental processes that systematically produce belief-accuracy or goal-achievementâ. In philosophy, the list is a lot longer.
The common denominator seems to largely be "something something reasoning / deliberation" plus (as you note) "something something normativity / desirability / recommendedness / requiredness".
The idea of ânormativityâ doesnât currently seem that mysterious to me either, though youâre welcome to provide perplexing examples. My initial take is that it seems to be a suitcase word containing a bunch of ideas tied to:
Goals/preferences/values, especially overridingly strong ones.
Encouraged, endorsed, mandated, or praised conduct.
Encouraging, endorsing, mandating, and praising are speech-acts that seem very central to how humans perceive and intervene on social situations; and social situations seem pretty central to human cognition overall. So I donât think itâs particularly surprising if words associated with such loaded ideas would have fairly distinctive connotations and seem to resist reduction, especially reduction that neglects the pragmatic dimensions of human communication and only considers the semantic dimension.
I may write up more object-level thoughts here, because this is interesting, but I just wanted to quickly emphasize the upshot that initially motivated me to write up this explanation.
(I donât really want to argue here that non-naturalist or non-analytic naturalist normative realism of the sort Iâve just described is actually a correct view; I mainly wanted to give a rough sense of what the view consists of and what leads people to it. It may well be the case that the view is wrong, because all true normative-seeming claims are in principle reducible to claims about things like preferences. I think the comments youâve just made cover some reasons to suspect this.)
The key point is just that when these philosophers say that âAction X is rational,â they are explicitly reporting that they do not mean âAction X suits my terminal preferencesâ or âAction X would be taken by an agent following a policy that maximizes lifetime utilityâ or any other such reduction.
I think that when people are very insistent that they don't mean something by their statements, it makes sense to believe them. This implies that the question they are discussing -- "What are the necessary and sufficient conditions that make a decision rational?" -- is distinct from questions like "What decision would an agent that tends to win take?" or "What decision procedure suits my terminal preferences?"
It may be the case that the question they are asking is confused or insensible -- because any sensible question would be reducible -- but it's in any case different. So I think it's a mistake to interpret at least these philosophers' discussions of "decision theories" or "criteria of rightness" as though they were discussions of things like terminal preferences or winning strategies. And it doesn't seem to me like the answer to the question they're asking (if it has an answer) would likely imply anything much about things like terminal preferences or winning strategies.
[[NOTE: Plenty of decision theorists are not non-naturalist or non-analytic naturalist realists, though. Itâs less clear to me how related or unrelated the thing theyâre talking about is to issues of interest to MIRI. I think that the conception of rationality Iâm discussing here mainly just presents an especially clear case.]]
This seems like it should instead be a 2x2 grid: something can be either normative or non-normative, and if it's normative, it can be either an algorithm/procedure that's being recommended, or a criterion of rightness like "a decision is rational iff taking it would cause the largest expected increase in value" (which we can perhaps think of as generalizing over a set of algorithms, and saying all the algorithms in a certain set are "normative" or "endorsed").
Just on this point: I think youâre right I may be slightly glossing over certain distinctions, but I might still draw them slightly differently (rather than doing a 2x2 grid). Some different things one might talk about in this context:
Decisions
Decision procedures
The decision procedure that is optimal with regard to some given metric (e.g. the decision procedure that maximizes expected lifetime utility for some particular way of calculating expected utility)
The set of properties that makes a decision rational ("criterion of rightness")
A claim about what the criterion of rightness is ("normative decision theory")
The decision procedure that it would be rational to decide to build into an agent (as implied by the criterion of rightness)
(4), (5), and (6) have to do with normative issues, while (1), (2), and (3) can be discussed without getting into normativity.
My current-although-not-firmly-held view is also that (6) probably isn't very sensitive to what the criterion of rightness is, so in practice it can be reasoned about without going too deep into the weeds thinking about competing normative decision theories.
Lightly editing some thoughts I previously wrote up on this issue, somewhat in line with Paulâs comments:
Rationalist community writing on decision theory sometimes seems to switch back and forth between describing decision theories as normative principles (which I believe is how academic philosophers typically describe decision theories) and as algorithms to be used (which seems to be inconsistent with how academic philosophers typically describe decision theories). I think this tendency to switch back and forth between describing decision theories in these two distinct ways can be seen both in papers proposing new decision theories and in online discussions. I also think this switching tendency can make things pretty confusing. Although it makes sense to discuss how an algorithm âperformsâ when âimplemented,â once we specify a sufficiently precise performance metric, it does not seem to me to make sense to discuss the performance of a normative principle. I think the tendency to blur the distinction between algorithms and normative principlesâor, as Will MacAskill puts it in his recent and similar critique, between âdecision proceduresâ and âcriteria of rightnessââpartly explains why proponents of FDT and other new decision theories have not been able to get much traction with academic decision theorists.
For example, causal decision theorists are well aware that people who always take the actions that CDT says they should take will tend to fare less well in Newcomb scenarios than people who always take the actions that EDT says they should take. Causal decision theorists are also well aware that there are some scenarios -- for example, a Newcomb scenario with a perfect predictor and the option to get brain surgery to pre-commit yourself to one-boxing -- in which there is no available sequence of actions such that CDT says you should take each of the actions in the sequence. If you ask a causal decision theorist what sort of algorithm you should (according to CDT) put into an AI system that will live in a world full of Newcomb scenarios, if the AI system won't have the opportunity to self-modify, then I think it's safe to say a causal decision theorist won't tell you to put in an algorithm that only produces actions that CDT says it should take. This tells me that we really can't fluidly switch back and forth between making claims about the correctness of normative principles and claims about the performance of algorithms, as though there were an accepted one-to-one mapping between these two sorts of claims. Insofar as rationalist writing on decision theory tends to do this sort of switching, I suspect that it contributes to confusion and dismissiveness on the part of many academic readers.
For more on this divide/points of disagreement, see Will MacAskill's essay on the alignment forum (with responses from MIRI researchers and others).
A third point of tension is the communityâs engagement with normative decision theory research. Different normative decision theories pick out different necessary conditions for an action to be the one that a given person should take, with a focus on how one should respond to uncertainty (rather than on what ends one should pursue).
A typical version of CDT says that the action you should take at a particular point in time is the one that would cause the largest expected increase in value (under some particular framework for evaluating causation). A typical version of EDT says that the action you should take at a particular point in time is the one that would, once you take it, allow you to rationally expect the most value. There are also alternative versions of these theories -- for instance, versions using risk-weighted expected value maximization or the criterion of stochastic dominance -- that break from the use of pure expected value.
Iâve pretty frequently seen it argued within the community (e.g. in the papers âCheating Death in Damascusâ and âFunctional Decision Theoryâ) that CDT and EDT are not âcorrectâ and that some other new theory such as functional decision theory is. But if anti-realism is true, then no decision theory is correct.
Eliezer Yudkowsky's influential early writing on decision theory seems to me to take an anti-realist stance. It suggests that we can only ask meaningful questions about the effects and correlates of decisions. For example, in the context of the Newcomb thought experiment, we can ask whether one-boxing is correlated with winning more money. But, it suggests, we cannot take a step further and ask what these effects and correlations imply about what it is "reasonable" for an agent to do (i.e. what they should do). This question -- the one that normative decision theory research, as I understand it, is generally about -- is seemingly dismissed as vacuous.
If this apparently anti-realist stance is widely held, then I don't understand why the community engages so heavily with normative decision theory research or why it takes part in discussions about which decision theory is "correct." It strikes me as a bit like an atheist enthusiastically following theological debates about which god is the true god. But I'm mostly just confused here.
In the 80k podcast episode with Hilary Greaves she talks about decision theory and says:
I understand from that that there is little engagement of MIRI with the academia. What is more troubling for me is that it seems that the cases for the major decision theories are looked upon with skepticism from academic experts.
Do you think that is really the case? How do you respond to that? It would personally feel much better if I knew that there are some academic decision theorists who are exited about your research, or a compelling explanation of a systemic failure that explains this which can be applied to MIRIâs work specifically.
[The transition to non-disclosed research happend after the interview]
Yeah, this is an interesting question.
Iâm not really sure whatâs going on here. When I read critiques of MIRI-style decision theories (eg from Will or from Wolfgang Schwartz), I feel very unpersuaded by them. This leaves me in a situation where my inside views disagree with the views of the most obvious class of experts, which is always tricky.
When I read those criticisms by Will MacAskill and Wolfgang Schwartz, I feel like I understand their criticisms and find them unpersuasive, as opposed to not understanding their criticisms. Also, I feel like they donât understand some of the arguments and motivations for FDT. I feel a lot better disagreeing with experts when I think I understand their arguments and when I think I can see particular mistakes that theyâre making. (Itâs not obvious that this is the right epistemic strategy, for reasons well articulated by Gregory Lewis here.)
Paulâs comments on this resolved some of my concerns here. He thinks that the disagreement is mostly about what questions decision theory should be answering. He thinks that the updateless decision theories are obviously more suitable to building AI than eg CDT or EDT.
I think itâs plausible that Paul is being overly charitable to decision theorists; Iâd love to hear whether skeptics of updateless decision theories actually agree that you shouldnât build a CDT agent. (Also, when you ask a CDT agent what kind of decision theory it wants to program into an AI, you get a class of decision theory called âSon of CDTâ, which isnât UDT.)
I think thereâs a systematic pattern where philosophers end up being pretty ineffective at answering the philosophy questions that I care about (based eg on my experience seeing the EA community punch so far above its weight thinking about ethics), and so Iâm not very surprised if it turns out that in this specific case, the philosophy community has priorities that donât match mine.
I think thereâs also a pattern where philosophers have some basic disagreements with me, eg about functionalism and how much math intuitions should feed into our philosophical intuitions. This decision theory disagreement reminds me of that disagreement.
Schwartz has a couple of complaints that the FDT paper doesnât engage properly with the mainstream philosophy literature (eg the Justin Fisher and the David Gauthier papers). My guess is that these complaints are completely legitimate.
On his blog, Scott Aaronson does a good job of describing what I think might be a key difference here:
My guess is that the factor which explains academic unenthusiasm for our work is that decision theorists are more of the "tables and chairs are real" school than the "equations are real" school -- they aren't as oriented by the question of "how do I write down a decision theory which would have good outcomes if I created an intelligent agent which used it", and they don't have as much of an intuition as I do that that kind of question is fundamentally simple and should have a lot of weight in your choices about how to think about reality.
---
I am really very curious to hear what people (eg edoarad) think of this answer.
FWIW, I could probably be described as a âskepticâ of updateless decision theories; Iâm pretty sympathetic to CDT. But I also donât think we should build AI systems that consistently take the actions recommended by CDT. I know at least a few other people who favor CDT, but again (although small sample size) I donât think any of them advocate for designing AI systems that consistently act in accordance with CDT.
I think the main thing thatâs going on here is that academic decision theorists are primarily interested in normative principles. Theyâre mostly asking the question: âWhat criterion determines whether or not a decision is ârationalâ?â For example, standard CDT claims that an action is rational only if itâs the action that can be expected to cause the largest increase in value.
On the other hand, AI safety researchers seem to be mainly interested in a different question: âWhat sort of algorithm would it be rational for us to build into an AI system?â The first question doesnât seem very relevant to the second one, since the different criteria of rationality proposed by academic decision theorists converge in most cases. For example: No matter whether CDT, EDT, or UDT is correct, it will not typically be rational to build a two-boxing AI system. It seems to me, then, that itâs probably not very pressing for the AI safety community to think about the first question or engage with the academic decision theory literature.
At the same time, though, AI safety writing on decision theory sometimes seems to ignore (or implicitly deny?) the distinction between these two questions. For example: The FDT paper seems to be pitched at philosophers and has an abstract that frames the paper as an exploration of ânormative principles.â I think this understandably leads philosophers to interpret FDT as an attempt to answer the first question and to criticize it on those grounds.
I would go further and say that (so far as I understand the field) most academic decisions theorists arenât at all oriented by this question. I think the question theyâre asking is again mostly independent. Iâm also not sure it would even make sense to talk about âusingâ a âdecision theoryâ in this context, insofar as weâre conceptualizing decision theories the way most academic decision theorists do (as normative principles). Talking about âusingâ CDT in this context is sort of like talking about âusingâ deontology.
[[EDIT: See also this short post for a better description of the distinction between a "criterion of rightness" and a "decision procedure." Another way to express my impression of what's going on is that academic decision theorists are typically talking about criteria of rightness and AI safety decision theorists are typically (but not always) talking about decision procedures.]]
The comments here have been very ecumenical, but I'd like to propose a different account of the philosophy/AI divide on decision theory:
1. "What makes a decision 'good' if the decision happens inside an AI?" and "What makes a decision 'good' if the decision happens inside a brain?" aren't orthogonal questions, or even all that different; they're two different ways of posing the same question.
MIRIâs AI work is properly thought of as part of the âsuccess-first decision theoryâ approach in academic decision theory, described by Greene (2018) (who also cites past proponents of this way of doing decision theory):
The FDT paper does a poor job of contextualizing itself because it was written by AI researchers who are less well-versed with the philosophical literature.
MIRIâs work is both advocating a particular solution to the question âwhat kind of decision theory satisfies the âsuccessâ criterion?â, and lending some additional support to the claim that âsuccess-firstâ is a coherent and reasonable criterion for decision theorists to orient towards. (In a world without ideas like UDT, it was harder to argue that we should try to reduce decision theory to âwhat decision-making approach yields the best utility?â, since neither CDT nor EDT strictly outperforms the other; whereas thereâs a strong case that UDT does strictly outperform both CDT and EDT, to the extent itâs possible for any decision theory to strictly outperform another; though there may be even-better approaches.)
You can go with Paul and say that a lot of these distinctions are semantic rather than substantiveâthat there isnât a true, ultimate, objective answer to the question of whether we should evaluate decision theories by whether theyâre successful, vs. some other criterion. But dissolving contentious arguments and showing why theyâre merely verbal is itself a hallmark of analytic philosophy, so this doesnât do anything to make me think that these issues arenât the proper province of academic decision theory.
2. Rather than operating in separate magisteria, people like Wei Dai are making contrary claims about how humans should make decisions. This is easiest to see in contexts where a future technology comes along: if whole-brain emulation were developed tomorrow and it was suddenly trivial to put CDT proponents in literal twin prisoner's dilemmas, the CDT recommendation to defect (two-box, etc.) suddenly makes a very obvious and real difference.
I claim (as someone who thinks UDT/FDT is correct) that the reason it tends to be helpful to think about advanced technologies is that it draws out the violations of naturalism that are often implicit in how we talk about human reasoning. Our native way of thinking about concepts like "control," "choice," and "counterfactual" tends to be confused, and bringing in things like predictors and copies of our reasoning draws out those confusions in much the same way that sci-fi thought experiments and the development of new technologies have repeatedly helped clarify confused thinking in philosophy of consciousness, philosophy of personal identity, philosophy of computation, etc.
3. Quoting Paul:
I would argue that most philosophers who feel "trapped by rationality" or "unable to stop doing what's 'right,' even though they know they 'should,'" could in fact escape the trap if they saw the flaws in whatever reasoning process led them to their current idea of "rationality" in the first place. I think a lot of people are reasoning their way into making worse decisions (at least in the future/hypothetical scenarios noted above, though I would be very surprised if correct decision-theoretic views had literally no implications for everyday life today) due to object-level misconceptions about the prescriptions and flaws of different decision theories.
And all of this strikes me as very much the bread and butter of analytic philosophy. Philosophers unpack and critique the implicit assumptions in different ways of modeling the world (e.g., âof course I can âcontrolâ physical outcomes but canât âcontrolâ mathematical factsâ, or âof course I can just immediately tell that Iâm in the âreal worldâ; a simulation of me isnât me, or wouldnât be conscious, etc.â). I think MIRI just isnât very good at dialoguing with philosophers, and has had too many competing priorities to put the amount of effort into a scholarly dialogue that I wish were being made.
4. There will obviously be innumerable practical differences between the first AGI systems and human decision-makers. However, putting a huge amount of philosophical weight on this distinction will tend to violate naturalism: ceteris paribus, changing whether you run a cognitive process in carbon or in silicon doesnât change whether the process is doing the right thing or working correctly.
E.g., the rules of arithmetic are the same for humans and calculators, even though we donât use identical algorithms to answer particular questions. Humans tend to correctly treat calculators naturalistically: we often think of them as an extension of our own brains and reasoning, we freely switch back and forth between running a needed computation in our own brain vs. in a machine, etc. Running a decision-making algorithm in your brain vs. in an AI shouldnât be fundamentally different, I claim.
5. For similar reasons, a naturalistic way of thinking about the task âdelegating a decision-making process to a reasoner outside your own brainâ will itself not draw a deep philosophical distinction between âa human building an AI to solve a problemâ and âan AI building a second AI to solve a problemâ or for that matter âan agent learning over time and refining its own reasoning process so it can âdelegateâ to its future selfâ.
There will obviously be practical differences, but there will also be practical differences between two different AI designs. We donât assume that switching to a different design within AI means that the background rules of decision theory (or arithmetic, etc.) go out the window.
(Another way of thinking about this is that the distinction between ânaturalâ and âartificialâ intelligence is primarily a practical and historical one, not one that rests on a deep truth of computer science or rational agency; a more naturalistic approach would think of humans more as a weird special case of the extremely heterogeneous space of â(A)Iâ designs.)
I actually agree with you about this. I have in mind a different distinction, although I might not be explaining it well.
Hereâs another go:
Let's suppose that some decisions are rational and others aren't. We can then ask: What is it that makes a decision rational? What are the necessary and/or sufficient conditions? I think that this is the question that philosophers are typically trying to answer. The phrase "decision theory" in this context typically refers to a claim about necessary and/or sufficient conditions for a decision being rational. To use different jargon, in this context a "decision theory" refers to a proposed "criterion of rightness."
When philosophers talk about âCDT,â for example, they are typically talking about a proposed criterion of rightness. Specifically, in this context, âCDTâ is the claim that a decision is rational only if taking it would cause the largest expected increase in value. To avoid any ambiguity, letâs label this claim R_CDT.
We can also talk about âdecision procedures.â A decision procedure is just a process or algorithm that an agent follows when making decisions.
For each proposed criterion of rightness, itâs possible to define a decision procedure that only outputs decisions that fulfill the criterion. For example, we can define P_CDT as a decision procedure that involves only taking actions that R_CDT claims are rational.
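Schematically, the relationship between a criterion of rightness and its associated decision procedure can be sketched like this (all names here, including causal_expected_value, are hypothetical placeholders rather than anything defined in this thread):
```python
# Schematic only: derive a decision procedure P_X from a criterion of rightness R_X
# by having the procedure output whichever available action the criterion endorses.
def make_procedure(score_by_criterion):
    def procedure(available_actions, situation):
        return max(available_actions, key=lambda a: score_by_criterion(a, situation))
    return procedure

def r_cdt_score(action, situation):
    # R_CDT scores an action by the expected value it would *cause*;
    # `causal_expected_value` is a hypothetical stand-in for that calculation.
    return situation.causal_expected_value(action)

p_cdt = make_procedure(r_cdt_score)  # an agent running p_cdt "implements P_CDT"
```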
My understanding is that when philosophers talk about âCDT,â they primarily have in mind R_CDT. Meanwhile, it seems like members of the rationalist or AI safety communities primarily have in mind P_CDT.
The difference matters, because people who believe that R_CDT is true don't generally believe that we should build agents that implement P_CDT or that we should commit to following P_CDT ourselves. R_CDT claims that we should do whatever will have the best effects -- and, in many cases, building agents that follow a decision procedure other than P_CDT is likely to have the best effects. More generally: Most proposed criteria of rightness imply that it can be rational to build agents that sometimes behave irrationally.
One possible criterion of rightness, which Iâll call R_UDT, is something like this: An action is rational only if it would have been chosen by whatever decision procedure would have produced the most expected value if consistently followed over an agentâs lifetime. For example, this criterion of rightness says that it is rational to one-box in the transparent Newcomb scenario because agents who consistently follow one-boxing policies tend to do better over their lifetimes.
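As a quick worked illustration of the lifetime comparison R_UDT appeals to (the dollar amounts are the standard Newcomb figures, and the perfect-predictor assumption is mine):
```python
# Transparent Newcomb with a predictor that fills the opaque box iff it
# predicts the agent will one-box. Standard illustrative payoffs.
BIG, SMALL = 1_000_000, 1_000

def lifetime_value(policy, predictor_accuracy=1.0):
    # Expected payoff for an agent whose policy the predictor anticipates
    # with the given accuracy whenever this scenario comes up.
    if policy == "one_box":
        return predictor_accuracy * BIG
    else:  # "two_box"
        return predictor_accuracy * SMALL + (1 - predictor_accuracy) * (BIG + SMALL)

print(lifetime_value("one_box"), lifetime_value("two_box"))  # 1000000 vs 1000
# The one-boxing *policy* does better over a lifetime of such encounters, which is
# roughly the sense in which R_UDT counts one-boxing as rational, even though
# two-boxing would causally gain an extra $1,000 at the moment of choice.
```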
I could be wrong, but I associate the âsuccess-first approachâ with something like the claim that R_UDT is true. This would definitely constitute a really interesting and significant divergence from mainstream opinion within academic decision theory. Academic decision theorists should care a lot about whether or not itâs true.
But Iâm also not sure if it matters very much, practically, whether R_UDT or R_CDT is true. Itâs not obvious to me that they recommend building different kinds of decision procedures into AI systems. For example, both seem to recommend building AI systems that would one-box in the transparent Newcomb scenario.
I disagree that any of the distinctions here are purely semantic. But one could argue that normative anti-realism is true. In this case, there wouldnât really be any such thing as the criterion of rightness for decisions. Neither R_CDT nor R_UDT nor any other proposed criterion would be âcorrect.â
In this case, though, I think there would be even less reason to engage with academic decision theory literature. The literature would be focused on a question that has no real answer.
[[EDIT: Note that Will also emphasizes the importance of the criterion-of-rightness vs. decision-procedure distinction in his critique of the FDT paper: "[T]hey're [most often] asking what the best decision procedure is, rather than what the best criterion of rightness is... But, if that's what's going on, there are a whole bunch of issues to dissect. First, it means that FDT is not playing the same game as CDT or EDT, which are proposed as criteria of rightness, directly assessing acts. So it's odd to have a whole paper comparing them side-by-side as if they are rivals."]]
I agree that these three distinctions are important:
âPicking policies based on whether they satisfy a criterion Xâ vs. âPicking policies that happen to satisfy a criterion Xâ. (E.g., trying to pick a utilitarian policy vs. unintentionally behaving utilitarianly while trying to do something else.)
"Trying to follow a decision rule Y 'directly' or 'on the object level'" vs. "Trying to follow a decision rule Y by following some other decision rule Z that you think satisfies Y". (E.g., trying to naïvely follow utilitarianism without any assistance from sub-rules, heuristics, or self-modifications, vs. trying to follow utilitarianism by following other rules or mental habits you've come up with that you expected to make you better at selecting utilitarianism-endorsed actions.)
âA decision rule that prescribes outputting some action or policy and doesnât care how you do itâ vs. âA decision rule that prescribes following a particular set of cognitive steps that will then output some action or policyâ. (E.g., a rule that says âmaximize the aggregate welfare of moral patientsâ vs. a specific mental algorithm intended to achieve that end.)
The first distinction above seems less relevant here, since weâre mostly discussing AI systems and humans that are self-aware about their decision criteria and explicitly âtrying to do whatâs rightâ.
As a side-note, I do want to emphasize that from the MIRI clusterâs perspective, itâs fine for correct reasoning in AGI to arise incidentally or implicitly, as long as it happens somehow (and as long as the systemâs alignment-relevant properties arenât obscured and the system ends up safe and reliable).
The main reason to work on decision theory in AI alignment has never been âWhat if people donât make AI âdecision-theoreticâ enough?â or âWhat if people mistakenly think CDT is correct and so build CDT into their AI system?â The main reason is that the many forms of weird, inconsistent, and poorly-generalizing behavior prescribed by CDT and EDT suggest that there are big holes in our current understanding of how decision-making works, holes deep enough that weâve even been misunderstanding basic things at the level of âdecision-theoretic criterion of rightnessâ.
Itâs not that I want decision theorists to try to build AI systems (even notional ones). Itâs that there are things that currently seem fundamentally confusing about the nature of decision-making, and resolving those confusions seems like it would help clarify a lot of questions about how optimization works. Thatâs part of why these issues strike me as natural for academic philosophers to take a swing at (while also being continuous with theoretical computer science, game theory, etc.).
The second distinction (âfollowing a rule âdirectlyâ vs. following it by adopting a sub-rule or via self-modificationâ) seems more relevant. You write:
Far from being a distinction proponents of UDT/FDT neglect, this is one of the main grounds on which UDT/FDT proponents criticize CDT (from within the "success-first" tradition). This is because agents that are reflectively inconsistent in the manner of CDT -- ones that take actions they know they'll regret taking, wish they were following a different decision rule, etc. -- can be money-pumped and can otherwise lose arbitrary amounts of value.
A human following CDT should endorse âstop following CDT,â since CDT isnât self-endorsing. Itâs not even that they should endorse âkeep following CDT, but adopt a heuristic or sub-rule that helps us better achieve CDT endsâ; they need to completely abandon CDT even at the meta-level of âwhat sort of decision rule should I follow?â and modify themselves into purely following an entirely new decision rule, or else theyâll continue to perform poorly by CDTâs lights.
The decision rule that CDT does endorse loses a lot of the apparent elegance and naturalness of CDT. This rule, âson-of-CDTâ, is roughly:
Have whatever disposition-to-act gets the most utility, unless Iâm in future situations like âa twin prisonerâs dilemma against a perfect copy of my future self where the copy was forked from me before I started following this ruleâ, in which case ignore my correlation with that particular copy and make decisions as though our behavior is independent (while continuing to take into account my correlation with any copies of myself I end up in prisonerâs dilemmas with that were copied from my brain after I started following this rule).
The fact that CDT doesn't endorse itself (while other theories do), the fact that it needs self-modification abilities in order to perform well by its own lights (and other theories don't), and the fact that the theory it endorses is a strange Frankenstein theory (while there are simpler, cleaner theories available) would all be strikes against CDT on their own.
But this decision rule CDT endorses also still performs suboptimally (from the perspective of success-first decision theory). See the discussion of the Retro Blackmail Problem in âToward Idealized Decision Theoryâ, where âCDT and any decision procedure to which CDT would self-modify see losing money to the blackmailer as the best available action.â
In the kind of voting dilemma where a coalition of UDT agents will coordinate to achieve higher-utility outcomes, an agent who became a son-of-CDT agent at age 20 will coordinate with the group insofar as she expects her decision to be correlated with other agentsâ due to events that happened after she turned 20 (such as âthe summer after my 20th birthday, we hung out together and converged a lot in how we think about voting theoryâ). But sheâll refuse to coordinate for reasons like âwe hung out a lot the summer before my 20th birthdayâ, âwe spent our whole childhoods and teen years living together and learning from the same teachersâ, and âwe all have similar decision-making faculties due to being members of the same speciesâ. Thereâs no principled reason to draw this temporal distinction; itâs just an artifact of the fact that we started from CDT, and CDT is a flawed decision theory.
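To make the arbitrariness of that cutoff vivid, here is a minimal sketch (my own toy illustration, not anything from MIRI's papers) of the rule the son-of-CDT agent is effectively applying: a correlation with another agent counts as decision-relevant only if it arose after the moment of self-modification.

```python
# Toy illustration (assumed framing): son-of-CDT only treats a correlation as
# decision-relevant if it was produced after the agent self-modified.

def counts_for_son_of_cdt(correlation_formed_at: float,
                          self_modification_time: float) -> bool:
    return correlation_formed_at > self_modification_time

SELF_MODIFIED_AT = 20.0  # the agent adopted son-of-CDT at age 20

sources = [
    ("hung out the summer after turning 20", 20.5),
    ("hung out the summer before turning 20", 19.5),
    ("shared childhood teachers", 10.0),
    ("species-typical decision-making faculties", 0.0),
]
for label, formed_at in sources:
    print(label, "->", counts_for_son_of_cdt(formed_at, SELF_MODIFIED_AT))
```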
Regarding the third distinction (âprescribing a certain kind of output vs. prescribing a step-by-step mental procedure for achieving that kind of outputâ), Iâd say that itâs primarily the criterion of rightness that MIRI-cluster researchers care about. This is part of why the paper is called âFunctional Decision Theoryâ and not (e.g.) âAlgorithmic Decision Theoryâ: the focus is explicitly on âwhat outcomes do you produce?â, not on how you produce them.
(Thus, an FDT agent can cooperate with another agent whenever the latter agentâs input-output relations match FDTâs prescription in the relevant dilemmas, regardless of what computations they do to produce those outputs.)
The main reasons I think academic decision theory should spend more time coming up with algorithms that satisfy their decision rules are that (a) this has a track record of clarifying what various decision rules actually prescribe in different dilemmas, and (b) this has a track record of helping clarify other issues in the âunderstand what good reasoning isâ project (e.g., logical uncertainty) and how they relate to decision theory.
The second distinction here is most closely related to the one I have in mind, although I wouldnât say itâs the same. Another way to express the distinction I have in mind is that itâs between (a) a normative claim and (b) a process of making decisions.
âHedonistic utilitarianism is correctâ would be a non-decision-theoretic example of (a). âMaking decisions on the basis of coinflipsâ would be an example of (b).
In the context of decision theory, of course, I am thinking of R_CDT as an example of (a) and P_CDT as an example of (b).
I now have the sense Iâm probably not doing a good job of communicating what I have in mind, though.
I guess my view here is that exploring normative claims will probably only be pretty indirectly useful for understanding "how decision-making works," since normative claims don't typically seem to have any empirical/mathematical/etc. implications. For example, to again use a non-decision-theoretic example, I don't think that learning that hedonistic utilitarianism is true would give us much insight into the computer science or cognitive science of decision-making. Although we might have different intuitions here.
I agree that this is a worthwhile goal and that philosophers can probably contribute to it. I guess I'm just not sure that the question that most academic decision theorists are trying to answer -- and the literature they've produced on it -- will ultimately be very relevant.
The fact that R_CDT is "self-effacing" -- i.e. the fact that it doesn't always recommend following P_CDT -- definitely does seem like a point of intuitive evidence against R_CDT.
But I think R_UDT also has an important point in its disfavor. It fails to satisfy what might be called the "Don't Make Things Worse Principle," which says: it's not rational to take decisions that will definitely make things worse. Will's Bomb case is an example where R_UDT violates this principle, which is very similar to his "Guaranteed Payoffs Principle."
There's then a question of which of these considerations is more relevant when judging which of the two normative theories is more likely to be correct. The failure of R_UDT to satisfy the "Don't Make Things Worse Principle" seems more important to me, but I don't really know how to argue for this point beyond saying that this is just my intuition. I think that the failure of R_UDT to satisfy this principle -- or something like it -- is also probably the main reason why many philosophers find it intuitively implausible.
(IIRC the first part of Reasons and Persons is mostly a defense of the view that the correct theory of rationality may be self-effacing. But Iâm not really familiar with the state of arguments here.)
I actually don't think the son-of-CDT agent, in this scenario, will take these sorts of non-causal correlations into account at all. (Modifying just yourself to take non-causal correlations into account won't cause you to achieve better outcomes here.) So I don't think there should be any weird "Frankenstein" decision procedure thing going on.
....Thinking more about it, though, I'm now less sure how much the different normative decision theories should converge in their recommendations about AI design. I think they all agree that we should build systems that one-box in Newcomb-style scenarios. I think they also agree that, if we're building twins, then we should design these twins to cooperate in twin prisoner's dilemmas. But there may be some other contexts where acausal cooperation considerations do lead to genuine divergences. I don't have very clear/settled thoughts about this, though.
I think âDonât Make Things Worseâ is a plausible principle at first glance.
One argument against this principle is that CDT endorses following it if you must, but would prefer to self-modify to stop following it (since doing so has higher expected causal utility). The general policy of following the âDonât Make Things Worse Principleâ makes things worse.
Once youâve already adopted son-of-CDT, which says something like âact like UDT in future dilemmas insofar as the correlations were produced after I adopted this rule, but act like CDT in those dilemmas insofar as the correlations were produced before I adopted this ruleâ, itâs not clear to me why you wouldnât just go: âOh. CDT has lost the thing I thought made it appealing in the first place, this âDonât Make Things Worseâ feature. If weâre going to end up stuck with UDT plus extra theoretical ugliness and loss-of-utility tacked on top, then why not just switch to UDT full stop?â
A more general argument against the Bomb intuition pump is that it involves trading away larger amounts of utility in most possible world-states, in order to get a smaller amount of utility in the Bomb world-state. From Abram Demskiâs comments:
And:
This just seems to be the point that R_CDT is self-effacing: It says that people should not follow P_CDT, because following other decision procedures will produce better outcomes in expectation.
I definitely agree that R_CDT is self-effacing in this way (at least in certain scenarios). The question is just whether self-effacingness or failure to satisfy âDonât Make Things Worseâ is more relevant when trying to judge the likelihood of a criterion of rightness being correct. Iâm not sure whether itâs possible to do much here other than present personal intuitions.
The point that R_UDT violates the "Don't Make Things Worse" principle only infrequently seems relevant, but I'm still not sure this changes my intuitions very much.
I may just be missing something, but I don't see what this theoretical ugliness is. And I don't intuitively find the ugliness/elegance of the decision procedure recommended by a criterion of rightness to be very relevant when trying to judge whether the criterion is correct.
[[EDIT: Just an extra thought on the fact that R_CDT is self-effacing. My impression is that self-effacingness is typically regarded as a relatively weak reason to reject a moral theory. For example, a lot of people regard utilitarianism as self-effacing both because it's costly to directly evaluate the utility produced by actions and because others often react poorly to people who engage in utilitarian-style reasoning -- but this typically isn't regarded as a slam-dunk reason to believe that utilitarianism is false. I think the SEP article on consequentialism is expressing a pretty mainstream position when it says: "[T]here is nothing incoherent about proposing a decision procedure that is separate from one's criterion of the right.... Criteria can, thus, be self-effacing without being self-refuting." Insofar as people don't tend to buy self-effacingness as a slam-dunk argument against the truth of moral theories, it's not clear why they should buy it as a slam-dunk argument against the truth of normative decision theories.]]
Sorry to drop in in the middle of this back and forth, but I am curious -- do you think it's quite likely that there is a single criterion of rightness that is objectively "correct"?
It seems to me that we have a number of intuitive properties (meta criteria of rightness?) that we would like a criterion of rightness to satisfy (e.g. âdonât make things worseâ, or âdonât be self-effacingâ). And so far there doesnât seem to be any single criterion that satisfies all of them.
So why not just conclude that, similar to the case with voting and Arrowâs theorem, perhaps thereâs just no single perfect criterion of rightness.
In other words, once we agree that CDT doesnât make things worse, but that UDT is better as a general policy, is there anything left to argue about about which is âcorrectâ?
EDIT: Decided I had better go and read your Realism and Rationality post, and ended up leaving a lengthy comment there.
Happy to be dropped in on :)
I think itâs totally conceivable that no criterion of rightness is correct (e.g. because the concept of a âcriterion of rightnessâ turns out to be some spooky bit of nonsense that doesnât really map onto anything in the real world.)
I suppose the main things Iâm arguing are just that:
1. When a philosopher expresses support for a "decision theory," they are typically saying that they believe some claim about what the correct criterion of rightness is.
2. Claims about the correct criterion of rightness are distinct from decision procedures.
3. Therefore, when a member of the rationalist community uses the word "decision theory" to refer to a decision procedure, they are talking about something that's pretty conceptually distinct from what philosophers typically have in mind. Discussions about what decision procedure performs best or about what decision procedure we should build into future AI systems [[EDIT: or what decision procedure most closely matches our preferences about decision procedures]] don't directly speak to the questions that most academic "decision theorists" are actually debating with one another.
I also think that, conditional on there being a correct criterion of rightness, R_CDT is more plausible than R_UDT. But this is a relatively tentative view. Iâm definitely not a super hardcore R_CDT believer.
I guess here -- in almost definitely too many words -- is how I think about the issue here. (Hopefully these comments are at least somewhat responsive to your question.)
It seems like the following general situation is pretty common: Someone is initially inclined to think that anything with property P will also have properties Q1 and Q2. But then they realize that properties Q1 and Q2 are inconsistent with one another.
One possible reaction to this situation is to conclude that nothing actually has property P. Maybe the idea of property P isnât even conceptually coherent and we should stop talking about it (while continuing to independently discuss properties Q1 and Q2). Often the more natural reaction, though, is to continue to believe that some things have property Pâbut just drop the assumption that these things will also have both property Q1 and property Q2.
This is obviously a pretty abstract description, so I'll give a few examples. (No need to read the examples if the point seems obvious.)
Ethics: I might initially be inclined to think that itâs always ethical (property P) to maximize happiness and that itâs always unethical to torture people. But then I may realize that thereâs an inconsistency here: in at least rare circumstances, such as ticking time-bomb scenarios where torture can extract crucial information, there may be no decision that is both happiness maximizing (Q1) and torture-avoiding (Q2). It seems like a natural reaction here is just to drop either the belief that maximizing happiness is always ethical or that torture is always unethical. It doesnât seem like I need to abandon my belief that some actions have the property of being ethical.
Theology: I might initially be inclined to think that God is all-knowing, all-powerful, and all-good. But then I might come to believe (whether rightly or not) that, given the existence of evil, these three properties are inconsistent. I might then continue to believe that God exists, but just drop my belief that God is all-good. (To very awkwardly re-express this in the language of properties: This would mean dropping my belief that any entity that has the property of being God also has the property of being all-good).
Politician-bashing: I might initially be inclined to characterize some politician both as an incompetent leader and as someone whoâs successfully carrying out an evil long-term plan to transform the country. Then I might realize that these two characterizations are in tension with one another. A pretty natural reaction, then, might be to continue to believe the politician existsâbut just drop my belief that theyâre incompetent.
To turn to the case of the decision-theoretic criterion of rightness, I might initially be inclined to think that the correct criterion of rightness will satisfy both "Don't Make Things Worse" and "No Self-Effacement." It's now become clear, though, that no criterion of rightness can satisfy both of these principles. I think it's pretty reasonable, then, to continue to believe that there's a correct criterion of rightness -- but just drop the belief that the correct criterion of rightness will also satisfy "No Self-Effacement."
Thanks! This is helpful.
I think I disagree with the claim (or implication) that keeping P is more often the more natural reaction. Well, you're just saying it's "often" natural, and I suppose it's natural in some cases and not others. But I think we may disagree on how often it's natural, though hard to say at this very abstract level. (Did you see my comment in response to your Realism and Rationality post?)
In particular, I'm curious what makes you optimistic about finding a "correct" criterion of rightness. In the case of the politician, it seems clear that learning they don't have some of the properties you thought they had shouldn't call into question whether they exist at all.
But for the case of a criterion of rightness, my intuition (informed by the style of thinking in my comment), is that thereâs no particular reason to think there should be one criterion that obviously fits the bill. Your intuition seems to be the opposite, and Iâm not sure I understand why.
My best guess, particularly informed by reading through footnote 15 on your Realism and Rationality post, is that when you're faced with ethical dilemmas (like your torture vs. lollipop examples), it seems to you like there is a correct answer. Does that seem right?
(I realize at this point weâre talking about intuitions and priors on a pretty abstract level, so it may be hard to give a good answer.)
Hey again!
I appreciated your comment on the LW post. I started writing up a response to this comment and your LW one, back when the thread was still active, and then stopped because it had become obscenely long. Then I ended up badly needing to procrastinate doing something else today. So hereâs an over-long document I probably shouldnât have written, which you are under no social obligation to read.
Thanks! Just read it.
I think there's a key piece of your thinking that I don't quite understand / disagree with, and it's the idea that normativity is irreducible.
I think I follow you that if normativity were irreducible, then it wouldnât be a good candidate for abandonment or revision. But that seems almost like begging the question. I donât understand why itâs irreducible.
Suppose normativity is not actually one thing, but is a jumble of 15 overlapping things that sometimes come apart. This doesnât seem like it poses any challenge to your intuitions from footnote 6 in the document (starting with âI personally care a lot about the question: âIs there anything I should do, and, if so, what?ââ). And at the same time it explains why there are weird edge cases where the concept seems to break down.
So few things in life seem to be irreducible. (E.g. neither Eric nor Ben is irreducible!) So why would normativity be?
[You also should feel under no social obligation to respond, though it would be fun to discuss this the next time we find ourselves at the same party, should such a situation arise.]
This is a good discussion! Ben, thank you for inspiring so many of these different paths weâve been going down. :) At some point the hydra will have to stop growing, but I do think the intuitions youâve been sharing are widespread enough that itâs very worthwhile to have public discussion on these points.
On the contrary:
MIRI is more interested in identifying generalizations about good reasoning (âcriteria of rightnessâ) than in fully specifying a particular algorithm.
MIRI does discuss decision algorithms in order to better understand decision-making, but this isnât different in kind from the ordinary way decision theorists hash things out. E.g., the traditional formulation of CDT is underspecified in dilemmas like Death in Damascus. Joyce and Arntzeniusâ response to this wasnât to go âalgorithms are uncouth in our fieldâ; it was to propose step-by-step procedures that they think capture the intuitions behind CDT and give satisfying recommendations for how to act.
MIRI does discuss âwhat decision procedure performs bestâ, but this isnât any different from traditional arguments in the field like ânaive EDT is wrong because it performs poorly in the smoking lesion problemâ. Compared to the average decision theorist, the average rationalist puts somewhat more weight on some considerations and less weight on others, but this isnât different in kind from the ordinary disagreements that motivate different views within academic decision theory, and these disagreements about what weight to give categories of consideration are themselves amenable to argument.
As I noted above, MIRI is primarily interested in decision theory for the sake of better understanding the nature of intelligence, optimization, embedded agency, etc., not for the sake of picking a âdecision theory we should build into future AI systemsâ. Again, this doesnât seem unlike the case of philosophers who think that decision theory arguments will help them reach conclusions about the nature of rationality.
Could you give an example of what the correctness of a meta-criterion like âDonât Make Things Worseâ could in principle consist in?
Iâm not looking here for a âreductionâ in the sense of a full translation into other, simpler terms. I just want a way of making sense of how human brains can tell whatâs âdecision-theoretically normativeâ in cases like this.
Human brains didnât evolve to have a primitive ânormativity detectorâ that beeps every time a certain thing is Platonically Normative. Rather, different kinds of normativity can be understood by appeal to unmysterious matters like âthings brains value as endsâ, âthings that are useful for various endsâ, âthings that accurately map states of affairsâ...
When I think of other examples of normativity, my sense is that in every case thereâs at least one good account of why a human might be able to distinguish âtrulyâ normative things from non-normative ones. E.g. (considering both epistemic and non-epistemic norms):
1. If I discover two alien species who disagree about the truth-value of âcarbon atoms have six protonsâ, I can evaluate their correctness by looking at the world and seeing whether their statement matches the world.
2. If I discover two alien species who disagree about the truth value of âpawns cannot move backwards in chessâ or âthere are statements in the language of Peano arithmetic that can neither be proved nor disproved in Peano arithmeticâ, then I can explain the rules of âproving things about chessâ or âproving things about PAâ as a symbol game, and write down strings of symbols that collectively constitute a âproofâ of the statement in question.
I can then assert that if any member of any species plays the relevant âproofâ game using the same rules, from now until the end of time, they will never prove the negation of my result, and (paper, pen, time, and ingenuity allowing) they will always be able to re-prove my result.
(I could further argue that these symbol games are useful ones to play, because various practical tasks are easier once we've accumulated enough knowledge about legal proofs in certain games. This usefulness itself provides a criterion for choosing between "follow through on the proof process" and "just start doodling things or writing random letters down".)
The above doesnât answer questions like âdo the relevant symbols have Platonic objects as truthmakers or referents?â, or âwhy do we live in a consistent universe?â, or the like. But the above answer seems sufficient for rejecting any claim that thereâs something pointless, epistemically suspect, or unacceptably human-centric about affirming Gödelâs first incompleteness theorem. The above is minimally sufficient grounds for going ahead and continuing to treat math as something more significant than theology, regardless of whether we then go on to articulate a more satisfying explanation of why these symbol games work the way they do.
3. If I discover two alien species who disagree about the truth-value of âsuffering is terminally valuableâ, then I can think of at least two concrete ways to evaluate which parties are correct. First, I can look at the brains of a particular individual or group, see what that individual or group terminally values, and see whether the statement matches whatâs encoded in those brains. Commonly the group I use for this purpose is human beings, such that if an alien (or a housecat, etc.) terminally values suffering, I say that this is âwrongâ.
Alternatively, I can make different "wrong" predicates for each species: wrong_human, wrong_alien1, wrong_alien2, wrong_housecat, etc.
This has the disadvantage of maybe making it sound like all these values are on "equal footing" in an internally inconsistent way ("it's wrong to put undue weight on what's wrong_human!", where the first "wrong" is secretly standing in for "wrong_human"), but has the advantage of making it easy to see why the aliens' disagreement might be important and substantive, while still allowing that aliens' normative claims can be wrong (because they can be mistaken about their own core values).
The details of how to go from a brain to an encoding of âwhatâs rightâ seem incredibly complex and open to debate, but it seems beyond reasonable dispute that if the information content of a set of terminal values is encoded anywhere in the universe, itâs going to be in brains (or constructs from brains) rather than in patterns of interstellar dust, digits of pi, physical laws, etc.
If a criterion like âDonât Make Things Worseâ deserves a lot of weight, I want to know what that weight is coming from.
If the answer is âI know it has to come from something, but I donât know what yetâ, then that seems like a perfectly fine placeholder answer to me.
If the answer is âThis is like the âterminal valuesâ case, in that (I hypothesize) itâs just an ineradicable component of what humans care aboutâ, then that also seems structurally fine, though Iâm extremely skeptical of the claim that the âwarm glow of feeling causally efficaciousâ is important enough to outweigh other things of great value in the real world.
If the answer is âI think âDonât Make Things Worseâ is instrumentally useful, i.e., more useful than UDT for achieving the other things humans want in lifeâ, then I claim this is just false. But, again, this seems like the right kind of argument to be making; if CDT is better than UDT, then that betterness ought to consist in something.
I mostly agree with this. I think the disagreement between CDT and FDT/UDT advocates is less about definitions, and more about which of these things feels more compelling:
1. On the whole, FDT/UDT ends up with more utility. (See the toy expected-value sketch after these two points.)
(I think this intuition tends to hold more force with people the more emotionally salient âmore utilityâ is to you. E.g., consider a version of Newcombâs problem where two-boxing gets you $100, while one-boxing gets you $100,000 and saves your childâs life.)
2. Iâm not the slave of my decision theory, or of the predictor, or of any environmental factor; I can freely choose to do anything in any dilemma, and by choosing to not leave money on the table (e.g., in a transparent Newcomb problem with a 1% chance of predictor failure where Iâve already observed that the second box is empty), Iâm âgetting away with somethingâ and getting free utility that the FDT agent would miss out on.
(I think this intuition tends to hold more force with people the more emotionally salient it is to imagine the dollars sitting right there in front of you and you knowing that itâs âtoo lateâ for one-boxing to get you any more utility in this world.)
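Here is a minimal sketch of the first intuition, using the standard Newcomb payoffs and an assumed predictor-accuracy parameter (my own illustration; nothing here is from the original discussion):

```python
# Standard Newcomb payoffs: the opaque box holds $1M iff the predictor expects
# one-boxing; the transparent box always holds $1,000. `accuracy` is how often
# the predictor correctly anticipates the agent's policy.

def expected_payoff(policy: str, accuracy: float) -> float:
    if policy == "one-box":
        return accuracy * 1_000_000 + (1 - accuracy) * 0
    else:  # "two-box"
        return accuracy * 1_000 + (1 - accuracy) * 1_001_000

for accuracy in (0.99, 0.9, 0.51):
    print(accuracy,
          expected_payoff("one-box", accuracy),
          expected_payoff("two-box", accuracy))
# Even at 51% accuracy, the one-boxing policy comes out ahead in expectation.
```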
There are other considerations too, like how much it matters to you that CDT isnât self-endorsing. CDT prescribes self-modifying in all future dilemmas so that you behave in a more UDT-like way. Itâs fine to say that you personally lack the willpower to follow through once you actually get into the dilemma and see the boxes sitting in front of you; but itâs still the case that a sufficiently disciplined and foresightful CDT agent will generally end up behaving like FDT in the very dilemmas that have been cited to argue for CDT.
If a more disciplined and well-prepared version of you would have one-boxed, then isnât there something off about saying that two-boxing is in any sense âcorrectâ? Even the act of praising CDT seems a bit self-destructive here, inasmuch as (a) CDT prescribes ditching CDT, and (b) realistically, praising or identifying with CDT is likely to make it harder for a human being to follow through on switching to son-of-CDT (as CDT prescribes).
Mind you, if the sentence âCDT is the most rational decision theoryâ is true in some substantive, non-trivial, non-circular sense, then Iâm inclined to think we should acknowledge this truth, even if it makes it a bit harder to follow through on the EDT+CDT+UDT prescription to one-box in strictly-future Newcomblike problems. When the truth is inconvenient, I tend to think itâs better to accept that truth than to linguistically conceal it.
But the arguments Iâve seen for âCDT is the most rational decision theoryâ to date have struck me as either circular, or as reducing to âI know CDT doesnât get me the most utility, but something about it just feels rightâ.
Itâs fine, I think, if âit just feels rightâ is meant to be a promissory note for some forthcoming account â a clue that thereâs some deeper reason to favor CDT, though we havenât discovered it yet. As the FDT paper puts it:
On the other hand, if âit just feels rightâ is meant to be the final word on why âCDT is the most rational decision theoryâ, then I feel comfortable saying that ârationalâ is a poor choice of word here, and neither maps onto a key descriptive category nor maps onto any prescription or norm worthy of being followed.
My impression is that most CDT advocates who know about FDT think FDT is making some kind of epistemic mistake, where the most popular candidate (I think) is some version of magical thinking.
Superstitious people often believe that itâs possible to directly causally influence things across great distances of time and space. At a glance, FDTâs prescription (âone-box, even though you canât causally affect whether the box is fullâ) as well as its account of how and why this works (âyou can somehow âcontrolâ the properties of abstract objects like âdecision functionsââ) seem weird and spooky in the manner of a superstition.
FDTâs response: if a thing seems spooky, thatâs a fine first-pass reason to be suspicious of it. But at some point, the accusation of magical thinking has to cash out in some sort of practical, real-world failureâin the case of decision theory, some systematic loss of utility that isnât balanced by an equal, symmetric loss of utility from CDT. After enough experience of seeing a tool outperforming the competition in scenario after scenario, at some point calling the use of that tool âmagical thinkingâ starts to ring rather hollow. At that point, itâs necessary to consider the possibility that FDT is counter-intuitive but correct (like Einsteinâs âspukhafte Fernwirkungâ), rather than magical.
In turn, FDT advocates tend to think the following reflects an epistemic mistake by CDT advocates:
The alleged mistake here is a violation of naturalism. Humans tend to think of themselves as free Cartesian agents acting upon the world, rather than as deterministic subprocesses of a larger deterministic process. If we consistently and whole-heartedly accepted the âdeterministic subprocessâ view of our decision-making, we would find nothing strange about the idea that itâs sometimes right for this subprocess to do locally incorrect things for the sake of better global results.
E.g., consider the transparent Newcomb problem with a 1% chance of predictor error. If we think of the brainâs decision-making as a rule-governed system whose rules we are currently determining (via a meta-reasoning process that is itself governed by deterministic rules), then thereâs nothing strange about enacting a rule that gets us $1M in 99% of outcomes and $0 in 1% of outcomes; and following through when the unlucky 1% scenario hits us is nothing to agonize over, itâs just a consequence of the rule we already decided. In that regard, steering the rule-governed system that is your brain is no different than designing a factory robot that performs well enough in 99% of cases to offset the 1% of cases where something goes wrong.
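A quick back-of-the-envelope version of that comparison (a sketch under the usual transparent-Newcomb payoff assumptions; the numbers are mine, not from the original comment):

```python
ERROR = 0.01  # the predictor's error rate

# Rule A: one-box no matter what you see, i.e. follow through even when the
# unlucky 1% scenario leaves the box empty.
ev_rule_a = (1 - ERROR) * 1_000_000 + ERROR * 0

# Rule B: take both boxes no matter what you see.
ev_rule_b = (1 - ERROR) * 1_000 + ERROR * 1_001_000

print(ev_rule_a)  # 990000.0 -- $1M in 99% of outcomes, $0 in the unlucky 1%
print(ev_rule_b)  # 11000.0  -- $1,000 in 99% of outcomes, $1,001,000 in the lucky 1%
```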
(Note how a lot of these points are more intuitive in CS language. I donât think itâs a coincidence that people coming from CS were able to improve on academic decision theoryâs ideas on these points; I think itâs related to what kinds of stumbling blocks get in the way of thinking in these terms.)
Suppose you initially tell yourself:
Suppose that you then find yourself facing the 1%-likely outcome where Omega leaves the box empty regardless of your choice. You then have a change of heart and decide to two-box after all, taking the $1000.
I claim that the above description feels from the inside like your brain is escaping the iron chains of determinism (even if your scientifically literate system-2 verbal reasoning fully recognizes that youâre a deterministic process). And I claim that this feeling (plus maybe some reluctance to fully accept the problem description as accurate?) is the only thing that makes CDTâs decision seem reasonable in this case.
In reality, however, if we end up not following through on our verbal commitment and we two-box in that 1% scenario, then this would just prove that we'd been mistaken about what rule we had successfully installed in our brains. As it turns out, we were really following the lower-global-utility rule from the outset. A lack of follow-through or a failure of will is itself a part of the decision-making process that Omega is predicting; however much it feels as though a last-minute swerve is you "getting away with something", it's really just you deterministically following through on an algorithm that will get you less utility in 99% of scenarios (while happening to be bad at predicting your own behavior and bad at following through on verbalized plans).
I should emphasize that the above is my own attempt to characterize the intuitions behind CDT and FDT, based on the arguments Iâve seen in the wild and based on what makes me feel more compelled by CDT, or by FDT. I could easily be wrong about the crux of disagreement between some CDT and FDT advocates.
Is the following a roughly accurate re-characterization of the intuition here?
"Suppose that there's an agent that implements P_UDT. Because it is following P_UDT, when it enters the box room it finds a ton of money in the first box and then refrains from taking the money in the second box. People who believe R_CDT claim that the agent should have also taken the money in the second box. But, given that the universe is deterministic, this doesn't really make sense. From before the moment the agent entered the room, it was already determined that the agent would one-box. Since (in a physically deterministic sense) the P_UDT agent could not have two-boxed, there's no relevant sense in which the agent should have two-boxed."
If so, then I suppose my first reaction is that this seems like a general argument against normative realism rather than an argument against any specific proposed criterion of rightness. It also applies, for example, to the claim that a P_CDT agent âshould haveâ one-boxedâsince in a physically deterministic sense it could not have. Therefore, I think itâs probably better to think of this as an argument against the truth (and possibly conceptual coherence) of both R_CDT and R_UDT, rather than an argument that favors one over the other.
In general, it seems to me like all statements that evoke counterfactuals have something like this problem. For example, it is physically determined what sort of decision procedure we will build into any given AI system; only one choice of decision procedure is physically consistent with the state of the world at the time the choice is made. Soâinsofar as we accept this kind of objection from determinismâthere seems to be something problematically non-naturalistic about discussing what âwould have happenedâ if we built in one decision procedure or another.
No, I donât endorse this argument. To simplify the discussion, letâs assume that the Newcomb predictor is infallible. FDT agents, CDT agents, and EDT agents each get a decision: two-box (which gets you $1000 plus an empty box), or one-box (which gets you $1,000,000 and leaves the $1000 behind). Obviously, insofar as they are in fact following the instructions of their decision theory, thereâs only one possible outcome; but it would be odd to say that a decision stops being a decision just because itâs determined by something. (Whatâs the alternative?)
I do endorse "given the predictor's perfect accuracy, it's impossible for the P_UDT agent to two-box and come away with $1,001,000". I also endorse "given the predictor's perfect accuracy, it's impossible for the P_CDT agent to two-box and come away with $1,001,000". Per the problem specification, no agent can two-box and get $1,001,000 or one-box and get $0. But this doesn't mean that no decision is made; it just means that the predictor can predict the decision early enough to fill the boxes accordingly.
(Notably, the agent following P_CDT two-boxes because $1,001,000 > $1,000,000 and $1000 > $0, even though this "dominance" argument appeals to two outcomes that are known to be impossible just from the problem statement. I certainly don't think agents "should" try to achieve outcomes that are impossible from the problem specification itself. The reason non-CDT agents get more utility than CDT agents in Newcomb's problem is that they take into account that the predictor is a predictor when they construct their counterfactuals.)
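To spell out which outcomes the perfect-predictor assumption rules out, here is a minimal sketch (my own toy enumeration, using the standard payoffs assumed above):

```python
def payoff(prediction: str, action: str) -> int:
    big_box = 1_000_000 if prediction == "one-box" else 0  # filled iff one-boxing was predicted
    small_box = 1_000
    return big_box + (small_box if action == "two-box" else 0)

for prediction in ("one-box", "two-box"):
    for action in ("one-box", "two-box"):
        possible = prediction == action  # a perfect predictor never mispredicts
        print(f"predicted {prediction}, chose {action}: ${payoff(prediction, action):,}",
              "(possible)" if possible else "(ruled out by the problem statement)")
```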
In the transparent version of this dilemma, the agent who sees the $1M and one-boxes also âcould have two-boxedâ, but if they had two-boxed, it would only have been after making a different observation. In that sense, if the agent has any lingering uncertainty about what theyâll choose, the uncertainty goes away as soon as they see whether the box is full.
No, thereâs nothing non-naturalistic about this. Consider the scenario you and I are in. Simplifying somewhat, we can think of ourselves as each doing meta-reasoning to try to choose between different decision algorithms to follow going forward; where the new things we learn in this conversation are themselves a part of that meta-reasoning.
The meta-reasoning process is deterministic, just like the object-level decision algorithms are. But this doesn't mean that we can't choose between object-level decision algorithms. Rather, the meta-reasoning (in spite of having deterministic causes) chooses either "I think I'll follow P_FDT from now on" or "I think I'll follow P_CDT from now on". Then the chosen decision algorithm (in spite of also having deterministic causes) outputs choices about subsequent actions to take. Meta-processes that select between decision algorithms (to put into an AI, or to run in your own brain, or to recommend to other humans, etc.) can make "real decisions", for exactly the same reason (and in exactly the same sense) that the decision algorithms in question can make real decisions.
It isn't problematic that all these processes require us to consider counterfactuals that (if we were omniscient) we would perceive as inconsistent/impossible. Deliberation, both at the object level and at the meta level, just is the process of determining the unique and only possible decision. Yet because we are uncertain about the outcome of the deliberation while deliberating, and because the details of the deliberation process do determine our decision (even as these details themselves have preceding causes), it feels from the inside of this process as though both options are "live", are possible, until the very moment we decide.
(See also Decisions are for making bad outcomes inconsistent.)
I think you need to make a clearer distinction here between âoutcomes that donât exist in the universeâs dynamicsâ (like taking both boxes and receiving $1,001,000) and âoutcomes that canât exist in my branchâ (like there not being a bomb in the unlucky case). Because if youâre operating just in the branch you find yourself in, many outcomes whose probability an FDT agent is trying to affect are impossible from the problem specification (once you include observations).
And, to be clear, I do think agents âshouldâ try to achieve outcomes that are impossible from the problem specification including observations, if certain criteria are met, in a way that basically lines up with FDT, just like agents âshouldâ try to achieve outcomes that are already known to have happened from the problem specification including observations.
As an example, if you're in Parfit's Hitchhiker, you should pay once you reach town, even though reaching town has probability 1 in cases where you're deciding whether or not to pay; the reason is that paying was necessary for reaching town to have had probability 1.
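For concreteness, a minimal sketch of that point with assumed numbers (my own toy utilities; the only structural assumption is that the driver rescues you iff they predict you'll pay):

```python
VALUE_OF_BEING_RESCUED = 1_000_000  # assumed dollar-equivalent of not dying in the desert
PAYMENT = 1_000

# If you're the sort of agent who pays once in town, the driver predicts this and rescues you.
ev_payer = VALUE_OF_BEING_RESCUED - PAYMENT

# If you're the sort of agent who refuses to pay, you never reach town at all.
ev_refuser = 0

print(ev_payer, ev_refuser)  # 999000 0
```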
+1, I agree with all this.
Suppose that we accept the principle that agents never âshouldâ try to achieve outcomes that are impossible from the problem specificationâwith one implication being that itâs false that (as R_CDT suggests) agents that see a million dollars in the first box âshouldâ two-box.
This seems to imply that it's also false that (as R_UDT suggests) an agent that sees that the first box is empty "should" one-box. By the problem specification, of course, one-boxing when there is no money in the first box is also an impossible outcome. Since decisions to two-box only occur when the first box is empty, this would then imply that decisions to two-box are never irrational in the context of this problem. But I imagine you don't want to say that.
I think I probably still don't understand your objection here -- so I'm not sure this point is actually responsive to it -- but I initially have trouble seeing what potential violations of naturalism/determinism R_CDT could be committing that R_UDT would not also be committing.
(Of course, just to be clear, both R_UDT and R_CDT imply that the decision to commit yourself to a one-boxing policy at the start of the game would be rational. They only diverge in their judgments of what actual in-room boxing decision would be rational. R_UDT says that the decision to two-box is irrational and R_CDT says that the decision to one-box is irrational.)
That should be âa one-boxing policyâ, right?
Yep, thanks for the catch! Edited to fix.
It seems to me like they're coming down to saying something like: the "Guaranteed Payoffs Principle" / "Don't Make Things Worse Principle" is more core to rational action than being self-consistent. Whereas others think self-consistency is more important.
Itâs not clear to me that the justification for CDT is more circular than the justification for FDT. Doesnât it come down to which principles you favor?
Maybe you could say FDT is more elegant. Or maybe that it satisfies more of the intuitive properties weâd hope for from a decision theory (where elegance might be one of those). But Iâm not sure that would make the justification less-circular per se.
I guess one way the justification for CDT could be more circular is if the key or only principle that pushes in favor of it over FDT can really just be seen as a restatement of CDT in a way that the principles that push in favor of FDT do not. Is that what you would claim?
The main argument against CDT (in my view) is that it tends to get you less utility (regardless of whether you add self-modification so it can switch to other decision theories). Self-consistency is a secondary issue.
FDT gets you more utility than CDT. If you value literally anything in life more than you value âwhich ritual do I use to make my decisions?â, then you should go with FDT over CDT; thatâs the core argument.
This argument for FDT would be question-begging if CDT proponents rejected utility as a desirable thing. But instead CDT proponents who are familiar with FDT agree utility is a positive, and either (a) they think thereâs no meaningful sense in which FDT systematically gets more utility than CDT (which I think is adequately refuted by Abram Demski), or (b) they think that CDT has other advantages that outweigh the loss of utility (e.g., CDT feels more intuitive to them).
The latter argument for CDT isnât circular, but as a fan of utility (i.e., of literally anything else in life), it seems very weak to me.
I do think the argument ultimately needs to come down to an intuition about self-effacingness.
The fact that agents earn less expected utility if they implement P_CDT than if they implement some other decision procedure seems to support the claim that agents should not implement P_CDT.
But thereâs nothing logically inconsistent about believing both (a) that R_CDT is true and (b) that agents should not implement P_CDT. To again draw an analogy with a similar case, thereâs also nothing logically inconsistent about believing both (a) that utilitarianism is true and (b) that agents should not in general make decisions by carrying out utilitarian reasoning.
So why shouldn't I believe that R_CDT is true? The argument needs an additional step. And it seems to me like the most natural additional step here involves an intuition that the criterion of rightness would not be self-effacing.
More formally, it seems like the argument needs to be something along these lines:
1. Over their lifetimes, agents who implement P_CDT earn less expected utility than agents who implement certain other decision procedures.
2. (Assumption) Agents should implement whatever decision procedure will earn them the most expected lifetime utility.
3. Therefore, agents should not implement P_CDT.
4. (Assumption) The criterion of rightness is not self-effacing. Equivalently, if agents should not implement some decision procedure P_X, then it is not the case that R_X is true.
5. Therefore, as an implication of points (3) and (4), R_CDT is not true.
Whether you buy the "No Self-Effacement" assumption in Step 4 -- or, alternatively, the countervailing "Don't Make Things Worse" assumption that supports R_CDT -- seems to ultimately be a matter of intuition. At least, I don't currently know what else people can appeal to here to resolve the disagreement.
[[SIDENOTE: Step 2 is actually a bit ambiguous, since it doesnât specify how expected lifetime utility is being evaluated. For example, are we talking about expected lifetime utility from a causal or evidential perspective? But I donât think this ambiguity matters much for the argument.]]
[[SECOND SIDENOTE: Iâm using the phrase âself-effacingâ rather than âself-contradictoryâ here, because I think itâs more standard and because âself-contradictoryâ seems to suggest logical inconsistency.]]
If the thing being argued for is âR_CDT plus P_SONOFCDTâ, then that makes sense to me, but is vulnerable to all the arguments Iâve been making: Son-of-CDT is in a sense the worst of both worlds, since it gets less utility than FDT and lacks CDTâs âDonât Make Things Worseâ principle.
If the thing being argued for is âR_CDT plus P_FDTâ, then I donât understand the argument. In what sense is P_FDT compatible with, or conducive to, R_CDT? What advantage does this have over âR_FDT plus P_FDTâ? (Indeed, what difference between the two views would be intended here?)
The argument against âR_CDT plus P_SONOFCDTâ doesnât require any mention of self-effacingness; itâs entirely sufficient to note that P_SONOFCDT gets less utility than P_FDT.
The argument against "R_CDT plus P_FDT" seems to demand some reference to self-effacingness or inconsistency, or triviality / lack of teeth. But I don't understand what this view would mean or why anyone would endorse it (and I don't take you to be endorsing it).
We want to evaluate actual average utility rather than expected utility, since the different decision theories are different theories of what âexpected utilityâ means.
Hm, I think I may have misinterpreted your previous comment as emphasizing the point that P_CDT "gets you less utility" rather than the point that P_SONOFCDT "gets you less utility." So my comment was aiming to explain why I don't think the fact that P_CDT gets less utility provides a strong challenge to the claim that R_CDT is true (unless we accept the "No Self-Effacement Principle"). But it sounds like you might agree that this fact doesn't on its own provide a strong challenge.
In response to the first argument alluded to here: âGets the most [expected] utilityâ is ambiguous, as I think weâve both agreed.
My understanding is that P_SONOFCDT is definitionally the policy that, if an agent decided to adopt it, would cause the largest increase in expected utility. So -- if we evaluate the expected utility of a decision to adopt a policy from a causal perspective -- it seems to me that P_SONOFCDT "gets the most expected utility."
If we evaluate the expected utility of a policy from an evidential or subjunctive perspective, however, then another policy may "get the most utility" (because policy adoption decisions may be non-causally correlated).
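A minimal sketch of how those two ways of scoring a policy can come apart, using assumed prisoner's-dilemma payoffs and a twin dilemma (my own illustration, not anything from the thread):

```python
# Payoffs to "me" for (my action, twin's action); C = cooperate, D = defect.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def causal_score(my_policy: str, twin_policy_held_fixed: str) -> int:
    # Causal evaluation: adopting a policy doesn't causally change my twin,
    # so the twin's behavior is held fixed.
    return PAYOFF[(my_policy, twin_policy_held_fixed)]

def evidential_score(my_policy: str) -> int:
    # Evidential evaluation: the twin is a copy of me, so my adopting a policy
    # is (near-)perfect evidence that the twin adopts it too.
    return PAYOFF[(my_policy, my_policy)]

for twin in ("C", "D"):
    print("twin fixed at", twin, "->", causal_score("C", twin), causal_score("D", twin))
# Defecting scores higher causally whatever the twin does (3 vs 5, and 0 vs 1) ...
print(evidential_score("C"), evidential_score("D"))  # ... but cooperating scores higher evidentially (3 vs 1).
```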
Apologies if Iâm off-base, but it reads to me like you might be suggesting an argument along these lines:
1. R_CDT says that it is rational to decide to follow a policy that would not maximize "expected utility" (defined in evidential/subjunctive terms).
2. (Assumption) But it is not rational to decide to follow a policy that would not maximize "expected utility" (defined in evidential/subjunctive terms).
3. Therefore R_CDT is not true.
The natural response to this argument is that itâs not clear why we should accept the assumption in Step 2. R_CDT says that the rationality of a decision depends on its âexpected utilityâ defined in causal terms. So someone starting from the position that R_CDT is true obviously wonât accept the assumption in Step 2. R_EDT and R_FDT say that the rationality of a decision depends on its âexpected utilityâ defined in evidential or subjunctive terms. So we might allude to R_EDT or R_FDT to justify the assumption, but of course this would also mean arguing backwards from the conclusion that the argument is meant to reach.
Overall, at least this particular simple argument -- that R_CDT is false because P_SONOFCDT gets less "expected utility" as defined in evidential/quasi-evidential terms -- would seemingly fail due to circularity. But you may have in mind a different argument.
I felt confused by this comment. Doesnât even R_FDT judge the rationality of a decision by its expected value (rather than its actual value)? And presumably you donât want to say that someone who accepts unpromising gambles and gets lucky (ending up with high actual average utility) has made more ârationalâ decisions than someone who accepts promising gambles and gets unlucky (ending up with low actual average utility)?
You also correctly point out that the decision procedure that R_CDT implies agents should rationally commit toâP_SONOFCDTâsometimes outputs decisions that definitely make things worse. So âDonât Make Things Worseâ implies that some of the decisions outputted by P_SONOFCDT are irrational.
But I still donât see what the argument is here unless weâre assuming âNo Self-Effacement.â It still seems to me like we have a few initial steps and then a missing piece.
1. (Observation) R_CDT implies that it is rational to commit to following the decision procedure P_SONOFCDT.
2. (Observation) P_SONOFCDT sometimes outputs decisions that definitely make things worse.
3. (Assumption) It is irrational to take decisions that definitely make things worse. In other words, the "Don't Make Things Worse" Principle is true.
4. Therefore, as an implication of Step 2 and Step 3, P_SONOFCDT sometimes outputs irrational decisions.
5. ???
6. Therefore, R_CDT is false.
The âNo Self-Effacementâ Principle is equivalent to the principle that: If a criterion of rightness implies that it is rational to commit to a decision procedure, then that decision procedure only produces rational actions. So if we were to assume âNo Self-Effacementâ in Step 5 then this would allow us to arrive at the conclusion that R_CDT is false. But if weâre not assuming âNo Self-Effacement,â then itâs not clear to me how we get there.
Actually, in the context of this particular argument, I suppose we donât really have the option of assuming that âNo Self-Effacementâ is trueâbecause this assumption would be inconsistent with the earlier assumption that âDonât Make Things Worseâ is true. So Iâm not sure itâs actually possible to make this argument schema work in any case.
There may be a pretty different argument here, which you have in mind. I at least donât see it yet though.
Perhaps the argument is something like:
1. "Don't make things worse" (DMTW) is one of the intuitions that leads us to favoring R_CDT.
2. But the actual policy that R_CDT recommends does not in fact follow DMTW.
3. So R_CDT only gets intuitive appeal from DMTW to the extent that DMTW was about R's, and not about P's.
4. But intuitions are probably(?) not that precisely targeted, so R_CDT shouldn't get to claim the full intuitive endorsement of DMTW. (Yes, DMTW endorses it more than it endorses R_FDT, but R_CDT is still at least somewhat counter-intuitive when judged against the DMTW intuition.)
Here are two logically inconsistent principles that could be true:
Don't Make Things Worse: If a decision would definitely make things worse, then taking that decision is not rational.
Don't Commit to a Policy That In the Future Will Sometimes Make Things Worse: It is not rational to commit to a policy that, in the future, will sometimes output decisions that definitely make things worse.
I have strong intuitions that the first one is true. I have much weaker (comparatively negligible) intuitions that the second one is true. Since they're mutually inconsistent, I reject the second and accept the first. I imagine this is also true of most other people who are sympathetic to R_CDT.
One could argue that R_CDT sympathists donât actually have much stronger intuitions regarding the first principle than the secondâi.e. that their intuitions arenât actually very âtargetedâ on the first oneâbut I donât think that would be right. At least, itâs not right in my case.
A more viable strategy might be to argue for something like a meta-principle:
The âDonât Make Things Worseâ Meta-Principle: If you find âDonât Make Things Worseâ strongly intuitive, then you should also find âDonât Commit to a Policy That In the Future Will Sometimes Make Things Worseâ just about as intuitive.
If the meta-principle were true, then I guess this would sort of imply that peopleâs intuitions in favor of âDonât Make Things Worseâ should be self-neutralizing. They should come packaged with equally strong intuitions for another position that directly contradicts it.
But I don't see why the meta-principle should be true. At least, my intuitions in favor of the meta-principle are way less strong than my intuitions in favor of "Don't Make Things Worse" :)
Just to say slightly more on this, I think the Bomb case is again useful for illustrating my (I think not uncommon) intuitions here.
Bomb Case: Omega puts a million dollars in a transparent box if he predicts youâll open it. He puts a bomb in the transparent box if he predicts you wonât open it. Heâs only wrong about one in a trillion times.
Now suppose you enter the room and see that thereâs a bomb in the box. You know that if you open the box, the bomb will explode and you will die a horrible and painful death. If you leave the room and donât open the box, then nothing bad will happen to you. Youâll return to a grateful family and live a full and healthy life. You understand all this. You want so badly to live. You then decide to walk up to the bomb and blow yourself up.
Intuitively, this decision strikes me as deeply irrational. Youâre intentionally taking an action that you know will cause a horrible outcome that you want badly to avoid. It feels very relevant that youâre flagrantly violating the âDonât Make Things Worseâ principle.
Now, let's step back a time step. Suppose you know that you're the sort of person who would refuse to kill yourself by detonating the bomb. You might decide that -- since Omega is such an accurate predictor -- it's worth taking a pill that turns you into the sort of person who would open the box no matter what, to increase your odds of getting a million dollars. You recognize that this may lead you, in the future, to take an action that makes things worse in a horrifying way. But you calculate that the decision you're making now is nonetheless making things better in expectation.
This decision strikes me as pretty intuitively rational. You're violating the second principle -- the "Don't Commit to a Policy..." Principle -- but this violation just doesn't seem that intuitively relevant or remarkable to me. I personally feel like there is nothing too odd about the idea that it can be rational to commit to violating principles of rationality in the future.
(This is obviously just a description of my own intuitions, as they stand, though.)
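For what it's worth, here is the ex-ante arithmetic behind "making things better in expectation" -- a minimal sketch with an assumed (and obviously contestable) dollar-equivalent disvalue for death:

```python
ERROR = 1e-12            # Omega is wrong about one in a trillion times
DEATH = -1e9             # assumed dollar-equivalent utility of a horrible death
MILLION = 1_000_000

# Disposition 1: open the box no matter what (the disposition the pill buys you).
ev_opener = (1 - ERROR) * MILLION + ERROR * DEATH

# Disposition 2: refuse to open the box, whatever it contains.
ev_refuser = 0.0

print(ev_opener, ev_refuser)  # roughly 999999.999 vs 0.0
```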
By triggering the bomb, youâre making things worse from your current perspective, but making things better from the perspective of earlier you. Doesnât that seem strange and deserving of an explanation? The explanation from a UDT perspective is that by updating upon observing the bomb, you actually changed your utility function. You used to care about both the possible worlds where you end up seeing a bomb in the box, and the worlds where you donât. After updating, you think youâre either a simulation within Omegaâs prediction so your action has no effect on yourself or youâre in the world with a real bomb, and you no longer care about the version of you in the world with a million dollars in the box, and this accounts for the conflict/âinconsistency.
Given the human tendency to change our (UDT-)utility functions by updating, it's not clear what to do (or what is right), and I think this reduces UDT's intuitive appeal and makes it less of a slam-dunk over CDT/EDT. But it seems to me that it takes switching to the UDT perspective to even understand the nature of the problem. (Quite possibly this isn't adequately explained in MIRI's decision theory papers.)
I would agree that, with these two principles as written, more people would agree with the first. (And certainly believe you that thatâs right in your case.)
But I feel like the second doesn't quite capture what I had in mind regarding the DMTW intuition applied to P's.
Consider an alternate version:
Or alternatively:
It seems to me that these two claims are naively intuitive on their face, in roughly the same way that the "... then taking that decision is not rational." version is. And it's only after you've considered prisoners' dilemmas or Newcomb's paradox, etc., that you realize that good policy (or being a rational agent) actually diverges from what's rational in the moment.
(But maybe others would disagree on how intuitive these versions are.)
EDIT: And to spell out my argument a bit more: if several alternate formulations of a principle are each intuitively appealing, and it turns out that whether some claim (e.g. that R_CDT is true) is consistent with the principle comes down to the precise formulation used, then it's not quite fair to say that the principle fully endorses the claim and that the claim is not counter-intuitive from the perspective of the original intuition.
Of course, this argument is moot if it's true that the original DMTW intuition was always about rational in-the-moment action, and never about policies or actors. And maybe that's the case? But I think it's a little more ambiguous with the "... is not good policy" or "a rational person would not..." versions than with the "Don't commit to a policy..." version.
EDIT2: Does what I'm trying to say make sense? (I felt like I was struggling a bit to express myself in this comment.)
Just as a quick sidenote:
I've been thinking of P_SONOFCDT as, by definition, the decision procedure that R_CDT implies it is rational to commit to implementing.
If we define P_SONOFCDT this way, then anyone who believes that R_CDT is true must also believe that it is rational to implement P_SONOFCDT.
The belief that R_CDT is true and the belief that it is rational to implement P_FDT would then be consistent only if P_SONOFCDT were equivalent to P_FDT (which of course it isn't). So I would be inclined to say that no one should believe in both the correctness of R_CDT and the rationality of implementing P_FDT.
[[EDIT: Actually, I need to distinguish between the decision procedure that it would be rational to commit yourself to and the decision procedure that it would be rational to build into an agent. These can sometimes be different. For example, suppose that R_CDT is true and that you're building twin AI systems and you would like them both to succeed. Then it would be rational for you to give them decision procedures that will cause them to cooperate if they face each other in a prisoner's dilemma (e.g. some version of P_FDT). But if R_CDT is true and you've just been born into the world as one of the twins, it would be rational for you to commit to a decision procedure that would cause you to defect if you face the other AI system in a prisoner's dilemma (i.e. P_SONOFCDT). I slightly edited the above comment to reflect this. My tentative view, which I've alluded to above, is that the various proposed criteria of rightness don't in practice actually diverge all that much when it comes to the question of what sorts of decision procedures we should build into AI systems. Although I also understand that MIRI is not mainly interested in the question of what sorts of decision procedures we should build into AI systems.]]
Do you mean
It seems to better fit the pattern of the example just prior.
This is similar to how you described it here:
This seems like it should instead be a 2x2 grid: something can be either normative or non-normative, and if it's normative, it can be either an algorithm/procedure that's being recommended, or a criterion of rightness like "a decision is rational iff taking it would cause the largest expected increase in value" (which we can perhaps think of as generalizing over a set of algorithms, and saying all the algorithms in a certain set are "normative" or "endorsed").
Some of your discussion above seems to be focusing on the "algorithmic?" dimension, while other parts seem focused on "normative?". I'll say more about "normative?" here.
The reason I proposed the three distinctions in my last comment and organized my discussion around them is that I think they're pretty concrete and crisply defined. It's harder for me to accidentally switch topics or bundle two different concepts together when talking about "trying to optimize vs. optimizing as a side-effect", "directly optimizing vs. optimizing via heuristics", "initially optimizing vs. self-modifying to optimize", or "function vs. algorithm".
In contrast, I think "normative" and "rational" can mean pretty different things in different contexts, it's easy to accidentally slide between different meanings of them, and their abstractness makes it easy to lose track of what's at stake in the discussion.
E.g., "normative" is often used in the context of human terminal values, and it's in this context that statements like this ring obviously true:
If we're treating decision-theoretic norms as being like moral norms, then sure. I think there are basically three options:
1. Decision theory isn't normative.
2. Decision theory is normative in the way that "murder is bad" or "improving aggregate welfare is good" is normative, i.e., it expresses an arbitrary terminal value of human beings.
3. Decision theory is normative in the way that game theory, probability theory, Boolean logic, the scientific method, etc. are normative (at least for beings that want accurate beliefs); or in the way that the rules and strategies of chess are normative (at least for beings that want to win at chess); or in the way that medical recommendations are normative (at least for beings that want to stay healthy).
Probability theory has obvious normative force in the context of reasoning and decision-making, but it's not therefore arbitrary or irrelevant to understanding human cognition, AI, etc.
A lot of the examples you've cited are theories from moral philosophy about what's terminally valuable. But decision theory is generally thought of as the study of how to make the right decisions, given a set of terminal preferences; it's not generally thought of as the study of which decision-making methods humans happen to terminally prefer to employ. So I would put it in category 1 or 3.
You could indeed define an agent that terminally values making CDT-style decisions, but I don't think most proponents of CDT or EDT would claim that their disagreement with UDT/FDT comes down to a values disagreement like that. Rather, they'd claim that rival decision theorists are making some variety of epistemic mistake. (And I would agree that the disagreement comes down to one party or the other making an epistemic mistake, though I obviously disagree about who's mistaken.)
In the twin prisoner's dilemma with son-of-CDT, both agents are following son-of-CDT and neither is following CDT (regardless of whether the fork happened before or after the switchover to son-of-CDT).
I think you can model the voting dilemma the same way, just with noise added because the level of correlation is imperfect and/or uncertain. Ten agents following the same decision procedure are trying to decide whether to stay home and watch a movie (which gives a small guaranteed benefit) or go to the polls (which costs them the utility of the movie, but gains them a larger utility iff the other nine agents go to the polls too). Ten FDT agents will vote in this case, if they know that the other agents will vote under similar conditions.
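Here is a rough sketch of how that noise-added version might be computed from the FDT-style perspective, where the other agents' choices are treated as covarying with mine. The utility numbers and the simple independence model (each of the other nine agents "matches" my choice with some fixed probability) are my own illustrative assumptions, not anything specified above.

```python
# Rough sketch of the correlated-voting case, evaluated from a perspective on
# which the other agents' choices covary with mine (FDT-style). The numbers
# and the independence assumption are illustrative only.

MOVIE_UTILITY = 1.0   # small guaranteed benefit of staying home
WIN_UTILITY = 20.0    # larger benefit if all ten agents go to the polls

def ev_vote(correlation: float, n_others: int = 9) -> float:
    # I forgo the movie; the big payoff arrives only if all nine others also vote.
    p_all_others_vote = correlation ** n_others
    return p_all_others_vote * WIN_UTILITY

def ev_stay(correlation: float, n_others: int = 9) -> float:
    # I keep the movie; if the others are correlated with me, they likely stay too.
    return MOVIE_UTILITY

for c in (0.5, 0.9, 0.99, 1.0):
    print(f"correlation={c:.2f}: EV(vote)={ev_vote(c):.2f}, EV(stay)={ev_stay(c):.2f}")
# With near-perfect correlation, voting clearly wins (about 20 vs 1); as the
# correlation decays, the noise eventually makes staying home look better again.
```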
[[Disclaimer: I'm not sure this will be useful, since it seems like most discussions that verge on meta-ethics end up with neither side properly understanding the other.]]
I think the kind of decision theory that philosophers tend to work on is typically explicitly described as "normative." (For example, the SEP article on decision theory is about "normative decision theory.") So when I'm talking about "academic decision theories" or "proposed criteria of rightness" I'm talking about normative theories. When I use the word "rational" I'm also referring to a normative property.
I don't think there's any very standard definition of what it means for something to be normative, maybe because it's often treated as something pretty close to a primitive concept, but a partial account is that a "normative theory" is a claim about what someone should do. At least this is what I have in mind. This is different from the second option you list (and I think the third one).
Some normative theories concern "ends." These are basically claims about what people should do, if they can freely choose outcomes. For example: A subjectivist theory might say that people should maximize the fulfillment of their own personal preferences (whatever they are). Whereas a hedonistic utilitarian theory might say that people should maximize total happiness. I'm not sure what the best terminology is, and think this choice is probably relatively non-standard, but let's label these "moral theories."
Some normative theories, including "decision theories," concern "means." These theories put aside the question of which ends people should pursue and instead focus on how people should respond to uncertainty about the results/implications of their actions. For example: Expected utility theory says that people should take whatever actions maximize expected fulfillment of the relevant ends. Risk-weighted expected utility theory (and other alternative theories) say different things. Typical versions of CDT and EDT flesh out expected utility theory in different ways to specify what the relevant measure of "expected fulfillment" is.
Moral theory and normative decision theory seem to me to have pretty much the same status. They are both bodies of theory that bear on what people should do. On some views, the division between them is more a matter of analytic convenience than anything else. For example, David Enoch, a prominent meta-ethicist, writes: "In fact, I think that for most purposes [the line between the moral and the non-moral] is not a line worth worrying about. The distinction within the normative between the moral and the non-moral seems to me to be shallow compared to the distinction between the normative and the non-normative" (Taking Morality Seriously, 86).
One way to think of moral theories and normative decision theories is as two components that fit together to form more fully specified theories about what people should do. Moral theories describe the ends people should pursue; given these ends, decision theories then describe what actions people should take when in states of uncertainty. To illustrate, two examples of more complete normative theories that combine moral and decision-theoretic components would be: "You should take whatever action would in expectation cause the largest increase in the fulfillment of your preferences" and "You should take whatever action would, if you took it, lead you to anticipate the largest expected amount of future happiness in the world." The first is subjectivism combined with CDT, while the second is total view hedonistic utilitarianism combined with EDT.
(On this conception, a moral theory is not a description of "an arbitrary terminal value of human beings." Decision theory here also is not "the study of which decision-making methods humans happen to terminally prefer to employ." These are both theories about what people should do, rather than theories about what people's preferences are.)
Normativity is obviously pretty often regarded as a spooky or insufficiently explained thing. So a plausible position is normative anti-realism: It might be the case that no normative claims are true, either because they're all false or because they're not even well-formed enough to take on truth values. If normative anti-realism is true, then one thing this means is that the philosophical decision theory community is mostly focused on a question that doesn't really have an answer.
If I'm someone with a twin and I'm implementing P_CDT, I still don't think I will choose to modify myself to cooperate in twin prisoner's dilemmas. The reason is that modifying myself won't cause my twin to cooperate; it will only cause me to cooperate, lowering the utility I receive.
(The fact that P_CDT agents won't modify themselves to cooperate with their twins could of course be interpreted as a mark against R_CDT.)
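To put the causal point in concrete terms, here is a minimal sketch of the comparison a P_CDT agent makes when its twin was copied in the past. The payoff numbers are standard illustrative prisoner's-dilemma values of my own choosing, not anything specified in the thread.

```python
# Sketch of the P_CDT reasoning above, for a twin copied in the PAST.
# PAYOFFS[(my_move, twin_move)] -> my utility; numbers are illustrative only.
PAYOFFS = {
    ("C", "C"): 3,  # mutual cooperation
    ("C", "D"): 0,  # I cooperate, twin defects
    ("D", "C"): 5,  # I defect, twin cooperates
    ("D", "D"): 1,  # mutual defection
}

def cdt_value_of_self_modifying(twin_move: str) -> dict:
    # CDT evaluates self-modification while holding the already-forked twin's
    # behavior fixed: my modification has no causal arrow into the twin.
    return {
        "modify (so I cooperate)": PAYOFFS[("C", twin_move)],
        "don't modify (so I defect)": PAYOFFS[("D", twin_move)],
    }

for twin_move in ("C", "D"):
    print(f"twin plays {twin_move}: {cdt_value_of_self_modifying(twin_move)}")
# Whatever the twin does, not modifying scores higher (5 > 3 and 1 > 0),
# which is the sense in which a P_CDT agent won't self-modify here.
# (If the copy were only made after the modification, the modification would
# causally fix the twin's behavior too, a case taken up later in the thread.)
```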
I appreciate you taking the time to lay out these background points, and it does help me better understand your position, Ben; thanks!
Some ancient Greeks thought that the planets were intelligent beings; yet many of the Greeks' astronomical observations, and some of their theories and predictive tools, were still true and useful.
I think that terms like "normative" and "rational" are underdefined, so the question of realism about them is underdefined (cf. Luke Muehlhauser's pluralistic moral reductionism).
I would say that (1) some philosophers use "rational" in a very human-centric way, which is fine as long as it's done consistently; (2) others have a much more thin conception of "rational", such as "tending to maximize utility"; and (3) still others want to have their cake and eat it too, building in a lot of human-value-specific content to their notion of "rationality", but then treating this conception as though it had the same level of simplicity, naturalness, and objectivity as 2.
I think that type-1, type-2, and type-3 decision theorists have all contributed valuable AI-relevant conceptual progress in the past (most obviously, by formulating Newcomb's problem, EDT, and CDT), and I think all three could do more of the same in the future. I think the type-3 decision theorists are making a mistake, but often more in the fashion of an ancient astronomer who's accumulating useful and real knowledge but happens to have some false side-beliefs about the object of study, not in the fashion of a theologian whose entire object of study is illusory. (And not in the fashion of a developmental psychologist or historian whose field of study is too human-centric to directly bear on game theory, AI, etc.)
I'd expect type-2 decision theorists to tend to be interested in more AI-relevant things than type-1 decision theorists, but on the whole I think the flavor of decision theory as a field has ended up being more type-2/3 than type-1. (And in this case, even type-1 analyses of "rationality" can be helpful for bringing various widespread background assumptions to light.)
This is true if your twin was copied from you in the past. If your twin will be copied from you in the future, however, then you can indeed cause your twin to cooperate, assuming you have the ability to modify your own future decision-making so as to follow son-of-CDT's prescriptions from now on.
Making the commitment to always follow son-of-CDT is an action you can take; the mechanistic causal consequence of this action is that your future brain and any physical systems that are made into copies of your brain in the future will behave in certain systematic ways. So from your present perspective (as a CDT agent), you can causally control future copies of yourself, as long as the act of copying hasn't happened yet.
(And yes, by the time you actually end up in the prisoner's dilemma, your future self will no longer be able to causally affect your copy. But this is irrelevant from the perspective of present-you; to follow CDT's prescriptions, present-you just needs to pick the action that you currently judge will have the best consequences, even if that means binding your future self to take actions contrary to CDT's future prescriptions.)
(If it helps, don't think of the copy of you as "you": just think of it as another environmental process you can influence. CDT prescribes taking actions that change the behavior of future copies of yourself in useful ways, for the same reason CDT prescribes actions that change the future course of other physical processes.)
Thank you for taking the time to respond as well! :)
I'm not positive I understand what (1) and (3) are referring to here, but I would say that there's also at least a fourth way that philosophers often use the word "rational" (which is also the main way I use the word "rational"). This is to refer to an irreducibly normative concept.
The basic thought here is that not every concept can be usefully described in terms of more primitive concepts (i.e. "reduced"). As a close analogy, a dictionary cannot give useful non-circular definitions of every possible word; it requires the reader to have a pre-existing understanding of some foundational set of words. As a wonkier analogy, if we think of the space of possible concepts as a sort of vector space, then we sort of require an initial "basis" of primitive concepts that we use to describe the rest of the concepts.
Some examples of concepts that are arguably irreducible are "truth," "set," "property," "physical," "existence," and "point." Insofar as we can describe these concepts in terms of slightly more primitive ones, the descriptions will typically fail to be very useful or informative and we will typically struggle to break the slightly more primitive ones down any further.
To focus on the example of "truth," some people have tried to reduce the concept substantially. Some people have argued, for example, that when someone says that "X is true" what they really mean or should mean is "I personally believe X" or "believing X is good for you." But I think these suggested reductions pretty obviously don't entirely capture what people mean when they say "X is true." The phrase "X is true" also has an important meaning that is not amenable to this sort of reduction.
[[EDIT: "Truth" may be a bad example, since it's relatively controversial and since I'm pretty much totally unfamiliar with work on the philosophy of truth. But insofar as any concepts seem irreducible to you in this sense, or insofar as you buy the more general argument that some concepts will necessarily be irreducible, the particular choice of example used here isn't essential to the overall point.]]
Some philosophers also employ normative concepts that they say cannot be reduced in terms of non-normative (e.g. psychological) properties. These concepts are said to be irreducibly normative.
For example, here is Parfit on the concept of a normative reason (OWM, p. 1):
When someone says that a concept they are using is irreducible, this is obviously some reason for suspicion. A natural suspicion is that the real explanation for why they can't give a useful description is that the concept is seriously muddled or fails to grip onto anything in the real world. For example, whether this is fair or not, I have this sort of suspicion about the concept of "dao" in daoist philosophy.
But, again, it will necessarily be the case that some useful and valid concepts are irreducible. So we should sometimes take evocations of irreducible concepts seriously. A concept that is mostly undefined is not always problematically "underdefined."
When I talk about "normative anti-realism," I mostly have in mind the position that claims evoking irreducibly normative concepts are never true (either because these claims are all false or because they don't even have truth values). For example: Insofar as the word "should" is being used in an irreducibly normative sense, there is nothing that anyone "should" do.
[[Worth noting, though: The term "normative realism" is sometimes given a broader definition than the one I've sketched here. In particular, it often also includes a position known as "analytic naturalist realism" that denies the relevance of irreducibly normative concepts. I personally feel I understand this position less well, and I think I sometimes waffle between using the broader and narrower definition of "normative realism." I also more generally want to stress that not everyone who makes claims about "criteria of rightness" or employs other seemingly normative language is actually a normative realist in the narrow or even broad sense; what I'm doing here is just sketching one common, especially salient perspective.]]
One motivation for evoking irreducibly normative concepts is the observation that, in the context of certain discussions, it's not obvious that there's any close-to-sensible way to reduce the seemingly normative concepts that are being used.
For example, suppose we follow a suggestion once made by Eliezer to reduce the concept of "a rational choice" to the concept of "a winning choice" (or, in line with the type-2 conception you mention, a "utility-maximizing choice"). It seems difficult to make sense of a lot of basic claims about rationality if we use this reduction, and other obvious alternative reductions don't seem to fare much better. To mostly quote from a comment I made elsewhere:
FN15 in my post on normative realism elaborates on this point.
At the same time, though, I do think there are also really good and hard-to-counter epistemological objections to the existence of irreducibly normative properties (e.g. the objection described in this paper). You might also find the difficulty of reducing normative concepts a lot less obvious-seeming or problematic than I do. You might think, for example, that the difficulty of reducing "rationality" is less like the difficulty of reducing "truth" (which IMO mainly reflects the fact that truth is an important primitive concept) and more like the difficulty of defining the word "soup" in a way that perfectly matches our intuitive judgments about what counts as "soup" (which IMO mainly reflects the fact that "soup" is a high-dimensional concept). So I definitely don't want to say normative realism is obviously or even probably right.
I mainly just want to communicate the sort of thing that I think a decent chunk of philosophers have in mind when they talk about a "rational decision" or a "criterion of rightness." Although, philosophy being philosophy, plenty of people of course have in mind plenty of different things.
So, as an experiment, I'm going to be a very obstinate reductionist in this comment. I'll insist that a lot of these hard-seeming concepts aren't so hard.
Many of them are complicated, in the fashion of "knowledge" (they admit an endless variety of edge cases and exceptions), but these complications are quirks of human cognition and language rather than deep insights into ultimate metaphysical reality. And where there's a simple core we can point to, that core generally isn't mysterious.
It may be inconvenient to paraphrase the term away (e.g., because it packages together several distinct things in a nice concise way, or has important emotional connotations, or does important speech-act work like encouraging a behavior). But when I say it "isn't mysterious", I mean it's pretty easy to see how the concept can crop up in human thought even if it doesn't belong on the short list of deep fundamental cosmic structure terms.
Why is this a fourth way? My natural response is to say that normativity itself is either a messy, parochial human concept (like "love," "knowledge," "France"), or it's not (in which case it goes in bucket 2).
Picking on the concept here that seems like the odd one out to me: I feel confident that there isn't a cosmic law (of nature, or of metaphysics, etc.) that includes "truth" as a primitive (unless the list of primitives is incomprehensibly long). I could see an argument for concepts like "intentionality/reference", "assertion", or "state of affairs", though the former two strike me as easy to explain in simple physical terms.
Mundane empirical "truth" seems completely straightforward. Then there's the truth of sentences like "Frodo is a hobbit", "2+2=4", "I could have been the president", "Hamburgers are more delicious than battery acid"... Some of these are easier or harder to make sense of in the naive correspondence model, but regardless, it seems clear that our colloquial use of the word "true" to refer to all these different statements is pre-philosophical, and doesn't reflect anything deeper than that "each of these sentences at least superficially looks like it's asserting some state of affairs, and each sentence satisfies the conventional assertion-conditions of our linguistic community".
I think that philosophers are really good at drilling down on a lot of interesting details and creative models for how we can try to tie these disparate speech-acts together. But I think there's also a common failure mode in philosophy of treating these questions as deeper, more mysterious, or more joint-carving than the facts warrant. Just because you can argue about the truthmakers of "Frodo is a hobbit" doesn't mean you're learning something deep about the universe (or even something particularly deep about human cognition) in the process.
Suppose I build a robot that updates hypotheses based on observations, then selects actions that its hypotheses suggest will help it best achieve some goal. When the robot is deciding which hypotheses to put more confidence in based on an observation, we can imagine it thinking, "To what extent is observation o a [WORD] to believe hypothesis h?" When the robot is deciding whether it assigns enough probability to h to choose an action a, we can imagine it thinking, "To what extent is P(h)=0.7 a [WORD] to choose action a?" As a shorthand, when observation o updates a hypothesis h that favors an action a, the robot can also ask to what extent o itself is a [WORD] to choose a.
When two robots meet, we can moreover add that they negotiate a joint "compromise" goal that allows them to work together rather than fight each other for resources. In communicating with each other, they then start also using "[WORD]" where an action is being evaluated relative to the joint goal, not just the robot's original goal.
Thus when Robot A tells Robot B "I assign probability 90% to 'it's noon', which is [WORD] to have lunch", A may be trying to communicate that A wants to eat, or that A thinks eating will serve A and B's joint goal. (This gets even messier if the robots have an incentive to obfuscate which actions and action-recommendations are motivated by the personal goal vs. the joint goal.)
If you decide to relabel "[WORD]" as "reason", I claim that this captures a decent chunk of how people use the phrase "a reason". "Reason" is a suitcase word, but that doesn't mean there are no similarities between e.g. "data my goals endorse using to adjust the probability of a given hypothesis" and "probabilities-of-hypotheses my goals endorse using to select an action", or that the similarity is mysterious and ineffable.
(I recognize that the above story leaves out a lot of important and interesting stuff. Though past a certain point, I think the details will start to become Gettier-case nitpicks, as with most concepts.)
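If it helps make the robot story concrete, here is a minimal sketch of the bookkeeping it describes. The function names and the particular scoring rules (a likelihood-style update factor and an expected-utility score) are my own illustrative choices under the assumptions above, not a claim about how "reason" must be reduced.

```python
# Toy version of the robot story above. Function names and scoring rules are
# illustrative choices; the point is just that "o is a [WORD] to believe h" and
# "P(h) is a [WORD] to do a" can both be cashed out as ordinary computations
# inside a goal-directed agent.

def evidential_support(p_obs_given_h: float, p_obs: float) -> float:
    # "To what extent is observation o a [WORD] to believe hypothesis h?"
    # Here: a Bayesian update factor (how much the observation boosts h).
    return p_obs_given_h / p_obs

def action_support(p_h: float, utility_if_h: float, utility_if_not_h: float) -> float:
    # "To what extent is P(h) = p_h a [WORD] to choose action a?"
    # Here: the action's expected utility under the robot's goal.
    return p_h * utility_if_h + (1 - p_h) * utility_if_not_h

# Robot A's internal bookkeeping for "it's noon, which is [WORD] to have lunch":
print("support for 'it's noon':", evidential_support(p_obs_given_h=0.8, p_obs=0.4))
print("support for 'have lunch':", action_support(0.9, utility_if_h=5, utility_if_not_h=-1))
# When two such robots adopt a joint compromise goal, the same word gets reused
# with the utilities swapped out for the joint goal's, which is one source of
# the ambiguity discussed above.
```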
That essay isn't trying to "reduce" the term "rationality" in the sense of taking a pre-existing word and unpacking or translating it. The essay is saying that what matters is utility, and if a human being gets too invested in verbal definitions of "what the right thing to do is", they risk losing sight of the thing they actually care about and were originally in the game to try to achieve (i.e., their utility).
Therefore: if you're going to use words like "rationality", make sure that the words in question won't cause you to shoot yourself in the foot and take actions that will end up costing you utility (e.g., costing human lives, costing years of averted suffering, costing money, costing anything or everything). And if you aren't using "rationality" in a safe "nailed-to-utility" way, make sure that you're willing to turn on a dime and stop being "rational" the second your conception of rationality starts telling you to throw away value.
"Rationality" is a suitcase word. It refers to lots of different things. On LessWrong, examples include not just "(systematized) winning" but (as noted in the essay) "Bayesian reasoning", or, in Rationality: Appreciating Cognitive Algorithms, "cognitive algorithms or mental processes that systematically produce belief-accuracy or goal-achievement". In philosophy, the list is a lot longer.
The common denominator seems to largely be "something something reasoning / deliberation" plus (as you note) "something something normativity / desirability / recommendedness / requiredness".
The idea of "normativity" doesn't currently seem that mysterious to me either, though you're welcome to provide perplexing examples. My initial take is that it seems to be a suitcase word containing a bunch of ideas tied to:
Goals/preferences/values, especially overridingly strong ones.
Encouraged, endorsed, mandated, or praised conduct.
Encouraging, endorsing, mandating, and praising are speech-acts that seem very central to how humans perceive and intervene on social situations; and social situations seem pretty central to human cognition overall. So I don't think it's particularly surprising if words associated with such loaded ideas would have fairly distinctive connotations and seem to resist reduction, especially reduction that neglects the pragmatic dimensions of human communication and only considers the semantic dimension.
I may write up more object-level thoughts here, because this is interesting, but I just wanted to quickly emphasize the upshot that initially motivated me to write up this explanation.
(I don't really want to argue here that non-naturalist or non-analytic naturalist normative realism of the sort I've just described is actually a correct view; I mainly wanted to give a rough sense of what the view consists of and what leads people to it. It may well be the case that the view is wrong, because all true normative-seeming claims are in principle reducible to claims about things like preferences. I think the comments you've just made cover some reasons to suspect this.)
The key point is just that when these philosophers say that "Action X is rational," they are explicitly reporting that they do not mean "Action X suits my terminal preferences" or "Action X would be taken by an agent following a policy that maximizes lifetime utility" or any other such reduction.
I think that when people are very insistent that they don't mean something by their statements, it makes sense to believe them. This implies that the question they are discussing ("What are the necessary and sufficient conditions that make a decision rational?") is distinct from questions like "What decision would an agent that tends to win take?" or "What decision procedure suits my terminal preferences?"
It may be the case that the question they are asking is confused or insensible (because any sensible question would be reducible), but it's in any case different. So I think it's a mistake to interpret at least these philosophers' discussions of "decision theories" or "criteria of rightness" as though they were discussions of things like terminal preferences or winning strategies. And it doesn't seem to me like the answer to the question they're asking (if it has an answer) would likely imply anything much about things like terminal preferences or winning strategies.
[[NOTE: Plenty of decision theorists are not non-naturalist or non-analytic naturalist realists, though. It's less clear to me how related or unrelated the thing they're talking about is to issues of interest to MIRI. I think that the conception of rationality I'm discussing here mainly just presents an especially clear case.]]
Just on this point: I think you're right I may be slightly glossing over certain distinctions, but I might still draw them slightly differently (rather than doing a 2x2 grid). Some different things one might talk about in this context:
1. Decisions
2. Decision procedures
3. The decision procedure that is optimal with regard to some given metric (e.g. the decision procedure that maximizes expected lifetime utility for some particular way of calculating expected utility)
4. The set of properties that makes a decision rational ("criterion of rightness")
5. A claim about what the criterion of rightness is ("normative decision theory")
6. The decision procedure that it would be rational to decide to build into an agent (as implied by the criterion of rightness)
(4), (5), and (6) have to do with normative issues, while (1), (2), and (3) can be discussed without getting into normativity.
My current-although-not-firmly-held view is also that (6) probably isn't very sensitive to what the criterion of rightness is, so in practice can be reasoned about without going too deep into the weeds thinking about competing normative decision theories.
Just want to note that I found the R_ vs P_ distinction to be helpful.
I think using those terms might be useful for getting at the core of the disagreement.
Lightly editing some thoughts I previously wrote up on this issue, somewhat in line with Paul's comments:
For more on this divide/points of disagreement, see Will MacAskill's essay on the Alignment Forum (with responses from MIRI researchers and others):
https://www.alignmentforum.org/posts/ySLYSsNeFL5CoAQzN/a-critique-of-functional-decision-theory
and, previously, Wolfgang Schwartz's review of Functional Decision Theory:
https://www.umsu.de/wo/2018/688
(with some LessWrong discussion here: https://www.lesswrong.com/posts/BtN6My9bSvYrNw48h/open-thread-january-2019#WocbPJvTmZcA2sKR6)
I'd also be interested in Buck's perspectives on this topic.
See also Paul Christiano's take: https://www.lesswrong.com/posts/n6wajkE3Tpfn6sd5j/christiano-decision-theory-excerpt
Thanks, I hadn't seen this.
See also bmg's LW post, Realism and rationality. Relevant excerpt: