Request for input on multiverse-wide superrationality (MSR)
I am currently working on a research project as part of CEA's summer research fellowship. I am building a simple model of so-called "multiverse-wide cooperation via superrationality" (MSR). The model should incorporate the most relevant uncertainties for determining possible gains from trade. To make this model maximally useful, I would like to ask others for their opinions on the idea of MSR. For instance, what are the main reasons you think MSR might be irrelevant or might not work as it is supposed to? Which questions are unanswered and need to be addressed before the merit of the idea can be assessed? I would be grateful for any input in the comments to this post or via email to johannes@foundational-research.org.
An overview of resources on MSR, including introductory texts, can be found at the link above. To briefly illustrate the idea, consider two artificial agents with identical source code playing a prisoner's dilemma. Even though the two agents cannot causally interact, each agent's action provides strong evidence about the other's action. Evidential decision theory and recently proposed variants of causal decision theory (Yudkowsky and Soares, 2018; Spohn, 2003; Poellinger, 2013) say that agents should take such evidence into account when making decisions. MSR is based on two claims: (i) humans on Earth are in a situation similar to that of the two AI agents: there is probably a large or infinite multiverse containing many exact copies of humans on Earth (Tegmark 2003, p. 464), as well as agents that are similar but not identical to humans; (ii) if humans and these other, similar agents take each other's preferences into account, then, due to gains from trade, everyone is better off than if everyone pursued only their own ends. It follows from (i) and (ii) that humans should take the preferences of other, similar agents in the multiverse into account, thereby producing evidence that those agents in turn take humans' preferences into account, which leaves everyone better off.
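To make the twin prisoner's dilemma concrete, here is a minimal sketch (with hypothetical payoffs chosen for illustration, not taken from any of the cited papers) of how an EDT-style calculation treats one's own choice as evidence about the copy's choice:

```python
# A minimal sketch (hypothetical payoffs): two agents run identical source code,
# so each agent's choice is treated as strong evidence about the other's choice,
# and an EDT-style expected-utility calculation favours cooperation.

PAYOFF = {  # (my action, their action) -> my payoff
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def edt_value(my_action, p_same=0.99):
    """Expected payoff if the copy's action matches mine with probability p_same."""
    other_same = my_action
    other_diff = "D" if my_action == "C" else "C"
    return (p_same * PAYOFF[(my_action, other_same)]
            + (1 - p_same) * PAYOFF[(my_action, other_diff)])

for action in ("C", "D"):
    print(action, edt_value(action))  # C: 2.97, D: 1.04 -- cooperation comes out ahead
```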
According to Oesterheld (2017, sec. 4), this idea could have far-reaching implications for prioritization. For instance, given MSR, some forms of moral advocacy could become ineffective: advocating for one's particular values provides evidence that others do the same, potentially neutralizing each other's efforts. Moreover, MSR could play a role in deciding which strategies to pursue in AI alignment: it could become especially valuable to ensure that an AGI will engage in multiverse-wide trade.
A few doubts:
1) It seems like MSR requires a multiverse large enough to contain many well-correlated agents, but not so large that it runs into the problems of infinite ethics. Most of my credence is on either no multiverse or an infinite multiverse, although I'm not particularly well-read on this issue.
2) My broad intuition is something like: "Insofar as we can know about the values of other civilisations, they're probably similar to our own. Insofar as we can't, MSR isn't relevant." There are probably exceptions, though (e.g. we could guess the direction in which an r-selected civilisation's values would differ from our own).
3) I worry that MSR is susceptible to some sort of self-mugging. I don't have a particular example, but the general idea is that you're correlated with other agents even when you're being very irrational, so you might end up doing things that seem arbitrarily irrational. This is just a half-formed thought, though, not a proper objection.
4) Lastly, I would have much more confidence in FDT and superrationality in general if there were a sensible metric of similarity between agents, apart from correlation. If you always cooperate in prisoner's dilemmas, your choices are perfectly correlated with CooperateBot's, but intuitively it would still be more rational to defect against CooperateBot, because your decision algorithm isn't similar to CooperateBot in the way that it's similar to your psychological twin's (see the toy calculation below). I guess this requires a solution to logical uncertainty, though.
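To spell out the CooperateBot point as a toy calculation (payoffs are made up for illustration):

```python
# Toy calculation (made-up payoffs): CooperateBot's action is a constant, so my
# choice carries no evidence about what it will do, and defecting against it
# dominates -- even if my outputs happen to match its outputs every time.

PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def value_against_cooperatebot(my_action):
    return PAYOFF[(my_action, "C")]  # CooperateBot cooperates whatever I do

print(value_against_cooperatebot("C"))  # 3
print(value_against_cooperatebot("D"))  # 5 -- matching outputs is not the relevant similarity
```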
Happy to discuss this more with you in person. Also, I suggest you cross-post to Less Wrong.
Re 4): Correlation or similarity between agents is not really a necessary condition for cooperation in the open-source PD. LaVictoire et al. (2014) and related papers showed that 'fair' agents with completely different implementations can cooperate. Roughly speaking, a fair agent can have any structure, as long as it implements "I'll cooperate with you if I can show that you'll cooperate with me". So maybe that's the measure you're looking for.
A population of fair agents is also typically a Nash equilibrium in such games, so you might expect them to sometimes evolve.
Source: LaVictoire, P., Fallenstein, B., Yudkowsky, E., Barasz, M., Christiano, P., & Herreshoff, M. (2014). Program equilibrium in the prisoner's dilemma via Löb's theorem. In AAAI Workshop on Multiagent Interaction without Prior Coordination.
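To give a rough feel for what a fair agent looks like, here is a toy bounded-simulation sketch; it is not the Löbian proof-search construction from the paper, just an analogue where provability is replaced by depth-limited simulation with an optimistic base case:

```python
# Toy bounded-simulation sketch of "fair" agents (not the Löbian construction
# from LaVictoire et al. 2014). Each bot maps (opponent, depth) to "C" or "D".

def cooperate_bot(opponent, depth):
    return "C"

def defect_bot(opponent, depth):
    return "D"

def fair_bot(opponent, depth):
    # Cooperate iff the simulated opponent cooperates against me.
    if depth == 0:
        return "C"  # optimistic base case, standing in for the Löbian step
    return "C" if opponent(fair_bot, depth - 1) == "C" else "D"

def fair_bot_v2(opponent, depth):
    # A differently written fair agent: same condition, different implementation.
    if depth == 0:
        return "C"
    simulated = opponent(fair_bot_v2, depth - 1)
    return simulated

print(fair_bot(fair_bot_v2, 3))    # C -- non-identical fair agents cooperate
print(fair_bot(defect_bot, 3))     # D -- a fair agent isn't exploited
print(fair_bot(cooperate_bot, 3))  # C
```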
The example you've given me shows that agents which implement exactly the same (high-level) algorithm can cooperate with each other. The metric I'm looking for is: how can we decide how similar two agents are when their algorithms are non-identical? Presumably we want a smoothness property for that metric, such that if our algorithms are very similar (e.g. differ only in some radically unlikely edge case), the reduction in cooperation is negligible. But it doesn't seem like anyone knows how to do this.
One way I imagine dealing with this is to posit an oracle that tells us with certainty, for any two algorithms and their decision situations, what the counterfactual joint outputs are. The smoothness then comes from our uncertainty about (i) the other agents' algorithms, (ii) their decision situations, and (iii) potentially the outputs of the oracle. The correlations vary smoothly as we vary our probability distributions over these things, but for a fully specified algorithm, situation, etc., the algorithms are always either logically identical or not.
Unfortunately, I don't know what the oracle would be doing in general. I could also imagine that, formulated this way, the conclusion turns out to be that humans never correlate with anything, for instance.
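A minimal sketch of that picture (all program names and credences here are hypothetical): the oracle's verdict for fully specified programs is all-or-nothing, and the smooth degree of correlation comes only from the distribution over which program the other agent is running.

```python
# Sketch of the oracle-plus-uncertainty picture (hypothetical names and numbers).

def oracle_outputs_match(my_program, other_program):
    # Stand-in for the oracle; what it should do in general is exactly
    # the open question.
    return my_program == other_program

def p_other_matches_me(my_program, credences_over_programs):
    """P(the other agent's output matches mine), given my credences over its program."""
    return sum(p for prog, p in credences_over_programs.items()
               if oracle_outputs_match(my_program, prog))

print(p_other_matches_me("my_algorithm",
                         {"my_algorithm": 0.6, "some_other_algorithm": 0.4}))
# 0.6 -- this varies smoothly as the credences shift, even though each
# individual oracle verdict is binary.
```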
Hey, a rough point on a doubt I have. Not sure if it’s useful/novel.
Going through the mental processes of a utilitarian (roughly defined) correlates with others making more utilitarian decisions as well (especially when they're similar to you in relevant personality traits and past exposure to philosophical ideas).
For example, if you act in less scope-insensitive, omission-bias-y, or ingroup-y ways, others will tend to do so as well. This includes edge cases, e.g. people who would otherwise have made decisions that roughly fall in the deontologist or virtue-ethics bucket.
Therefore, for every moment you shut off utilitarian-ish mental processes in favour of ones where you think you're doing moral trade (including hidden motivations, like rationalising acting from social proof or discomfort at diverging from your peers), your multi-universal compatriots will do likewise (especially in similar contexts).
(In case it looks like I’m justifying being a staunch utilitarian here, I have a more nuanced anti-realism view mixed in with lots of uncertainty on what makes sense.)
With MSR, I remain unsure how to calculate the measure of agents in other worlds who hold positions worth trading with, so that we can figure out how much we should acausally trade with each. I'm also unsure how to handle uncertainty about whether anyone will independently arrive at the same position you hold, and so be able to acausally trade with you, since you can't tell them what you would actually prefer.
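For what it's worth, here is a toy sketch (made-up names and numbers, not a method endorsed anywhere in the post) of the kind of weighting the first question is about, where each value system's weight depends on its estimated measure and its estimated decision correlation with us; the hard part is estimating those inputs, not the arithmetic.

```python
# Toy sketch (made-up names and numbers): weight each value system by
# estimated measure in the multiverse times estimated decision correlation
# with us, then normalise into weights for a compromise utility function.

value_systems = {
    # name: (estimated measure, estimated decision correlation with us)
    "values_A": (0.5, 0.9),
    "values_B": (0.3, 0.4),
    "values_C": (0.2, 0.1),
}

def trade_weights(systems):
    """Normalise measure * correlation into weights summing to one."""
    raw = {name: measure * corr for name, (measure, corr) in systems.items()}
    total = sum(raw.values())
    return {name: r / total for name, r in raw.items()}

print(trade_weights(value_systems))
# e.g. values_A gets 0.45 / 0.59 ≈ 0.76 of the weight under these made-up numbers.
```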
I still have doubts as to whether you should pay in Counterfactual Mugging since I believe that (non-quantum) probability is in the map rather than the territory. I haven’t had the opportunity to write up these thoughts yet as my current posts are building up towards it, but I can link you when I do.