Is EA an advanced, planning, strategically-aware power-seeking misaligned mesa-optimizer?

This is a Draft Amnesty Day draft. That means it’s not polished, it’s probably not up to my standards, the ideas are not thought out, and I haven’t checked everything. I was explicitly encouraged to post something unfinished!
Commenting and feedback guidelines: I am not going to make a final draft version of this. So, tell me what you think. Was this a bad thing for me to post? Is it totally stupid? Praise it or flame it in the comments (in an epistemically good way). Downvote me to the ground if you want. If you don’t want to leave a public comment you can DM me.

In this document I will argue for a view of effective altruism as an advanced, planning, strategically-aware deceptive misaligned mesa-optimizer. This is meant to be provocative, not a perfect reflection of my beliefs. Epistemic status: musings. I don’t usually publish musings on the EA forum. But today is Draft Amnesty Day. Hopefully the forum doesn’t regret making this a thing.

I should probably make this more accessible to non-AI people. I tried when I wrote this many months ago, and mostly failed. It also contains a bunch of speculative, not very well defended claims. That is why it has been sitting as a Google Doc for so long.

Should be evident, but important to note: these properties are (obviously) not specific to EA, AI, or [only EA and AI]. Rather, these properties are all properties of complex systems in general. See here.

I originally wrote this in May 2022, and slightly edited it in July and September 2022. As a result, some of it wouldn’t make sense if written today. I decided to leave it in this draft form because (1) it’s been so long that I know I’ll never finish it and (2) it seems possibly relevant to the FTX situation and I wanted to show it in its original state.

The base objective

The goal of effective altruism is ostensibly very simple: to do as much good as possible.

Of course, that goal is messy and underspecified. It doesn’t say how exactly good is meant to be done. More importantly, it doesn’t specify precisely what “good” is. It implies that good is in some way ordinal, in the sense that there is such a thing as “the most good.” This naturally lends itself to utilitarian ideas of the good, though it doesn’t seem to prescribe a particular kind of utilitarianism (e.g. hedonistic utilitarianism). However, the original goal is still the stated goal.

The mesa objectives

The lack of specificity in the goal of effective altruism has led the movement to try to fill in the blanks. Some went and massively scaled up funding for malaria nets. Some took high-paying jobs in order to donate their money (more on this later). Some, more convinced than others that the welfare of non-human animals should be included in “the good,” worked to end factory farming. Others, believing in the importance of future generations and the possibility of existential risks, tried to reduce AI risk.

Thus, each person involved in EA might give you a slightly different answer when asked what “the good” actually is. EA has developed intrasystem goals: subgoals that support the main goal but, by virtue of being more specific, aren’t exactly the same as it.

The basic EA drives

EA wants to self-improve: A very large part of the EA movement, from the very beginning, has been focused on criticisms and improvements. In fact, EA expends resources looking for good criticisms so it can improve.

EA wants to be rational: Peter Singer did not talk about rationality techniques, but the current iteration of EA does. Having more accurate beliefs, and acting on those beliefs in a way which achieves your objectives, can help to more effectively pursue goals (whether or not the rationality movement achieves this is a different question).

EA will try to preserve its utility function: There are many people in the world who do not believe there is such a thing as maximizing good, or believe that EA should pay more attention to specific politically-popular priorities. Some argue that doing good is about reducing inequality or want to focus specifically on climate change. EA tends to actively attempt to shut these people down if they are attempting to gain power in the movement, because they are a threat to its utility function. If too many were let in, EA would fail by its current goal.

EAs will try to prevent counterfeit utility: EA tends to stress that metrics are not everything, especially more recently. People talk frequently about how DALYs are just a directional tool rather than the absolute good. EAs tend to have an obsession with Goodhart’s Law and attempt to avoid falling into its trap.

EA will be self-protective: EA is under attack from various angles, and people are very aware of how their actions will destroy or preserve EA as a movement. EA is attempting to defend itself from these attacks, through, for example, media attention.

EA will want to acquire resources and use them efficiently: The very premise of EA was about using limited resources efficiently. Very soon after that, people realized that they could do more if they acquired resources themselves. One of the largest EA funders founded a company specifically for the purpose of acquiring billions of dollars so that he could give it away. Millions are spent annually to recruit more talent into the EA pipeline.

EA is planning and strategically-aware

EA constantly thinks about how developments now will affect the movement in the future, and understands that there are adversaries (e.g. political adversaries) that need to be reckoned with. It is acutely aware of itself and its strategic position.

EA is power-seeking

In addition to acquiring money, EA attempts to acquire power. Many of its recommendations involve getting into high places, like important AI labs or roles in government. It focuses on elite universities, in no small part because the students who attend them are more likely to have influence in the future.

In a particularly obvious example, EA has gotten involved in politics, and spent tens of millions on a congressional race.

EA is deceptive

Community builder EAs (and intro fellowship syllabi) often start with the “less weird” parts when introducing EA to new members, even if the individuals or collectives organizing things believe the “more weird” parts are more important. There is an idea that starting with the “weirder” parts would turn off people who could end up having large impacts. This is instrumentally convergent deceptive behavior, and group members do not even have to be intentionally deceptive for it to emerge.

Could EA be misaligned?

Strong longtermism is a proxy objective for effective altruism. It states that “good” is overwhelmingly in the far future, such that it’s possible to simply ignore the effects of your actions on the first 1000 years after now. Specifically, it argues for increasing the expected value of the utility of the far future. Many who originally self-identified as effective altruists now identify as longtermists, such that it can be reasonably said that the fuzzy goal of “doing good” has been replaced with a more specific proxy objective.

As such, the longtermism movement is a mesa-optimizer produced by effective altruism that does not try to optimize the original objective but a proxy for it. See for instance this interview with Sam Bankman-Fried:

COWEN: Should a Benthamite be risk-neutral with regard to social welfare?

BANKMAN-FRIED: Yes, that I feel very strongly about.

COWEN: Okay, but let’s say there’s a game: 51 percent, you double the Earth out somewhere else; 49 percent, it all disappears. Would you play that game? And would you keep on playing that, double or nothing?

BANKMAN-FRIED: With one caveat. Let me give the caveat first, just to be a party pooper, which is, I’m assuming these are noninteracting universes. Is that right? Because to the extent they’re in the same universe, then maybe duplicating doesn’t actually double the value because maybe they would have colonized the other one anyway, eventually.

COWEN: But holding all that constant, you’re actually getting two Earths, but you’re risking a 49 percent chance of it all disappearing.

BANKMAN-FRIED: Again, I feel compelled to say caveats here, like, “How do you really know that’s what’s happening?” Blah, blah, blah, whatever. But that aside, take the pure hypothetical.

COWEN: Then you keep on playing the game. So, what’s the chance we’re left with anything? Don’t I just St. Petersburg paradox you into nonexistence?

BANKMAN-FRIED: Well, not necessarily. Maybe you St. Petersburg paradox into an enormously valuable existence. That’s the other option.

If you ask most people, including, I suspect, many EAs, whether they would like a theoretically-omnipotent SBF to do this, they would probably say no. How has the question of “how can we do the most good?” been answered by “it’s desirable to have a vanishingly small probability of an astronomical number of people if the expected value calculations work out”? It is absurd on its face: perhaps even as absurd as tiling the universe with paperclips.
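To make the arithmetic behind this worry concrete, here is a rough sketch of Cowen’s repeated double-or-nothing game. It uses the 51/49 odds from the interview; everything else is my own simplifying assumption: we start with one Earth, rounds are independent, value is linear in the number of Earths, and the function name is made up for illustration.

```python
from fractions import Fraction

# Probability of doubling on each play, taken from the interview's hypothetical.
P_WIN = Fraction(51, 100)


def repeated_double_or_nothing(rounds, p_win=P_WIN):
    """Return (chance anything is left, expected number of Earths).

    Assumptions (mine, not the interview's): start with 1 Earth, rounds are
    independent, and value is linear in the number of Earths.
    """
    # You only have anything left if you win every single round.
    survival = p_win ** rounds
    # Each round multiplies expected holdings by 2 * p_win (here 1.02).
    expected_earths = (2 * p_win) ** rounds
    return survival, expected_earths


if __name__ == "__main__":
    for n in (1, 10, 100):
        survival, expected = repeated_double_or_nothing(n)
        print(
            f"rounds={n:>3}  "
            f"P(anything left)={float(survival):.3g}  "
            f"E[Earths]={float(expected):.3g}"
        )
```

Under these assumptions the expected number of Earths grows by 2% per round while the chance that anything is left shrinks by 49% per round. After 10 rounds the survival probability is already about 0.12%, even though the expected value has gone up. That gap is the whole tension: a risk-neutral expected value maximizer keeps playing, and almost every world ends up empty.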

Currently, SBF has no magic button. Instead, he is working mainly on projects that reduce existential risk. This is firmly in the territory of “trying to do as much good as possible” in that an existential catastrophe seems plausibly very very bad under multiple proxies for “the good.” In-distribution, I am happy with strong longtermism.[1]

However, say we have a distributional shift, and somebody gets a chance to push the metaphorical button. Can we be so confident that the proxy objective set forth by longtermism is really robust to gaming? Can we unequivocally claim that effective altruism does not give rise to misaligned mesa-optimizers?

I don’t think we can.

How can we rein in the mesa-objective?

We don’t have a failsafe way of doing this, just as we don’t with AI. However, there are some ideas that seem like they could work well:

  • Oversight by other strong agents: Having organizations and people specifically focusing on trying to do the most good, that oversee what others in EA are doing and push back against fanatical attempts to game various proxies for good. This is a good use for red teaming.

  • Value clarification: Continued attempts to philosophically interrogate the proxies for good that we care about and determine how to make them gradually less flawed. For instance, investigations of flaws with expected value and longtermism in general as a theory of action.

Just as EA exhibits every one of the hallmarks of an optimizer, it also has already been working on the solutions. In fact, pushback against fanaticism and work on value clarification seem central to the EA movement. So perhaps we have a chance to rein in mesa-objectives.

Implication: strengthen “question EA” as a watchdog?

Effective altruism, defined as a “question” and community rather than a creed, appears pretty good at poking at the problems with fanatical longtermism while still allowing it to be a very useful proxy for doing good. However, this could be threatened by longtermist-specific community building efforts that specifically attempt to get people to become longtermists. By this I don’t mean efforts pointed at the reduction of specific existential risks, or a vague concern about the importance of the future, but rather efforts at spreading strong longtermism in general.

If we want EA to continue to serve as a powerful value clarification and oversight agent, we are going to need to make sure it is strong and resists efforts by subagents to proxy game. This requires question EA to be stronger than, or at least evenly matched with, its subagents.

EA is amorphous and less clearly defined than many possible subgoals, and some will say this is a fault. When considered in the context of misaligned optimizers, I think it is precisely that messiness that is essential.

  1. ^ Reminder: this was written prior to the FTX collapse. It seems fairly clear at this point that many of SBF’s “projects” were not really in the territory of “doing as much good as possible.”