The Natural State is Goodhart
Epistemic Status: Meant to describe a set of beliefs I have about accidental optimization pressures, and to serve as a reference post I can point back to later.
Why do we live in worlds of bureaucracy and Lost Purpose? Because this is the default state of problem-solving, and everything else is an effortful push against Goodharting. Humans are all problem-solving machines, and if you want to experience inner misalignment inside your own brain, just apply anything less than your full attention to a metric you’re trying to push up.
People claim to want things like more legroom, or comfier seats, or better service, or smaller chances of delays and cancellations. But when you actually sit down to book a flight, the results are sorted by price, and if you're not a frequent flier you generally choose the flight with the lowest sticker cost. This leads to a "race to the bottom" amongst airlines to push everything possible out of the sticker price and nickel-and-dime you, thereby causing the cheapest flights to actually be more expensive and worse.
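To make that mechanism concrete, here is a minimal sketch in Python, with made-up airline names and toy numbers purely for illustration: when the sort key is sticker price rather than total cost, the "cheapest" flight is not the cheapest at all.

```python
from dataclasses import dataclass

@dataclass
class Flight:
    name: str
    sticker_price: float  # the number the search results are sorted by
    hidden_fees: float    # bags, seat selection, change fees, etc.

    @property
    def total_cost(self) -> float:
        # What the passenger actually ends up paying.
        return self.sticker_price + self.hidden_fees

# Hypothetical flights with toy numbers, not real data.
flights = [
    Flight("Unbundled Air", sticker_price=89, hidden_fees=95),
    Flight("Full-Service Air", sticker_price=140, hidden_fees=10),
]

by_sticker = min(flights, key=lambda f: f.sticker_price)  # what the proxy picks
by_total = min(flights, key=lambda f: f.total_cost)       # what you actually wanted

print(f"Cheapest by sticker price: {by_sticker.name} (total {by_sticker.total_cost})")
print(f"Cheapest by total cost:    {by_total.name} (total {by_total.total_cost})")
```

Whatever the sort key is, that is where the airlines' optimization pressure goes, which is the whole point: the proxy and the target come apart as soon as someone optimizes the proxy.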
I was talking to a mentor of mine / giving her feedback and trying to work out how best to approach a problem. Sometimes I said things that she found helpful, and she noted these out loud. We then realized this disrupted the conversation too much, so we changed to having her recognize my helpful sentences with a snap. This might have worked well, had I not immediately noticed my brain Goodharting towards extracting her snaps, instead of actually trying to figure out solutions to the problem and saying true things and improving my own models.
There is a point that I'm trying to make here, which I think mostly fails to get made by the current writing on Goodhart's law. It's not just an explanation for the behavior of [people dumber than you]. You, me, all of us are constantly, 24/7, Goodharting towards whatever outcome fits our local incentives.
This becomes even more true for groups of people and organizations. For example, EAG(x)s have a clear failure mode along this dimension. From reading retrospectives (EAGx Berkeley and EAGx Boston), they sure do seem to focus a lot on making meaningful connections and hyping people up about EA ideas and the community, and a lot of the retrospective is about how much people enjoyed EAG. I don’t mean to call EAG out specifically, but instead to highlight a broader point—we’re not a religion trying to spread a specific gospel; we’re a bunch of people trying to figure out how to figure out what’s true, and do things in the world that accomplish our goals. It does sure seem like we’re putting a bunch of optimization pressure into things that don’t really track our final goals, and we should step back and be at least concerned about this fact.
Some parts of the rationality community do a similar thing. I notice a circuit in my own brain that Goodharts towards certain words / ways of speaking because they’re more “rational.” Like, I personally have adopted this language, but actually talking about “priors” and “updates” and appending “or something” to the end of sentences does not make you better at finding the truth. You’re not a better Bayesian reasoner purely because you use words that correspond to Bayesian thinking. (The counterargument here is the Sapir-Whorf hypothesis, which weakens but does not kill this point—I think many of the mannerisms seen as desirable by people in the rationality community and accepted as status or ingroup indicators track something different from truth.)
By default we follow local incentives, and we should be quite careful to step back every once in a while and really, properly make sure that we are optimizing for the right purposes. You should expect the autopilot that runs your brain, or the organizations that you run, or the cultures you are a part of, to follow local incentives and Goodhart towards bad proxies for the things you actually want, unless you push strongly in the other direction.