Integrity for consequentialists

(Cross-posted from the sideways view.)

For most people I don’t think it’s important to have a really precise definition of integrity. But if you really want to go all-in on consequentialism then I think it’s useful. Otherwise you risk being stuck with a flavor of consequentialism that is either short-sighted or terminally timid.


I aspire to make decisions in a pretty simple way. I think about the consequences of each possible action and decide how much I like them; then I select the action whose consequences I like best.

To make decisions with integrity, I make one change: when I imagine picking an action, I pretend that picking it causes everyone to know that I am the kind of person who picks that option.

If I’m considering breaking a promise to you, and I am tallying up the costs and benefits, I consider the additional cost of you having known that I would break the promise under these conditions. If I made a promise to you, it’s usually because I wanted you to believe that I would keep it. So you knowing that I wouldn’t keep the promise is usually a cost, often a very large one.

Optimal summary of this post.

If I’m considering sharing a secret you told me, and I am tallying up the costs and benefits, I consider the additional cost of you having known that I would share this secret. In many cases, that would mean that you wouldn’t have shared it with me—a cost which is usually larger than whatever benefit I might gain from sharing it now.

If I’m considering having a friend’s back, or deciding whether to be mean, or thinking about what exactly counts as “betrayal,” I’m doing the same calculus. (In practice there are many cases where I am pathologically unable to be really mean. One motivation for being really precise about integrity is recovering the ability to engage in normal levels of being a jerk when it’s actually a good idea.)

This is a weird kind of effect, since it goes backwards in time and it may contradict what I’ve actually seen. If I know that you decided to share the secret with me, what does it mean to imagine my decision causing you not to have shared it?

It just means that I imagine the counterfactual where you didn’t share the secret, and I think about just how bad that would have been—making the decision as if I did not yet know whether you would share it or not.

I find the ideal of integrity very viscerally compelling, significantly moreso than other abstract beliefs or principles that I often act on.


This can get pretty confusing, and at the end of the day this simple statement is just an approximation. I could run through a lot of confusing examples and maybe sometime I should, but this post isn’t the place for that.

I’m not going to use some complicated reasoning to explain why screwing you over is consistent with integrity, I am just going to be straightforward. I think “being straightforward” is basically what you get if you do the complicated reasoning right. You can believe that or not, but one consequence of integrity is that I’m not going to try to mislead you about it. Another consequence is that when I’m dealing with you, I’m going to interpret integrity like I want you to think that I interpret it.

Integrity doesn’t mean merely keeping my word. To the extent I want to interact with you I will be the kind of person you will be predictably glad to have interacted with. To that end, I am happy to do nice things that have no direct good consequences for me. I am hesitant to be vengeful; but if I think you’ve wronged me because you thought it would have no bad consequences for you, I am willing to do malicious things that have no direct good consequences for me.

On the flip side, integrity does not mean that I always keep my word. If you ask me a question that I don’t want to answer, and me saying “I don’t think I should answer that” would itself reveal information that I don’t want to reveal, then I will probably lie. If I say I will do something then I will try to do it, but it just gets tallied up like any other cost or benefit, it’s not a hard-and-fast rule. None of these cases are going to feel like gotchas; they are easy to predict given my definition of integrity, and I think they are in line with common-sense intuitions about being basically good.

Some examples where things get more complicated: if we were trying to think of the same number between 1 and 20, I wouldn’t assume that we are going to win because by choosing 17 I cause you to know that I’m the kind of person who picks 17. And if you invade Latvia I’m not going to bomb Moscow, assuming that by being arbitrarily vindictive I guarantee your non-aggression. If you want to figure out what I’d do in these cases, think UDT + the arguments in the rest of this post + a reasonable account of logical uncertainty. Or just ask. Sometimes the answer in fact depends on open philosophical questions. But while I find that integrity comes up surprisingly often, really hard decision-theoretic cases come up about as rarely as you’d expect.

A convenient thing about this form of integrity is that it basically means behaving in the way that I’d want to claim to behave in this blog post. If you ask me “doesn’t this imply that you would do X, which you only refrained from writing down because it would reflect poorly on you?” then you’ve answered your own question.


Why would I do this? At face value it may look a bit weird. People’s expectations about me aren’t shaped by a magical retrocausal influence from my future decision. Instead they are shaped by a messy basket of factors:

  • Their past experiences with me.

  • Their past experiences with other similar people.

  • My reputation.

  • Abstract reasoning about what I might do.

  • Attempts to “read” my character and intentions from body language, things I say, and other intuitive cues.

  • (And so on.)

In some sense, the total “influence” of these factors must add up to 100%.

I think that basically all of these factors give reasons to behave with integrity:

  • My decision is going to have a causal influence on what you think of me.

  • My decision is going to have a causal influence on what you think of other similar people. I want to be nice to those people. But also my decision is correlated with their decisions (moreso the more they are like me) and I want them to be nice to me.

  • My decision is going to have a direct effect on my reputation.

  • My decision has logical consequences on your reasoning about my decision. After all, I am running a certain kind of algorithm and you have some ability to imperfectly simulate that algorithm.

  • To the extent that your attempts to infer my character or intention are unbiased, being the kind of person who will decide in a particular way will actually cause you to believe I am that kind of person.

  • (And so on.)

The strength of each of those considerations depends on how significant each factor was in determining their views about me, and that will vary wildly from person to person and case to case. But if the total influence of all of these factors is really 100%, then just replacing them all with a magical retrocausal influence is going to result in basically the same decision.

Some of these considerations are only relevant because I make decisions using UDT rather than causal decision theory. I think this is the right way to make decisions (or at least the way that you should decide to make decisions), but your views my vary. At any rate, it’s the way that I make decisions, which is all that I’m describing here.


What about a really extreme case, where definitely no one will ever learn what I did, and where they don’t know anything about me, and where they’ve never interacted with me or anyone similar to me before? In that case, should I go back to being a consequentialist jerk?

There is a temptation to reject this kind of crazy thought experiment—there are never literally zero causal effects. But like most thought experiments, it is intended to explore an extreme point in the space of possibilities:


Of course we don’t usually encounter these extreme cases; most of our decisions sit somewhere in between. The extreme cases are mostly interesting to the extent that realistic situations are in between them and we can usefully interpolate.

For example, you might think that the picture looks something like this:


On this perspective, if I would be a jerk when definitely for sure no one will know then presumably I am at least a little bit of a jerk when it sure seems like no one will know.

But actually I don’t think the graph looks like this.

Suppose that Alice and Bob interact, and Alice has either a 50% or 5% chance of detecting Bob’s jerk-like behavior. In either case, if she detects bad behavior she is going to make an update about Bob’s characteristics. But there are several reasons to expect the 5% chance will have a 10x larger update if it actually happens:

  • If Alice is attempting to impose incentives to elicit pro-social behavior from Bob, then the size of the disincentive needs to be 10x larger. This effect is tempered somewhat if imposing twice as large a cost is more than twice as costly for Alice, but still we expect a significant compensating factor.

  • For whatever reference class Alice is averaging over (her experiences with Bob, her experiences with people like Bob, other people’s experiences with Bob...) Alice has 1/​10th as much data about cases with a 5% chance of discovery, and so (once the total number of data points in the class is reasonably large) each data point has nearly 10x as much influence.

  • In general, I think that people are especially suspicious of people cheating when they probably won’t get caught (and consider it more serious evidence about “real” character), in a way that helps compensate for whatever gaps exist in the last two points.

In reality, I think the graph is closer to this:


Our original thought experiment is an extremely special case, and the behavior changes rapidly as soon as we move a little bit away from it.

At any rate, these considerations significantly limit the applicability of intuitions from pathological scenarios, and tend to push optimal behavior closer to behaving with integrity.

This effect is especially pronounced when there are many possible channels through which my behavior can effect others’ judgments, since then a crazy extreme case must be extreme with respect to every one of these indicators: my behavior must be unobservable, the relevant people must have no ability to infer my behavior from tells in advance, they must know nothing about the algorithm I am running, and so on.


Integrity has one more large advantage: it is often very efficient. Being able to make commitments is useful, as a precondition for most kinds of positive-sum trade. Being able to realize positive-sum trades, without needing to make explicit commitments, is even more useful. (On the revenge side things are a bit more complicated, and I’m only really keen to be vengeful when the behavior was socially inefficient in addition to being bad for my values.)

I’m generally keen to find efficient ways to do good for those around me. For one, I care about the people around me. For two, I feel pretty optimistic that if I create value, some of it will flow back to me. For three, I want to be the kind of person who is good to be around.

So if the optimal level of integrity from a social perspective is 100%, but from my personal perspective would be something close to 100%, I am more than happy to just go with 100%. I think this is probably one of the most cost-effective ways I can sacrifice a (tiny) bit of value in order to help those around me.

On top of that:

  • Integrity is most effective when it is straightforward rather than conditional.

  • “Behave with integrity” is a whole lot simpler (computationally and psychologically) than executing a complicated calculation to decide exactly when you can skimp.

  • Humans have a bunch of emotional responses that seem designed to implement integrity—e.g. vengefulness or a desire to behave honorably—and I find that behaving with integrity also ticks those boxes.

After putting all of this together, I feel like the calculus is pretty straightforward. So I usually don’t think about it, and just (aspire to) make decisions with integrity.


Many consequentialists claim to adopt firm rules like “my word is inviolable” and then justify those rules on consequentialist grounds. But I think on the one hand that approach is too demanding—the people I know who take promises most seriously basically never make them—and on the other it does not go far enough—someone bound by the literal content of their word is only a marginally more useful ally than someone with no scruples at all.

Personally, I get a lot of benefit from having clear definitions; I feel like the operationalization of integrity in this post has worked pretty well, and much better than the deontological constraints it replaced. That said, I’m always interested in adopting something better, and would love to hear pushback or arguments for alternative norms.