I have been posting a lot on instrumental geometric rationality, with Nash bargaining, Kelly betting, and Thompson sampling. I feel some duty to also post about epistemic geometric rationality, especially since information theory is filled with geometric maximization. The problem is that epistemic geometric rationality is kind of obvious.

A Silly Toy Model

Let’s say you have some prior beliefs $P_{0}$ at time 0, and you have to choose some new beliefs $P_{1}$ for time 1. $P_{0}, P_{1} \in Δ W$ are both distributions over worlds. For now, let’s say you don’t make any observations. What should your new beliefs be?

The answer is obvious: you should set $P_{1} = P_{0}$. However, if we want to phrase this as a geometric maximization, we can say

$P_{1} = {argmax}_{P \in Δ W} G_{w \sim P_{0}} P (w)$ .

This is saying: imagine the true world is sampled according to $P_{0}$, and geometrically maximize the probability you assign to the true world. I feel silly recommending this because it is much more complicated than $P_{1} = P_{0}$. However, it gives us a new lens that we can use to generalize and consider alternatives.
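To make the claim concrete, here is a minimal numerical sketch. It assumes $G$ is the geometric expectation, $G_{w \sim P_{0}} P (w) = \exp (E_{w \sim P_{0}} \log P (w))$, so maximizing it is the same as maximizing $\sum_{w} P_{0} (w) \log P (w)$. The three-world prior and the helper names are made up for illustration.

```python
import math
import random

def log_geometric_expectation(p0, p):
    """Return log G_{w ~ p0} p(w), i.e. sum_w p0(w) * log p(w)."""
    total = 0.0
    for world, weight in p0.items():
        if weight == 0.0:
            continue  # worlds with zero prior weight do not affect the product
        if p[world] == 0.0:
            return -math.inf  # the geometric expectation itself is 0
        total += weight * math.log(p[world])
    return total

def random_distribution(worlds, rng):
    """Draw an arbitrary strictly positive distribution over the worlds."""
    raw = [rng.random() + 1e-9 for _ in worlds]
    norm = sum(raw)
    return {w: r / norm for w, r in zip(worlds, raw)}

# Hypothetical prior over three worlds.
p0 = {"a": 0.5, "b": 0.3, "c": 0.2}
rng = random.Random(0)

score_of_prior = log_geometric_expectation(p0, p0)
candidates = [random_distribution(list(p0), rng) for _ in range(1000)]

# No candidate should beat the choice P = P0 (small tolerance for rounding).
beaten = all(
    log_geometric_expectation(p0, q) <= score_of_prior + 1e-12
    for q in candidates
)
print(beaten)
```

That no candidate does better than $P = P_{0}$ is just Gibbs' inequality. The random search is only a sanity check, not a proof.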

For example, we can consider the corresponding arithmetic maximization,

$P_{1} = {argmax}_{P \in Δ W} E_{w \sim P_{0}} P (w)$ .

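What would happen if we were to do this? The objective $\sum_{w} P_{0} (w) P (w)$ is linear in $P$, so it is maximized at a vertex of the simplex. We would find the world with the highest probability under $P_{0}$, and put all our probability mass on that world. We would anticipate that world, and ignore all the others.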

This is a stupid way to manage our anticipation. Nobody is going around saying we should arithmetically maximize the probability we assign to the true world. (However, people are going around saying we should arithmetically maximize average utility, or arithmetically maximize our wealth.)

Not only does arithmetic maximization put all our anticipatory eggs in one basket, it also opens us up to all sorts of internal politics. If we take a world and add some extra features to it to split it up into multiple different worlds, this changes the evaluation of which world is most efficient to believe in. This illustrates two of the biggest virtues of geometric rationality: proportional representation and gerrymander resistance.
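Here is a small sketch of both failure modes. The worlds and numbers are invented for illustration, and the two functions simply return the known maximizers (a point mass on the most probable world for the arithmetic rule, and the prior itself for the geometric rule) rather than running an optimizer.

```python
def arithmetic_argmax(p0):
    """Point mass on a most probable world: maximizes sum_w p0(w) * p(w)."""
    best = max(p0, key=p0.get)
    return {w: (1.0 if w == best else 0.0) for w in p0}

def geometric_argmax(p0):
    """Geometric maximization just returns the prior."""
    return dict(p0)

def prob_rain(p):
    """Total probability assigned to worlds where it rains."""
    return sum(mass for world, mass in p.items() if world.startswith("rain"))

# Hypothetical beliefs about tomorrow's weather.
coarse = {"rain": 0.4, "sun": 0.35, "snow": 0.25}
# The same beliefs, with "rain" split in two by an irrelevant coin flip.
fine = {"rain+heads": 0.2, "rain+tails": 0.2, "sun": 0.35, "snow": 0.25}

print(prob_rain(arithmetic_argmax(coarse)), prob_rain(arithmetic_argmax(fine)))
print(prob_rain(geometric_argmax(coarse)), prob_rain(geometric_argmax(fine)))
```

Under the arithmetic rule, the probability of rain swings from 1 to 0 just because rain was described in more detail. Under the geometric rule it stays at 0.4 either way.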

A Slightly Less Silly Toy Model

Now, let’s assume we make some observation $X \subseteq W$, so we know that the real world is in $X$. We will restrict our attention to probability distributions that assign probability 1 to $X$. However, if we try to set

$P_{1} = {argmax}_{P \in Δ W, P (X) = 1} G_{w \sim P_{0}} P (w)$ ,

we run into a problem. We are geometrically maximizing the same quantity as before, subject to the constraint that $P (X) = 1$, but the problem is that the geometric expectation is 0 no matter what we do, because $P (w) = 0$ for any $w \notin X$.

However, this is easy to fix. Instead of requiring that $P (X) = 1$, we can require that $P (X) \geq b$, and take a limit as $b$ approaches 1 from below. Thus, we get

$P_{1} = {lim}_{b \to 1^{-}} {argmax}_{P \in Δ W, P (X) \geq b} G_{w \sim P_{0}} P (w)$ .

It turns out this limit exists, and $P_{1}$ corresponds exactly to Bayesian updating on the event $X$. Again, this is more complicated than just defining Bayesian updating like a normal person, so I am being a little obnoxious by treating this as an application of geometric rationality.
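The limit can be checked numerically. The sketch below relies on a closed form that I am assuming rather than quoting: for $0 < P_{0} (X) < 1$ and $b < 1$, the constrained maximizer rescales $P_{0}$ inside $X$ to total mass $\max (b, P_{0} (X))$ and rescales $P_{0}$ outside $X$ to the remaining mass. This follows from splitting $\sum_{w} P_{0} (w) \log P (w)$ into a term for the mass on $X$ and terms for the conditional distributions. The four-world prior is invented for illustration.

```python
def constrained_geometric_argmax(p0, event, b):
    """Maximizer of sum_w p0(w) * log p(w) subject to p(event) >= b.

    Assumed closed form, valid when 0 < p0(event) < 1 and b < 1:
    keep the prior's proportions inside and outside the event, and
    give the event total mass max(b, p0(event)).
    """
    prior_mass = sum(p0[w] for w in event)
    mass = max(b, prior_mass)
    return {
        w: (p0[w] * mass / prior_mass if w in event
            else p0[w] * (1.0 - mass) / (1.0 - prior_mass))
        for w in p0
    }

def bayes_update(p0, event):
    """Ordinary conditioning of p0 on the event."""
    prior_mass = sum(p0[w] for w in event)
    return {w: (p0[w] / prior_mass if w in event else 0.0) for w in p0}

# Hypothetical prior; the observation X is the set {"a", "b"}.
p0 = {"a": 0.1, "b": 0.4, "c": 0.3, "d": 0.2}
x = {"a", "b"}
posterior = bayes_update(p0, x)

for b in (0.9, 0.99, 0.999999):
    p1 = constrained_geometric_argmax(p0, x, b)
    gap = max(abs(p1[w] - posterior[w]) for w in p0)
    print(b, gap)  # the gap shrinks toward 0 as b approaches 1
```

Note that the geometric expectation itself still goes to 0 as $b \to 1$. It is the maximizing distribution that converges, not the value it achieves.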

However, I think you can see the connection to geometric rationality in Bayesian updating directly. When I assign probability 1 to an event $X$ that I used to assign probability $\frac{1}{2}$ , I have all this extra probability mass in $X$ worlds. How should I distribute this probability mass across the $X$ worlds? Bayesian updating recommends that I geometrically scale all the probabilities up by the same amount. Why are we scaling probabilities up rather than e.g. arithmetically increasing all the probabilities by the same amount?

Because we actually care about our probabilities in a geometric (multiplicative) way!

It is much worse to decrease your probability of the true world from .11 to .01 than it is to decrease your probability of the true world from .9 to .8. This is because the former is geometrically a much larger decrease: it divides the probability by 11, while the latter divides it by only 1.125.

Generalized Updates

This model stops being silly when we start using it for something new, where the geometric maximization actually helps. For that, we can consider updating on stranger things. Imagine you want to update on the fact that $X$ and $Y$ are independent. How do you observe this? Well, maybe you looked into the mechanism of how the world is generated and saw that there is no communication between the thing that generates $X$ and the thing that generates $Y$. Or maybe you don’t actually observe the fact, but you want to make your probability distribution simpler and easier to compress by separating out the $X$ and $Y$ parts. Or maybe $X$ represents your choice, and $Y$ represents your parents, and you want to implement a causal counterfactual via conditioning. Anyway, we have a proposal for updating on this:

$P_{1} = {argmax}_{P \in Δ W, P (X) P (Y) = P (X \land Y)} G_{w \sim P_{0}} P (w)$ .

In general, we can enforce any restrictions we want on our new probability distribution, and geometric maximization gives us a sane way to update.
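As a sketch of the independence update, take the simplest case where a world is just a pair of truth values for $X$ and $Y$. Under the assumption that $G$ is $\exp (E \log \cdot)$, the objective for an independent $P$ splits into one term for $X$ and one for $Y$, so the maximizer should be the product of the marginals of $P_{0}$. The prior below is invented, and the random search is only a sanity check.

```python
import math
import random

VALUES = (True, False)

def independence_update(p0):
    """Product of the marginals of p0, a dict keyed by (x, y) truth values."""
    p_x = {x: sum(m for (a, _), m in p0.items() if a == x) for x in VALUES}
    p_y = {y: sum(m for (_, c), m in p0.items() if c == y) for y in VALUES}
    return {(x, y): p_x[x] * p_y[y] for x in VALUES for y in VALUES}

def log_score(p0, p):
    """log G_{w ~ p0} p(w) = sum_w p0(w) * log p(w)."""
    return sum(m * math.log(p[w]) for w, m in p0.items() if m > 0)

def random_independent(rng):
    """An arbitrary distribution under which X and Y are independent."""
    a, b = rng.uniform(0.01, 0.99), rng.uniform(0.01, 0.99)
    return {(x, y): (a if x else 1 - a) * (b if y else 1 - b)
            for x in VALUES for y in VALUES}

# Hypothetical prior in which X and Y are correlated.
p0 = {(True, True): 0.4, (True, False): 0.1,
      (False, True): 0.2, (False, False): 0.3}
p1 = independence_update(p0)

rng = random.Random(0)
no_better = all(
    log_score(p0, random_independent(rng)) <= log_score(p0, p1) + 1e-12
    for _ in range(1000)
)
print(p1[(True, True)], no_better)
```

The update keeps $P (X)$ and $P (Y)$ where they were and throws away only the correlation between them.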

Aggregating Beliefs

Now let’s say I have a bunch of hypotheses about the world. $H$ is my set of hypotheses, and each hypothesis is a distribution on worlds, $H \subseteq Δ W$. I also have a distribution $P \in Δ H$, representing my credence on each of these hypotheses. How should I interpret myself as having beliefs about the world, rather than just beliefs about hypotheses?

Well, my beliefs about the world are just ${argmax}_{Q \in Δ W} G_{h \sim P} G_{w \sim h} Q (w)$ . Again, I could have done this in a simpler way, not phrased as geometric maximization. However, I would not be able to do the next step.
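The simpler way is the ordinary mixture $Q (w) = \sum_{h} P (h) h (w)$. Here is a sketch checking that the mixture is what the nested geometric maximization picks out, again assuming $G$ is $\exp (E \log \cdot)$. The two hypotheses, the credences, and the grid search are all made up for illustration.

```python
import math

def aggregate(credence, hypotheses):
    """Mixture of the hypotheses, weighted by credence."""
    worlds = next(iter(hypotheses.values())).keys()
    return {w: sum(credence[h] * hypotheses[h][w] for h in hypotheses)
            for w in worlds}

def nested_log_score(credence, hypotheses, q):
    """log of G_{h ~ credence} G_{w ~ h} q(w)."""
    return sum(
        credence[h] * hyp[w] * math.log(q[w])
        for h, hyp in hypotheses.items()
        for w in hyp
        if hyp[w] > 0
    )

# Two hypothetical hypotheses over two worlds, with equal credence.
hypotheses = {
    "h1": {"a": 0.9, "b": 0.1},
    "h2": {"a": 0.2, "b": 0.8},
}
credence = {"h1": 0.5, "h2": 0.5}

mixture = aggregate(credence, hypotheses)

# Brute-force comparison: every distribution on a grid of step 0.01.
grid = [{"a": k / 100, "b": 1 - k / 100} for k in range(1, 100)]
best_on_grid = max(
    grid, key=lambda q: nested_log_score(credence, hypotheses, q)
)
print(mixture, best_on_grid)
```

Both the mixture and the best grid point put probability of about 0.55 on world `a`.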

Now, let’s consider the following modification: Each hypothesis is no longer a distribution on $W$ , but instead a distribution on some coarser partition of $W$ . Now ${argmax}_{Q \in Δ W} G_{h \sim P} G_{w \sim h} Q (w)$ is still well defined, and other simpler methods of defining aggregation are not. (The argmax might not be a single distribution, but it will be a linear space of distributions, and it will give a unique probability to every event in any of the sigma algebras of the hypotheses.)
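Here is a toy sketch of the coarse case, under my own reading of the formula: when $h$ is a distribution on a partition of $W$, I take $G_{w \sim h} Q (w)$ to mean the geometric expectation of the probability $Q$ assigns to the sampled cell. Worlds are pairs (row, column); one hypothetical hypothesis only has opinions about the row, the other only about the column.

```python
import math

def marginal(q, axis):
    """Probability q assigns to each cell of the row (axis 0) or column (axis 1) partition."""
    cells = {0: 0.0, 1: 0.0}
    for world, mass in q.items():
        cells[world[axis]] += mass
    return cells

def score(credence, hypotheses, q):
    """log of G_{h ~ credence} G_{cell ~ h} q(cell), with hypotheses keyed by axis."""
    total = 0.0
    for axis, h in hypotheses.items():
        q_cells = marginal(q, axis)
        total += credence[axis] * sum(
            h[c] * math.log(q_cells[c]) for c in h if h[c] > 0
        )
    return total

# Hypothesis 0 is a distribution over rows, hypothesis 1 over columns.
hypotheses = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}
credence = {0: 0.5, 1: 0.5}

# Two different distributions on worlds with row marginal (0.7, 0.3)
# and column marginal (0.4, 0.6).
independent = {(i, j): hypotheses[0][i] * hypotheses[1][j]
               for i in (0, 1) for j in (0, 1)}
correlated = {(0, 0): 0.4, (0, 1): 0.3, (1, 0): 0.0, (1, 1): 0.3}
# A distribution that disagrees with both hypotheses.
uniform = {(i, j): 0.25 for i in (0, 1) for j in (0, 1)}

print(score(credence, hypotheses, independent))
print(score(credence, hypotheses, correlated))
print(score(credence, hypotheses, uniform))
```

The first two scores agree and the third is lower. In this example the argmax is the whole family of distributions with the right row and column marginals: it is not a single distribution, but it gives a unique probability to every row event and every column event.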

I have a lot to say about this method of aggregating beliefs. I have spent a lot of time thinking about it over the last year, and think it is quite good. It can be used to track the difference between credence (probability) and confidence (the strength of your belief), and thinking a lot about it has also caused some minor shifts in the way I think about agency. I hope to write about it a bunch soon, but that will be in a different post/sequence.

The Least Controversial Application of Geometric Rationality
