This post is going to mostly be propaganda for Thompson sampling. However, the presentation is quite different from the standard presentation. I will be working within a toy model, but I think some of the lessons will generalize. I end with some discussion of fairness and exploitation.

A Exploration/Exploitation Toy Model

Imagine you are interacting with the world over time. We are going to make a couple substantial assumptions. Firstly, we assume that you know what you are trying to optimize. Secondly, we assume that you get high quality feedback on the thing you are trying to optimize. In the spirit of the online learning community, we will be referring to your goal as “reward” rather than “utility.”

On day $t$ , you choose an action, $a_{t} \in A$ , and then the environment responds with an observation $e_{t} \in E$ . Both you and the environment may respond randomly, and may react to the entire history of what has happened. (If you want the environment to act first, you could just make $a_{0}$ not matter.) We assume we have some fixed bounded reward function $r : E \to R$ , that you know. (If you want reward to be more of a function of the history, you can just include more history information in $e_{t}$ .)

We are going to be imagining agents that are myopically choosing $a_{t}$ , so as to maximize $r (e_{t})$ . The choice to do this, rather than e.g. maximizing the total reward of the next $m$ time steps, is mostly for simplicity. However, the choice not to have preferences that are over the entire future is basically the assumption that you get high quality feedback on what we are trying to optimize, which is a substantial philosophical assumption.

So, the environment can be thought of as a function, that takes in a history, $(a_{0} e_{0}) (a_{1} e_{1}) \dots (a_{t - 1} e_{t - 1}) a_{t}$ , and returns a probability distribution on $e_{t}$ which we then sample from. If the agent knew what the environment was, the agent would just choose the $a_{t}$ each round which maximizes the the expectation of $r (e_{t})$ . However, we want to model an agent that is uncertain about what environment it is interacting with, and learning over time.

Let us say that the agent has some probability distribution over possible environments. The set $H$ (for hypotheses) of possible environments is the set of functions that take in a string of the form $(a_{0} e_{0}) (a_{1} e_{1}) \dots (a_{t - 1} e_{t - 1}) a_{t}$ , and output a distribution on $E$ . We are given a distribution $μ$ on $H$ . I will discuss three different decision procedures to get an action out of this distribution $μ$ , which I will call AIXI, plurality, and Thompson sampling.

AIXI

The thing I am calling the AIXI decision procedure (not to be confused with actual AIXI proposal, which among other things assumes that $μ$ is the universal prior) works as follows.

$a_{t} (O) = {argmax}_{p \in Δ A} E_{h \sim (μ | O)} E_{a_{t} \sim p} E_{e_{t} \sim h (O, a_{t})} r (e_{t})$ ,

where $O = a_{0} e_{0} \dots a_{t - 1} e_{t - 1}$ represents the observations on previous days, and $(μ | O)$ represents the posterior distribution on hypotheses you get after conditioning on $O$ .

The above definition is basically just saying “maximize the expected reward in the next round”, but is separating the expectation up into three parts. The inner most expectation is the expected reward of a given hypothesis and an action. The middle expectation is the expectation over what action the agent ends up taking. The outer expectation is the expectation over various different hypotheses about what kind of environment the agent is interacting with. In the outer expectation, the agent is conditioning on all of its observations so far, so the probability of each hypothesis is scaled by the probability that hypothesis would have assigned to the past environmental behavior. (The agent is not specifically only updating for the outer expectation, it just does not have any effect on the inner two expectations.)

You may notice there are two sources of uncertainty over the environment. There is the uncertainty in $μ$ over various hypotheses, and there is also uncertainty within each hypothesis coming from the fact that the hypotheses are themselves probabilistic. We could combine these into one, and just have all hypotheses be deterministic, (which is what the original AIXI does). This would have no effect on the AIXI proposal, but it would have an effect on the next two proposals.

Further, you may notice, that there is no incentive to randomize here, so we could get rid of another source of uncertainty by only considering deterministic actions. Again, this will have no effect here, but will have an effect in our final proposal.

Plurality

The thing I am calling the plurality decision procedure works as follows. Let $O$ and $(μ | O)$ be defined as above. Given a hypothesis $h \in H$ , let $m (h, O) = {argmax}_{a \in A} E_{e_{t} \sim h (O, a)} (r (e_{t}))$ . In other words, $m (h, O)$ is the action that $h$ believes gives the highest expected utility, given observation $O$ .

$a_{t} (O) = {argmax}_{p \in Δ A} E_{h \sim (μ | O)} P_{a_{t} \sim p} (a_{t} = m (h, O))$ .

This can be thought of as putting your choice of action up to a vote. Each hypothesis chooses its favorite action. The expectation and the probability in the above formula combine into one big probability over hypotheses and actions, so you are essentially choosing so as to maximize the probability that a randomly chosen hypothesis endorses your randomly chosen action as the best action.

Like last time, there is no incentive to randomize, so we can actually view this as choosing an action that maximizes the probability that a randomly chosen hypothesis thinks that is the best action.

Unlike with the AIXI proposal, it now does matter where we draw the line between uncertainty within the hypothesis, and uncertainty across different hypotheses, because uncertainty within hypotheses essentially forms voting blocs.

The main benefit of this proposal over the last, (more elegant) proposal, is that it has some resistance to Pascal’s mugging situations, because a hypothesis still only gets to vote proportional to its probability, regardless of how much utility differential it promises you.

Thompson Sampling

Let $O$ , $(μ | O)$ , and $m (h, O)$ be defined as above. The thing I am calling the Thompson sampling proposal works as follows:

$a_{t} (O) = {argmax}_{p \in Δ A} G_{h \sim (μ | O)} P_{a_{t} \sim p} (a_{t} = m (h, O))$ ,

where $G$ is the geometric expectation, and is equivalent to the exponential of the expectation of the logarithm. (The only difference between this and the plurality proposal is replacing one arithmetic expectation with a geometric expectation.)

In this proposal, there is incentive to randomize. Indeed, you will end up choosing each action with probability equal to the probability that a randomly chosen hypothesis thinks that that action is the best action (modulo confusing stuff related to ties, which we can ignore). If you have been following my recent posts, you should not be surprised that proportional representation comes out of taking a geometric expectation.

In a sentence, Thompson sampling can be thought of as “Geometrically maximize, with respect to your posterior on hypotheses, the probability that you choose the best action.”

Once again, this is not the standard presentation of Thompson sampling. Indeed, Thompson sampling is not normally presented in a way involving a geometric expectation (or even an expected logarithm) at all. It is instead presented as probability matching, but the probability matching can be reformulated as geometric maximization.

Density Zero Exploration

Let us assume the true hypothesis $h_{t r u e}$ is given positive probability in $μ$ . We would like it to be the case that we are able to learn this fact, and act accordingly. In particular, we would like it if the limit as $t$ goes to infinity of the probability that $a_{t} = m (h_{t r u e}, O)$ converges to 1. Thompson sampling has this property. AIXI and plurality do not.

To see this, imagine that there is a hypothesis, $h_{w o l v e s}$ that is almost true, but thinks that the action $a$ will cause you to get eaten by wolves, and thus get very low reward.

AIXI and plurality will, if their priors assign enough probability to the $h_{w o l v e s}$ , never choose $a$ , and never update, because $h_{w o l v e s}$ will never make any verifiably wrong predictions. If the true hypothesis would have you take action $a$ some of the time, you will never learn this, and never take action $a$ .

The standard fix for this is $ε$ exploration. If you choose a random action with probability $ε$ , and otherwise use AIXI or plurality, you will successfully learn there are no wolves, and take the right action (with probability $1 - ε$ ). But Thompson sampling takes the right action with probability converging to 1!

You might think you can just have your exploration change over time and send $ε_{t}$ to 0 slowly, like by taking $ε = 1 / t$ . But nope, this doesn’t work. If $h_{w o l v e s}$ only says you will get eaten by wolves on days that are a power of 2, it will with positive probability never make a wrong prediction, and still control your action (using AIXI or plurality) some of the time forever. Importantly, it isn’t just controlling your action at random times. Maybe the days that are a power of 2 are the only ones that really matter.

It is actually hard to explore enough to not believe $h_{w o l v e s}$ without exploring with positive density. (By positive density, I mean exploring such that the limit as $t$ goes to infinity of the proportion of previous days you explored does not go to 0.) However, Thompson sampling pulls this off, it part by essentially listening to the hypotheses about when is the best time to explore, and not exploring when all the hypotheses agree.

Thompson sampling is exploring just enough to learn, which is to say it is asymptotically exploring almost never! AIXI and plurality are not exploring at all.

(As a quick sketch of an argument for why Thompson sampling has the right behavior here, observe that every time you randomly take the wrong action, every hypothesis that gave that action higher expected utility than the best action will (on average) scale down its probability relative to $h_{t r u e}$ , If the best action gives an amount of reward bounded away from 0 more than all other actions, we can bound how much measure each hypothesis that disagrees with $h_{t r u e}$ must lose. Any hypothesis that disagrees with $h_{t r u e}$ on what is the best action infinitely often will lose all its relative measure.)

The AM-GM Boundary

In Thompson sampling, we have two levels of uncertainty over the environment. There is uncertainty between hypotheses, and there is uncertainty within hypotheses. The uncertainty within hypotheses is combined arithmetically (with $E$ ) to select a favorite action (distribution) for each hypothesis, and then the uncertainty between hypotheses is combined geometrically (with $G$ ) to select a favorite action distribution overall. This structure is actually important for exploring correctly here.

Similarly, in Nash bargaining, uncertainty within individuals is resolved arithmetically (to get each individual’s expected utilities), and uncertainty between individuals is resolved geometrically.

I call this distinction the AM-GM boundary (arithmetic mean/geometric mean boundary). There is a large space of stuff, and that stuff is combined using a geometric mean by default, but there are small clusters in which the stuff is combined using an arithmetic mean.

$G_{across clusters} E_{within each cluster} stuff$

There is a sort of boundary around the arithmetic mean clusters, where the outer geometric mean treats them as single units.

If you think of $G$ as enforcing proportional representation or fairness, then the arithmetic clusters are places where we say, “There is one person there. Its subparts are allowed to exploit each other.”

Exploiting here is referring to maximizing the total quantity of stuff with no regard to fairness or egalitarianism, or whether everyone in the cluster gets some.

The AIXI proposal above illustrates the failure mode of making the epistemic arithmetic clusters too big. There is one big arithmetic cluster in that proposal. The concept of a utility monster instrumentally demonstrates the same thing.

On the other hand, if you make your arithmetic clusters too small, you could end up taking actions in proportion to their utilities, and effectively never maximizing at all.

I think where you draw this boundary depends on context. There are places where I endorse my subagents exploiting each other, and other places where I do not. Similarly for my family, my larger tribe, my country, my species, and my Everett branch. There is a hard question here related to personal autonomy. I am only trying to give a language to talk about what is going on.

Similarly to talking about expanding circles of moral concern, we can talk about expanding circles of exploitation. Maybe eventually we want our circle of exploitation to be large. Maybe we don’t. Maybe even if we want it to be large eventually, we want to keep it small for now while we are still exploring and learning.

Geometric Exploration, Arithmetic Exploitation

A Exploration/​Exploitation Toy Model