@Vanessa Kosoy has a nice explanation of Level 4 uncertainties (a.k.a. Knightian uncertainty) in the context of her work on infra-Bayesianism. The following is from her AXRP podcast interview with @DanielFilan (https://axrp.net/episode/2021/03/10/episode-5-infra-bayesianism-vanessa-kosoy.html):
Daniel Filan: Okay. I guess this gets to a question that I have, which is, is the fact that we’re dealing with these convex sets of distributions … because that’s the main idea, and I’m wondering how that lets you deal with non-realizability, because it seems to me that if you have a convex set of probability distributions, in standard Bayesianism, you could just have a mixture distribution over all of that convex set, and you’ll do well on things that are inside your convex set, but you’ll do poorly on things that are outside your convex set. Yeah, can you give me a sense of how … Maybe this isn’t the thing that helps you deal with non-realizability, but if it is, how does it?
Vanessa Kosoy: The thing is, a convex set, you can think of it as some property that you think the world might have, right? Just let’s think of a trivial example. Suppose your world is a sequence of bits, so just an infinite sequence of bits, and one hypothesis you might have about the world is maybe all the even bits are equal to zero. This hypothesis doesn’t tell us anything about odd bits. It’s only a hypothesis about even bits, and it’s very easy to describe it as such a convex set. We just consider all probability distributions that predict that the odd bits will be zero with probability one, and without saying anything at all—the even bits, they can be anything. The behavior there can be anything.
Vanessa Kosoy: Okay, so what happens is, if instead of considering this convex set, you consider some distribution on this convex set, then you always get something which makes concrete predictions about the even bits. You can think about it in terms of computational complexity. All the probability distributions that you can actually work with have bounded computational complexity because you have bounded computational complexity. Therefore, as long as you’re assuming a probability distribution, a specific probability distribution, or it can be a prior over distributions, but that’s just the same thing. You can also average them, get one distribution. It’s like you’re assuming that the world has certain low computational complexity.
Vanessa Kosoy: One way to think of it is that Bayesian agents have a dogmatic belief that the world has low computational complexity. They believe this fact with probability one, because all their hypotheses have low computational complexity. You’re assigning probability one to this fact, and this is a wrong fact, and when you’re assigning probability one to something wrong, then it’s not surprising you run into trouble, right? Even Bayesians know this, but they can’t help it because there’s nothing you can do in Bayesianism to avoid it. With infra-Bayesianism, you can have some properties of the world, some aspects of the world can have low computational complexity, and other aspects of the world can have high complexity, or they can even be uncomputable. With this example with the bits, your hypothesis, it says that the odd bits are zero. The even bits, they can be uncomputable. They can be like the halting oracle or whatever. You’re not trying to have a prior over them because you know that you will fail, or at least you know that you might fail. That’s why you have different hypotheses in your prior.
Perhaps this is a nice explanation for some people with mathematical or statistical knowledge, but alas it goes way over my head.
(Specifically, I get lost here: “We just consider all probability distributions that predict that the odd bits will be zero with probability one, and without saying anything at all—the even bits, they can be anything.”)
(Granted, I now at least think I understand what a convex set is, although I fail to understand its relevance in this conversation.)
Fair point! Sorry it wasn’t the most helpful. Here’s my attempt at explaining a bit more:
Convex sets are just sets in which each point can be expressed as a weighted sum (with nonnegative weights summing to one) of points on the exterior of the set, e.g.:

[Figure: a convex hull drawn around a scattering of points. Source: https://reference.wolfram.com/language/ref/ConvexHullMesh.html]
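To make “weighted sum” concrete, here’s a tiny Python sketch (the triangle and weights are a toy example of my own, not anything from the podcast): any nonnegative weighting of a convex set’s exterior points that sums to one lands inside the set.

```python
import numpy as np

# A convex combination is a weighted sum with nonnegative weights that
# sum to 1. Convex combinations of a set's extreme (exterior) points
# always land inside the set.

extreme_points = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
])  # vertices of a triangle, a simple convex set

weights = np.array([0.2, 0.5, 0.3])  # nonnegative, sums to 1
assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)

point = weights @ extreme_points     # the weighted sum of the vertices
print(point)  # [0.5 0.3] -- a point inside the triangle
```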
In 1D, convex sets are just intervals [a, b], and a convex set of probability distributions over a single binary event corresponds to an interval of probability values, e.g. [0.1, 0.5], which is often called an “imprecise probability”.
Let’s generalize this idea to 2D. There are two events, A and B, which I am uncertain about. If I were really confident, I could say that I think A happens with probability 0.2, and B happens with probability 0.8. But what if I feel so ignorant that I can’t assign a probability to event B at all? That means I think P(B) could be anywhere in [0.0, 1.0], while keeping P(A) = 0.2. So my belief state (P(A), P(B)) lies somewhere on the line segment from (0.2, 0.0) to (0.2, 1.0). A line segment is a convex set.
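As a sketch of how you compute with that segment (the payoff numbers are invented purely for illustration): expected values are linear in P(B), so their extremes over the segment sit at the two endpoints, and any expectation can be bounded with just two evaluations.

```python
# P(A) is pinned at 0.2; P(B) can be anywhere in [0.0, 1.0].
# The belief state is the segment from (0.2, 0.0) to (0.2, 1.0), and any
# quantity linear in the probabilities is extremized at the endpoints.

P_A = 0.2
endpoints = [(P_A, 0.0), (P_A, 1.0)]

def expected_payoff(p_a, p_b):
    """Some payoff that is linear in the two event probabilities."""
    return 10 * p_a + 5 * p_b

values = [expected_payoff(pa, pb) for pa, pb in endpoints]
print("lower expectation:", min(values))  # 2.0 (worst case: P(B) = 0)
print("upper expectation:", max(values))  # 7.0 (best case: P(B) = 1)
```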
You can generalize this notion to infinite dimensions—e.g. for a bit sequence of infinite length, specifying a complete probability distribution would require saying how likely each bit is to equal 1, conditioned on the values of all the other bits. But we could instead assign probabilities only to the odd bits, not the even bits, and that would correspond to a convex set of probability distributions.
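Here’s a toy version of that bit-sequence hypothesis, truncated to the first four bits so everything is finite (the construction is my own illustration, not Vanessa’s formalism): the convex set is all distributions that put probability one on “odd-position bits are zero”, which gives sharp answers about odd bits and total ignorance about even ones.

```python
from itertools import product

# The hypothesis: bits at odd positions (1st, 3rd, ...) are zero.
# Its convex set is every distribution over 4-bit strings putting
# probability 1 on that property; even-position bits are unconstrained.

def satisfies_hypothesis(bits):
    return all(b == 0 for b in bits[0::2])  # 1st and 3rd bits (indices 0, 2)

allowed = [b for b in product([0, 1], repeat=4) if satisfies_hypothesis(b)]
print(allowed)  # 4 strings: odd-position bits fixed, even-position bits free

# The extreme points of the set are the point masses on these strings,
# and an event's lower/upper probability is the min/max over them.
def prob_bounds(event):
    probs = [1.0 if event(bits) else 0.0 for bits in allowed]
    return min(probs), max(probs)

# An event the hypothesis pins down: "the 1st bit is 0".
print(prob_bounds(lambda b: b[0] == 0))  # (1.0, 1.0) -- certain

# An event it stays silent on: "the 2nd bit is 1".
print(prob_bounds(lambda b: b[1] == 1))  # (0.0, 1.0) -- complete ignorance
```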
Hopefully that explains the convex set bit. The other part is why it’s better to use convex sets. Well, one reason is that sometimes we might be unwilling to specify a probability distribution, because we know the true underlying process is uncomputable. This problem arises, for example, when an agent is trying to simulate itself. I* can never perfectly simulate a copy of myself within my mind, even probabilistically, because that leads to infinite regress—this sort of paradox is related to the halting problem and Gödel’s incompleteness theorem.
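For intuition on where the regress comes from, here’s the classic diagonalization sketch behind the halting problem (purely illustrative; `predicts_halts` stands in for a hypothetical perfect predictor, which the construction shows cannot exist):

```python
def make_contrarian(predicts_halts):
    """Given a supposed perfect halting predictor, build a program that
    does the opposite of whatever the predictor says about it."""
    def contrarian():
        if predicts_halts(contrarian):
            while True:      # predicted to halt, so loop forever
                pass
        return "halted"      # predicted to loop, so halt immediately
    return contrarian

# A toy predictor that always answers "halts":
contrarian = make_contrarian(lambda program: True)
# contrarian() would now loop forever, making the prediction wrong. Any
# predictor's verdict feeds back into the very behavior it predicts --
# the same self-reference that blocks an agent from simulating itself.
```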
In at least these cases it seems better to say “I don’t know how to simulate this part of me”, rather than pretending I can assign a computable distribution to how I will behave. For example, if I don’t know whether I’m going to finish writing this comment in 5 minutes, I can assign it the imprecise probability [0.2, 1.0]. And then if I want to act safely, I just assume the worst-case outcomes for the parts of me I don’t know how to simulate, and act accordingly. This applies to other parts of the world I can’t simulate as well—the physical world (which contains me), or simply other agents I have reason to believe are smarter than me.
(*I’m using “I” here, but I really mean some model or computer capable of more precise simulation and prediction than humans are.)
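A minimal maximin sketch of “assume the worst case and act accordingly” (the actions and payoffs are invented for illustration):

```python
# P(finish the comment in 5 minutes) is only known to lie in [0.2, 1.0].
# Score each action by its worst-case expected payoff over that interval;
# expectations are linear in p, so only the interval endpoints matter.

p_low, p_high = 0.2, 1.0

actions = {
    # (payoff if I finish in time, payoff if I don't)
    "promise it in 5 minutes": (10, -50),
    "promise it within the hour": (3, 3),
}

def worst_case(finish_payoff, miss_payoff):
    return min(p * finish_payoff + (1 - p) * miss_payoff
               for p in (p_low, p_high))

for action, payoffs in actions.items():
    print(action, "-> worst case:", worst_case(*payoffs))

best = max(actions, key=lambda a: worst_case(*actions[a]))
print("maximin choice:", best)  # "promise it within the hour"
```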
Does it make more sense to think about the set of all probability distributions that offer a probability of 50% for rain tomorrow? If we say this set represents our epistemic state, then we’re saying something like “the probability of rain tomorrow is 50%, and we withhold judgement about rain on any other day”.
It feels more natural, but I’m unclear what this example is trying to prove. It still reads to me like “if we think rain is 50% likely tomorrow then it makes sense to say rain is 50% likely tomorrow” (which I realize is presumably not what is meant, but it’s how it feels).