Thank you for this post; overall I think it is interesting and relevant, at least to my interests. There was one thing I wanted clarification on, however:
Level 4 uncertainties refer to a situation in which you know about what outcomes are possible but you do not know anything about their probability distributions, not even the ranking.
I’m often confused by these kinds of claims, as I don’t fully understand the assertion and/or problem here: if you genuinely cannot do better than assigning 1/n probability to each of n outcomes, then that is a legitimate distribution that you could use for expected-utility calculations. The reality is that oftentimes we do know at least slightly better than pure ignorance, but regardless, I’m just struggling to see why even pure ignorance is such a problem for expected utility calculations which acknowledge this state of affairs?
@Vanessa Kosoy has a nice explanation of Level 4 uncertainties (a.k.a. Knightian uncertainty) in the context of her work on infra-Bayesianism. The following is from her AXRP podcast interview with @DanielFilan (https://axrp.net/episode/2021/03/10/episode-5-infra-bayesianism-vanessa-kosoy.html):
Daniel Filan: Okay. I guess this gets to a question that I have, which is, is the fact that we’re dealing with this convex sets of distributions … because that’s the main idea, and I’m wondering how that lets you deal with non-realizability, because it seems to me that if you have a convex set of probability distributions, in standard Bayesianism, you could just have a mixture distribution over all of that convex set, and you’ll do well on things that are inside your convex set, but you’ll do poorly on things that are outside your convex set. Yeah, can you give me a sense of how … Maybe this isn’t the thing that helps you deal with non-realizability, but if it is, how does it?
Vanessa Kosoy: The thing is, a convex set, you can think of it as some property that you think the world might have, right? Just let’s think of a trivial example. Suppose your world is a sequence of bits, so just an infinite sequence of bits, and one hypothesis you might have about the world is maybe all the odd bits are equal to zero. This hypothesis doesn’t tell us anything about even bits. It’s only a hypothesis about odd bits, and it’s very easy to describe it as such a convex set. We just consider all probability distributions that predict that the odd bits will be zero with probability one, and without saying anything at all—the even bits, they can be anything. The behavior there can be anything.
Vanessa Kosoy: Okay, so what happens is, if instead of considering this convex set, you consider some distribution on this convex set, then you always get something which makes concrete predictions about the even bits. You can think about it in terms of computational complexity. All the probability distributions that you can actually work with have bounded computational complexity because you have bounded computational complexity. Therefore, as long as you’re assuming a probability distribution, a specific probability distribution, or it can be a prior over distributions, but that’s just the same thing. You can also average them, get one distribution. It’s like you’re assuming that the world has certain low computational complexity.
Vanessa Kosoy: One way to think of it is that Bayesian agents have a dogmatic belief that the world has low computational complexity. They believe this fact with probability one, because all their hypotheses have low computational complexity. You’re assigning probability one to this fact, and this is a wrong fact, and when you’re assigning probability one to something wrong, then it’s not surprising you run into trouble, right? Even Bayesians know this, but they can’t help it because there’s nothing you can do in Bayesianism to avoid it. With infra-Bayesianism, you can have some properties of the world, some aspects of the world can have low computational complexity, and other aspects of the world can have high complexity, or they can even be uncomputable. With this example with the bits, your hypothesis, it says that the odd bits are zero. The even bits, they can be uncomputable. They can be like the halting oracle or whatever. You’re not trying to have a prior over them because you know that you will fail, or at least you know that you might fail. That’s why you have different hypotheses in your prior.
Perhaps this is a nice explanation for some people with mathematical or statistical knowledge, but alas it goes way over my head.
(Specifically, I get lost here: “ We just consider all probability distributions that predict that the odd bits will be zero with probability one, and without saying anything at all—the even bits, they can be anything.”)
(Granted, I now at least think I understand what a convex set is, although I fail to understand its relevance in this conversation.)
Fair point! Sorry it wasn’t the most helpful. My attempt at explaining a bit more below:

Convex sets are just sets in which every point can be expressed as a weighted sum of points on the exterior of the set (see https://reference.wolfram.com/language/ref/ConvexHullMesh.html for an illustration).

In 1D, convex sets are just intervals, [a, b], and convex sets of probability distributions basically correspond to intervals of probability values, e.g. [0.1, 0.5], which are often called “imprecise probabilities”.
Let’s generalize this idea to 2D. There are two events, A and B, which I am uncertain about. If I were really confident, I could say that I think A happens with probability 0.2, and B happens with probability 0.8. But what if I feel so ignorant that I can’t assign a probability to event B? That means I think P(B) could be any probability between [0.0, 1.0], while keeping P(A) = 0.2. So my joint probability distribution P(A, B) is somewhere within the line segment (0.2, 0.0) to (0.2, 1.0). Line segments are convex sets.
You can generalize this notion to infinite dimensions—e.g. for a bit sequence of infinite length, specifying a complete probability distribution would require saying how likely each bit is to be equal to 1, conditioned on the values of all of the other bits. But we could instead only assign probabilities to the odd bits, not the even bits, and that would correspond to a convex set of probability distributions.
Hopefully that explains the convex set bit. The other part is why it’s better to use convex sets. Well, one reason is that sometimes we might be unwilling to specify a probability distribution, because we know the true underlying process is uncomputable. This problem arises, for example, when an agent is trying to simulate itself. I* can never perfectly simulate a copy of myself within my mind, even probabilistically, because that leads to infinite regress—this sort of paradox is related to the halting problem and Gödel’s incompleteness theorem.
In at least these cases it seems better to say “I don’t know how to simulate this part of me”, rather than pretending I can assign a computable distribution to how I will behave. For example, if I don’t know if I’m going to finish writing this comment in 5 minutes, I can assign it the imprecise probability [0.2, 1.0]. And then if I want to act safely, I just assume the worst-case outcomes for the parts of me I don’t know how to simulate, and act accordingly. This applies to other parts of the world I can’t simulate as well—the physical world (which contains me), or simply other agents I have reason to believe are smarter than me.
(*I’m using “I” here, but I really mean some model or computer that is capable of more precise simulation and prediction than humans are capable of.)
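To make that concrete, here is a minimal sketch (my own illustration, not from the original comment) of acting on an imprecise probability: the [0.2, 1.0] interval is taken from the comment-finishing example above, and the payoffs are invented purely for illustration.

```python
# Sketch: an imprecise probability as an interval of admissible probabilities,
# plus a worst-case ("act safely") evaluation. The payoffs are invented for illustration.

def expected_payoff(p_finish, payoff_finish=1.0, payoff_miss=-3.0):
    """Expected payoff of some plan if the comment gets finished with probability p_finish."""
    return p_finish * payoff_finish + (1 - p_finish) * payoff_miss

# Precise belief: a single number.
print("precise p = 0.6:", expected_payoff(0.6))

# Imprecise belief: every probability in [0.2, 1.0] is treated as admissible.
grid = [0.2 + 0.01 * i for i in range(81)]  # fine grid over the interval [0.2, 1.0]
print("worst case over [0.2, 1.0]:", min(expected_payoff(p) for p in grid))
print("best case over [0.2, 1.0]:", max(expected_payoff(p) for p in grid))
# "Acting safely" here means ranking plans by their worst-case expected payoff
# over the whole interval, rather than by a single best-guess expectation.
```

(Since the set of admissible distributions over a single binary event is just an interval, the extremes sit at the endpoints; the grid is only there to make the “set of distributions” reading explicit.)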
Does it make more sense to think about all probability distributions that offer a probability of 50% for rain tomorrow? If we say this represents our epistemic state, then we’re saying something like “the probability of rain tomorrow is 50%, and we withhold judgement about rain on any other day”.
It feels more natural, but I’m unclear what this example is trying to prove. It still reads to me like “if we think rain is 50% likely tomorrow then it makes sense to say rain is 50% likely tomorrow” (which I realize is presumably not what is meant, but it’s how it feels).
I think assigning 1/n typically depends on evidential symmetry (like simple cluelessness), or at least on the reasons all balancing out precisely, so it rules out complex cluelessness. Instead, we might have evidence for and against each possibility, but be unable to weigh it all without making very arbitrary assumptions, so we wouldn’t be willing to commit to the belief that A is more likely than B or vice versa, or that they’re equally likely. There’s an illustrative example here.
Similarly, Brian Tomasik claimed, after looking into many different effects and considerations:

On balance, I’m extremely uncertain about the net impact of climate change on wild-animal suffering; my probabilities are basically 50% net good vs. 50% net bad when just considering animal suffering on Earth in the next few centuries (ignoring side effects on humanity’s very long-term future).
But if he had built formal models with precise probabilities, it would almost certainly have come out with climate change bad in expectation or climate change good in expectation, rather than net neutral in expectation, and the expected impact could be (but wouldn’t necessarily be) very very large. But someone else with slightly different (but pretty arbitrary) precise probabilities could get the opposite sign and still huge expected impact. It would seem bad to bet a lot on one side if the sign and magnitude of the expected value is sensitive to arbitrarily chosen numbers.
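To illustrate the sign-flipping point with a toy sketch (this is not Tomasik’s model; every parameter range below is invented for illustration): a simple net-impact formula, evaluated over a handful of equally defensible parameter choices, can come out positive or negative with sizeable magnitude either way.

```python
# Toy illustration (not a real model): expected net impact = p_good * benefit - (1 - p_good) * harm.
# Every value below is a made-up "reasonable" choice a modeller might defend.
import itertools

p_good_options = [0.45, 0.50, 0.55]      # probability the effect is net positive
benefit_options = [0.8e9, 1.0e9, 1.2e9]  # magnitude if net positive
harm_options = [0.8e9, 1.0e9, 1.2e9]     # magnitude if net negative

for p, b, h in itertools.product(p_good_options, benefit_options, harm_options):
    ev = p * b - (1 - p) * h
    print(f"p={p:.2f}  benefit={b:.1e}  harm={h:.1e}  ->  expected impact {ev:+.2e}")
# Different rows come out with opposite signs and large magnitudes, even though
# every individual parameter choice looks about as defensible as the others.
```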
Even if multiple people come up with different numbers and we want to weigh them, there’s still a question of how exactly to weigh them given possibly different levels of relevant expertise and bias between them, so 1/n is probably wrong, but all other approaches to come up with single precise numbers are going to involve arbitrary parameters/weights.
But someone else with slightly different (but pretty arbitrary) precise probabilities could get the opposite sign and still huge expected impact. It would seem bad to bet a lot on one side if the sign and magnitude of the expected value is sensitive to arbitrarily chosen numbers.
I wonder if the problem here is a failure to disentangle “what is our best estimate currently” and “what do we expect is the value of doing further analysis, given how fragile our current estimates are.”
If my research agent Alice said “I think there’s a 50% chance that doing X leads to +2,000,000,000 utils and a 50% chance that doing X leads to −1,000,000,000 utils (and the same probabilities that not doing X leads to the opposite outcomes), but these probability estimates are currently just pure 1/n uncertainty; such estimates could easily shift over time pending further analysis” I would probably say “wow I don’t like the uncertainty here, can we maybe do further analysis to make sure we’re right before choosing to do X?”
In other words, the concern seems to be that you don’t want to misrepresent the potential for new information to change your estimates.
However, suppose Alice actually says “… and no matter how much more research effort we apply (within real-world constraints) we are confident that our probability estimates will not meaningfully change.” At that point, there is no chance of improving, so you are stuck with pure 1/n ignorance.
Perhaps I’m just unclear what it would even mean to be in a situation where you “can’t” put a probability estimate on things that does as well as or better than pure 1/n ignorance. I can understand the claim that in some scenarios you perhaps “shouldn’t” because it risks miscommunicating about the potential value of trying to improve your probability estimates, but that doesn’t seem like an insurmountable problem (i.e., we could develop better terms and communication norms for this)?
(and the same probabilities that not doing X leads to the opposite outcomes)
I’m not sure exactly what you mean by this, and I expect this will make it more complicated to think about than just giving utility differences with the counterfactual.
The idea of sensitivity to new information has been called credal resilience/credal fragility, but the problem I’m concerned with is having justified credences. I would often find it deeply unsatisfying (i.e. it seems unjustifiable) to represent my beliefs with a single probability distribution; I’d feel like I’m pulling numbers out of my ass, and I don’t think we should base important decisions on such numbers. So, I’d often rather give ranges for my probabilities. You literally can give single distributions/precise probabilities, but it seems unjustifiable, overconfident and silly.
If you haven’t already, I’d recommend reading the illustrative example here. I’d say it’s not actually justifiable to assign precisely 50-50 in that case or in almost any realistic situation that actually matters, because:
if you actually tried to build a model, it would be extraordinarily unlikely for you to get 50-50 unless you specifically pick your model parameters to get that result (which would be motivated reasoning and kind of defeat the purpose of building the model in the first place) or round the results, given that the evidence isn’t symmetric and you’d have multiple continuous parameters.
if you thought 50-50 was a good estimate before the evidential sweetening, then you can’t use 50-50 after, even though it seems just as appropriate for it. Furthermore, if you would have used 50-50 if originally presented with the sweetened information, then your beliefs depend on the timing/order in which you become aware of evidence (say you just miscounted witnesses the first time), which should be irrelevant and is incompatible with Bayesian rationality (unless you have specific reasons for dependence on the timing/order).
For the same reasons, in almost any realistic situation that actually matters, Alice in your example could not justifiably get 50-50. And in general, you shouldn’t get numbers with short exact decimal or fractional representations.
So, say in your example, it comes out 51.28… to 48.72..., but could have gone the other way under different reasonable parameter assignments; those are just the ones Alice happened to pick at that particular time. Maybe she also tells you it seems pretty arbitrary, and she could imagine having come up with the opposite conclusion and probabilities much further from 50-50 in each direction. And that she doesn’t have a best guess, because, again, it seems too arbitrary.
How would you respond if there isn’t enough time to investigate further? Suppose you could instead support something that seems cost-effective without being so sensitive to pretty arbitrary parameter assignments, though not nearly as cost-effective as Alice’s intervention or an intervention doing the opposite.
Also imagine Bob gets around 47-53, and agrees with Alice about the arbitrariness and reasonable ranges. Furthermore, you can’t weigh Alice and Bob’s distributions evenly, because Alice has slightly more experience as a researcher and/or a slightly better score in forecasting, so you should give her estimate more weight.
Great to see people digging into the crucial assumptions!

In my view, @MichaelStJules makes great counterpoints to @Harrison Durland’s objection. I would like to add two further points.

The notion of 1/n probability somewhat breaks down if you look at an infinite number of scenarios or uncertainty values (if you talk about one particular uncertain variable). For example, let’s take population growth in economic models. Depending on your model and potential sensitivities to initial conditions, the resolution of this variable matters. For some context, current population growth is at 1.1% per annum. But we might be uncertain about how this will develop in the future. Maybe 1.0%? Maybe 1.2%? Maybe a resolution of 0.1% is enough. And in this case, what range would we feel comfortable putting a probability distribution over? [0.6, 1.5] maybe? So then n = 10, and with a uniform distribution you get that 1.4% population growth is 10% likely? But what if minor changes are important? You end up with an infinite number of potential values – even if you restrict the range of possible values. How do we square this situation with the 1/n approach? I’m uncertain.
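One way to make the resolution question concrete (a sketch of my own; the growth range and the toy model are illustrative assumptions, and whether a uniform distribution over that range is actually a justified belief is exactly what is in dispute here): 1/n weights over a finer and finer grid just approach a continuous uniform distribution on the interval, and expectations computed from the grid stabilize as the resolution increases.

```python
# "1/n over a continuous range" at different resolutions: uniform weights over a
# finer and finer grid on [0.6, 1.5] (% growth) approach a continuous uniform
# distribution, and expectations of a model output stabilize as the grid is refined.
# The range and the toy model are illustrative assumptions, not real estimates.

def model_output(growth_rate_percent):
    # stand-in for some economic model that depends on population growth
    return (1 + growth_rate_percent / 100) ** 50

for n in [10, 100, 10_000]:
    step = (1.5 - 0.6) / n
    grid = [0.6 + step * (i + 0.5) for i in range(n)]  # midpoints, each weighted 1/n
    expectation = sum(model_output(g) for g in grid) / n
    print(f"n = {n:>6}: expected output = {expectation:.4f}")
# The per-point weight 1/n goes to zero, but the induced belief (a uniform density
# on the interval) and the resulting expectation remain well defined.
```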
My other point is more of a disclaimer. I’m not advocating for throwing out expected-utility thinking completely. And I’m still a Bayesian at heart (which sometimes means that I pull numbers out of my behind^^). My point is that it is sometimes problematic to use a model, run it in a few configurations (i.e. for a few scenarios), calculate a weighted average of the outcomes and call it a day. This is especially problematic if we look at complex systems and models in which non-linearities are compounding quickly. If you have 10 uncertainty variables, each of them of type float with huge ranges of plausible values, how do you decide what scenarios (points in uncertainty space) to run? A posteriori weighted averaging likely fails to capture the complex interactions and the outcome distributions. What I’m trying to say is that I’m still going to assume probabilities and probability distributions in daily life. And I will still conduct expected-utility calculations. However, when things get more complex (e.g. in model land), I might advocate for more caution.
I’m not sure I understand the concern with (1); I would first say that I think infinities are occasionally thrown around too lightly, and in this example it seems like it might be unjustified to say there are infinite possible values, especially since we are talking about units of people/population (which is composed of finite matter and discrete units). Moreover, the actual impact of a difference between 1.0000000000002% and 1.00000000000001% in most values seems unimportant for practical decision-making considerations—which, notably, are not made with infinite computation and data and action capabilities—even if it is theoretically possible to have such a difference. If something like that which seems so small is actually meaningful (e.g., it flips signs), however, then that might update you towards beliefs like “within analytical constraints the current analysis points to [balancing out |OR| one side being favored].” In other words, perhaps not pure uncertainty, since now you plausibly have some information that leans one way or another (with some caveats I won’t get into).
I think I would agree to some extent with (2). My main concern is mostly that I see people write things that (seemingly) make it sound like you just logically can’t do expected utility calculations when you face something like pure uncertainty; you just logically have to put a “?” in your models instead of “1/n,” which just breaks the whole model. Sometimes (like the examples I mentioned), the rest of the model is fine!
I contend that you can use “1/n”; it’s more just a matter of “should you do so given that you run the risk of misleading yourself or your audience towards X, Y, and Z failure modes (e.g., downplaying the value of doing further analysis, putting too many eggs in one basket/ignoring non-linear utility functions, creating bad epistemic cultures which disincentivize people from speaking out against overconfidence, …).”
In other words, I would prefer to see clearer disentangling of epistemic/logical claims from strategic/communication claims.
“While useful, even models that produced a perfect probability density function for precisely selected outcomes would not prove sufficient to answer such questions. Nor are they necessary.”
I recommend reading DMDU, since it goes into much more detail than I can do justice to here.
Yet I believe you are focusing heavily on the concept of the distribution existing, while the claim should be restated:
Deep uncertainty implies that the range of reasonable distributions allows so many reasonable decisions that attempting to “agree on assumptions then act” is a poor frame. Instead, you want to explore all reasonable distributions then “agree on decisions”.
If you are in a state where reasonable people are producing meaningfully different decisions (i.e. a different sign, per your convention above) based on the distribution and weighting terms, then it becomes more useful to focus on the timeline and tradeoffs rather than on the current understanding of the distribution:
Explore the largest range of scenarios (in the 1/n case each time you add another plausible scenario it changes all scenario weights)
Understand the sequence of actions/information released
Identify actions that won’t change with new info
Identify information that will meaningfully change your decision
Identify actions that should follow given the new information
Quantify tradeoffs forced with decisions
This results in building an adaptive policy pathway rather than making a decision or even choosing a model framework.
Value is derived from expanding the suite of policies, scenarios and objectives or illustrating the tradeoffs between objectives and how to minimize those tradeoffs via sequencing.
This is in contrast to emphasizing the optimal distribution (or worse, a point estimate) conditional on all available data, since that distribution is still subject to change over time and is evaluated under different weights by different stakeholders.
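As a minimal sketch of “explore scenarios, then agree on decisions” (my own toy illustration, not drawn from a DMDU case study; all payoffs are invented): candidate policies are scored across an ensemble of scenarios and compared on robustness criteria such as worst case or maximum regret, with no scenario probabilities needed.

```python
# Toy robustness comparison: payoffs[policy][scenario] is assumed given; all numbers
# are invented for illustration and no scenario probabilities are used anywhere.
payoffs = {
    "do_nothing": {"s1": 0,  "s2": 0,  "s3": 0},
    "aggressive": {"s1": 90, "s2": 40, "s3": -80},
    "hedged":     {"s1": 50, "s2": 30, "s3": -5},
}
scenarios = ["s1", "s2", "s3"]

# Regret of a policy in a scenario = shortfall relative to the best policy in that scenario.
best_in_scenario = {s: max(p[s] for p in payoffs.values()) for s in scenarios}
for name, p in payoffs.items():
    worst_payoff = min(p[s] for s in scenarios)
    max_regret = max(best_in_scenario[s] - p[s] for s in scenarios)
    print(f"{name:>12}: worst payoff {worst_payoff:>4}, max regret {max_regret:>4}")
# Ranking by worst case or maximum regret is one way to "agree on decisions" across
# the whole ensemble without first agreeing on how likely each scenario is.
```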
I’m not sure exactly what you mean by this, and I expect this will make it more complicated to think about than just giving utility differences with the counterfactual.
I just added this in hastily to address any objection that says something like “What if I’m risk averse and prefer a 100% chance of getting 0 utility instead of an x% chance of getting very negative utility.” It would probably have been better to just say something like “ignore risk aversion and non-linear utility.”
I would often find it deeply unsatisfying (i.e. it seems unjustifiable) to represent my beliefs with a single probability distribution; I’d feel like I’m pulling numbers out of my ass, and I don’t think we should base important decisions on such numbers. So, I’d often rather give ranges for my probabilities. You literally can give single distributions/precise probabilities, but it seems unjustifiable, overconfident and silly.
I think this boils down to my point about the fear of miscommunicating—questions like “how should I communicate my findings,” “what do my findings say about doing further analysis,” and “what are my findings’ current best-guess estimates.” If you think it goes beyond that—that it is actually “intrinsically incorrect-as-written,” I could write up a longer reply elaborating on the following: I’d pose the question back at you and ask whether it’s really justified or optimal to include ambiguity-laden “ranges” assuming there will be no miscommunication risks (e.g., nobody assumes “he said 57.61% so he must be very confident he’s right and doing more analysis won’t be useful”)? If you say “there’s a 1%-99% chance that a given coin will land on heads” because the coin is weighted but you don’t know whether it’s for heads or tails, how is this functionally any different from saying “my best guess is that on one flip the coin has a 50% chance of landing on heads”? (Again, I could elaborate further if needed)
if you actually tried to build a model, it would be extraordinarily unlikely for you to get 50-50
Sure, I agree. But that doesn’t change the decision in the example I gave, at least when you leave it at “upon further investigation it’s actually about 51-49.” In either case, the expected benefit-cost ratio is still roughly around 2:1. When facing analytical constraints and for this purely theoretical case, it seems optimal to do the 1/n estimate rather than “NaN” or “” or “???” which breaks your whole model and prevents you from calculating anything, so long as you’re setting aside all miscommunication risks (which was the main point of my comment: to try to disentangle miscommunication and related risks from the ability to use 1/n probabilities as a default optimal). To paraphrase what I said for a different comment, in the real world maybe it is better to just throw a wrench in the whole model and say “dear principal: no, stop, we need to disengage autopilot and think longer.” But I’m not at the real world yet, because I want to make sure I am clear on why I see so many people say things like you can’t give probability estimates for pure uncertainty (when in reality it seems nothing is certain anyway and thus you can’t give 100.0% “true” point or range estimates for anything).
Perhaps I’m just unclear what it would even mean to be in a situation where you “can’t” put a probability estimate on things that does as well as or better than pure 1/n ignorance.
Suppose you think you might come up with new hypotheses in the future which will cause you to reevaluate how the existing evidence supports your current hypotheses. In this case probabilistically modelling the phenomenon doesn’t necessarily get you the right “value of further investigation” (because you’re not modelling hypothesis X), but you might still be well advised to hold off acting and investigate further—collecting more data might even be what leads to you thinking of the new hypothesis, leading to a “non Bayesian update”. That said, I think you could separately estimate the probability of a revision of this type.
Similarly, you might discover a new outcome that’s important that you’d previously neglected to include in your models.
One more thing: because probability is difficult to work with, even if it is in principle compatible with adaptive plans, it might in practice tend to steer away from them.
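For concreteness, here is the standard version of the “value of further investigation” calculation mentioned above (expected value of perfect information), using the numbers from the earlier Alice example and treating “do nothing” as 0 utils for simplicity; as the comment notes, a calculation like this only prices in hypotheses that are already in the model.

```python
# Expected value of perfect information (EVPI) for the earlier Alice example,
# treating "do nothing" as 0 utils for simplicity.
p_good = 0.5
payoff_good, payoff_bad = 2_000_000_000, -1_000_000_000

# Best expected utility acting now (choose between doing X and doing nothing):
ev_act_now = max(0, p_good * payoff_good + (1 - p_good) * payoff_bad)
# Best expected utility if a perfect study first told us which outcome holds:
ev_with_perfect_info = p_good * max(0, payoff_good) + (1 - p_good) * max(0, payoff_bad)

evpi = ev_with_perfect_info - ev_act_now
print(evpi)  # 5e8: the most further investigation could be worth *within this model*
# A hypothesis that is not represented in the model contributes nothing to this
# number, which is the gap pointed out above.
```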
In this case probabilistically modelling the phenomenon doesn’t necessarily get you the right “value of further investigation” (because you’re not modelling hypothesis X)
I basically agree (although it might provide a decent amount of information to this end), but this does not reject the idea that you can make a probability estimate equally or more accurate than pure 1/n uncertainty.
Ultimately, if you want to focus on “what is the expected value of doing further analyses to improve my probability estimates,” I say go for it. You often shouldn’t default to accepting pure 1/n ignorance. But I still can’t imagine a situation that truly matches “Level 4 or Level 5 Uncertainty,” where there is nothing as good as or better than pure 1/n ignorance. If you truly know absolutely and purely nothing about a probability distribution—which almost never happens—then it seems 1/n estimates will be the default optimal distribution, because anything else would require being able to offer supposedly-nonexistent information to justify that conclusion.
Ultimately, a better framing (to me) would seem like “if you find yourself at 1/n ignorance, you should be careful not to accept that as a legitimate probability estimate unless you are really rock solid confident it won’t improve.” No?
I think this question—whether it’s better to take 1/n probabilities (or maximum entropy distributions or whatever) or to adopt some “deep uncertainty” strategy—does not have an obvious answer
I actually think it probably (pending further objections) does have a somewhat straightforward answer with regards to the rather narrow, theoretical cases that I have in mind, which relate to the confusion I had which started this comment chain.
It’s hard to accurately convey the full degree of my caveats/specifications, but one simple example is something like “Suppose you are forced to choose whether to do X or nothing (Y). You are purely uncertain whether X will lead to outcome Great (Q), Good (P), or Bad (W), and there is guaranteed to be no way to get further information on this. However, you can safely assume that outcome Q is guaranteed to lead to +1,000 utils, P is guaranteed to lead to +500 utils, and W is guaranteed to lead to −500 utils. Doing nothing is guaranteed to lead to 0 utils. What should you do, assuming utils do not have non-linear effects?”
In this scenario, it seems very clear to me that a strategy of “do nothing” is inferior to doing X: even though you don’t know what the actual probabilities of Q, P, and W are, I don’t understand how the 1/n default will fail to work (across a sufficiently large number of 1/n cases). And when taking the 1/n estimate as a default, the expected utility is positive.
Of course, outside of barebones theoretical examples (i.e., in the real world) I don’t think there is a simple, straightforward algorithm for deciding when to pursue more information vs. act on limited information with significant uncertainty.
Good point! I think this is also a matter of risk aversion. How severe is it to get to a state of −500 utils? If you are very risk-averse, it might be better to do nothing. But I cannot make such a blanket statement.
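Putting numbers on both readings of the example above (a quick sketch; the 1/n expectation is the one the example relies on, and the worst-case line is what a strongly risk-averse reasoner would look at):

```python
# Worked numbers for the X-vs-nothing example above.
outcomes = {"Q": 1000, "P": 500, "W": -500}

# Treating pure ignorance as 1/n over the three outcomes:
ev_x_uniform = sum(outcomes.values()) / len(outcomes)
print(ev_x_uniform)  # +333.33..., versus a guaranteed 0 for doing nothing

# Treating pure ignorance instead as the set of *all* probability assignments,
# the worst-case expected utility of X is just its worst outcome:
print(min(outcomes.values()))  # -500, which is why a very risk-averse or
                               # worst-case reasoner might still prefer doing nothing
```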
I’d like to emphasize at this point that the DMDU approach is trying to avoid having to:
test the performance of a set of policies for a set number of scenarios,
decide how likely each scenario is (this is the crux), and
calculate some weighted average for each policy.
Instead, we use DMDU to explore the full range of plausible scenarios and identify particularly vulnerable ones. We want to pay special attention to these scenarios and find optimal and robust solutions for them. This way, we cover tail risks, which IMO is quite in line with efforts to mitigate GCRs, x-risks, and s-risks.
If you truly know absolutely and purely nothing about a probability distribution—which almost never happens
I would disagree with this particular statement. I’m not saying the opposite either. I think it’s reasonable in a lot of cases to assume some probability distributions. However, there are a lot of cases where we just do not know at all. E.g., take the space of possible minds. What’s our probability distribution for the first AGI over this space? I personally don’t know. Even looking at binary events – what’s our probability for AI x-risk this century? 10%? I find this widely used number implausible.
But I agree that we can try gathering more information to get more clarity on that. What is often done in DMDU analysis is that we figure out that some uncertainty variables don’t have much of an impact on our system anyway (so we fix the variables to some value) or that we constrain their value ranges to focus on more relevant subspaces. The DMDU framework does not necessitate or advocate for total ignorance. I think, there is room for an in-between.
A friend of mine just mentioned to me that the following points could be useful in the context of this discussion.
What DMDU researchers usually do is use uniform probability distributions for all parameters when exploring future scenarios. This approach allows for a more even exploration of the plausible space, rather than being overly concerned with subjective probabilities, which may lead to sampling some regions of input-output space less densely and potentially missing decision-relevant outcomes. The benefit of using uniform probability distributions is that it can help to avoid compounding uncertainties in a way that can lead to biased results. When you use a uniform distribution, you assume that all values are equally likely within the range of possible outcomes. This approach can help to ensure that your exploration of the future is more comprehensive and that you are not overlooking important possibilities. Of course, there may be cases where subjective probabilities are essential, such as when there is prior knowledge or data that strongly suggests certain outcomes are more likely than others. In such cases, I’d say that it may be appropriate to incorporate those probabilities into the model.
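A minimal sketch of this exploratory use of uniform distributions (my own illustration; the toy model, parameter ranges, and threshold are all invented): sample the uncertainty space evenly, run the model everywhere, and then characterize the scenarios where outcomes become decision-relevant, rather than weighting scenarios by how likely they seem.

```python
# Exploratory sampling sketch: uniform draws over parameter ranges, then "scenario
# discovery" = flagging sampled scenarios where the outcome crosses a threshold.
import random

random.seed(0)
N = 10_000
FAILURE_THRESHOLD = 0.0  # illustrative: outcomes below this count as "vulnerable"

def toy_model(growth, damage_sensitivity):
    # stand-in for a more complex simulation model; mildly nonlinear on purpose
    return (1 + growth) ** 20 - damage_sensitivity ** 2

vulnerable = []
for _ in range(N):
    growth = random.uniform(-0.01, 0.03)           # plausible range, not a belief
    damage_sensitivity = random.uniform(0.5, 2.0)  # plausible range, not a belief
    if toy_model(growth, damage_sensitivity) < FAILURE_THRESHOLD:
        vulnerable.append((growth, damage_sensitivity))

print(f"{len(vulnerable)} of {N} sampled scenarios fall below the threshold")
if vulnerable:
    print("e.g. one vulnerable scenario:", vulnerable[0])
# The share of vulnerable samples is not read off as a probability of failure;
# the point is to find and characterize the vulnerable region so that policies
# can be stress-tested against it.
```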
Also, this paper by James Derbyshire on probability-based versus plausibility-based scenarios might be very relevant. The underlying idea of plausibility-based scenarios is that any technically possible outcome of a model is plausible in the real world, regardless of its likelihood (given that the model has been well validated). This approach recognizes that complex systems, especially those with deep uncertainties, can produce unexpected outcomes that may not have been considered in a traditional probability-based approach. When making decisions under deep uncertainty, it’s important to take seriously the range of technically possible but seemingly unlikely outcomes. This is where the precautionary principle comes in (which advocates for taking action to prevent harm even when there is uncertainty about the likelihood of that harm). By including these “fat tail” outcomes in our analysis, we are able to identify and prepare for potentially severe outcomes that may have significant consequences. Additionally, nonlinearities can further complicate the relationship between probability and plausibility. In some cases, even a small change in initial conditions or inputs can lead to drastic differences in the final outcome. By exploring the range of plausible outcomes rather than just the most likely outcomes, we can better understand the potential consequences of our decisions and be more prepared to mitigate risks and respond to unexpected events.
I’m not sure I disagree with any of this, and in fact if I understood correctly, the point about using uniform probability distributions is basically what I was suggesting: it seems like you can always put 1/n instead of a “?” which just breaks your model. I agree that sometimes it’s better to say “?” and break the model because you don’t always want to analyze complex things on autopilot through uncertainty (especially if there’s a concern that your audience will misinterpret your findings), but sometimes it is better to just say “we need to put something in, so let’s put 1/n and flag it for future analysis/revision.”