I found the answers to this question on stats.stackexchange useful for thinking about and getting a rough overview of “uninformative” priors, though much of it is a bit too technical to apply easily in practice. It’s aimed at formal Bayesian inference rather than more general forecasting.
In information theory, entropy is a measure of (lack of) information—high entropy distributions have low information. That’s why the principle of maximum entropy, as Max suggested, can be useful.
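To make the entropy point concrete, here’s a toy example of my own (in Python) comparing a flat distribution over four outcomes with more peaked ones:

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: the maximum for four outcomes
print(entropy_bits([0.85, 0.05, 0.05, 0.05]))  # ~0.85 bits: more peaked, more informative
print(entropy_bits([1.0, 0.0, 0.0, 0.0]))      # 0.0 bits: no uncertainty left
```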
Another meta answer is to use the Jeffreys prior. This has the property that it is invariant under a change of coordinates. That isn’t true of maximum entropy priors in general, and it’s a source of inconsistency (see e.g. the partition problem for the principle of indifference, which is just a special case of the principle of maximum entropy). Jeffreys priors are often unwieldy, but one important exception is the interval [0,1] (e.g. for a probability), for which the Jeffreys prior is the beta(1/2,1/2) distribution. See the red line in the graph at the top of the beta distribution Wikipedia page: the density is spread towards the edges, close to 0 and 1.
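As a quick illustration (a sketch in Python, assuming scipy is available) of how the beta(1/2,1/2) prior piles its mass near the endpoints compared with the uniform beta(1,1):

```python
from scipy.stats import beta

jeffreys = beta(0.5, 0.5)  # Jeffreys prior for a probability
uniform = beta(1.0, 1.0)   # uniform / maximum entropy prior

# The Jeffreys density is lowest in the middle and highest near 0 and 1
for x in [0.01, 0.1, 0.5, 0.9, 0.99]:
    print(f"x = {x}: Jeffreys pdf {jeffreys.pdf(x):.2f}, uniform pdf {uniform.pdf(x):.2f}")

# Probability mass within 0.1 of the endpoints
print("Jeffreys:", jeffreys.cdf(0.1) + jeffreys.sf(0.9))  # ~0.41
print("uniform: ", uniform.cdf(0.1) + uniform.sf(0.9))    # 0.2
```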
This relates to Max’s comment about Laplace’s Rule of Succession: taking N_v = 2, M_v = 1 corresponds to the uniform distribution on [0,1] (which is just beta(1,1)). This is the maximum entropy distribution on [0,1]. But as Max mentioned, we can vary N_v and M_v. Using the Jeffreys prior would be like setting N_v = 1 and M_v = 1⁄2, which doesn’t have as nice an interpretation (half a success?) but has nice theoretical features. It’s especially useful if you want to put the density around 0 and 1 but still have mean 1⁄2.
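To make the N_v and M_v choices concrete: with a beta(a, b) prior and M successes in N trials, the posterior mean is (M + a)/(N + a + b), so the uniform prior gives Laplace’s (M + 1)/(N + 2) and the Jeffreys prior gives (M + 1⁄2)/(N + 1). A minimal sketch (in Python, with made-up data):

```python
def posterior_mean(successes, trials, prior_a, prior_b):
    """Posterior mean for a Bernoulli probability under a beta(prior_a, prior_b) prior."""
    return (successes + prior_a) / (trials + prior_a + prior_b)

M, N = 3, 10  # hypothetical data: 3 successes in 10 trials
print("Laplace / uniform prior:", posterior_mean(M, N, 1.0, 1.0))  # (3 + 1) / (10 + 2) ≈ 0.333
print("Jeffreys prior:         ", posterior_mean(M, N, 0.5, 0.5))  # (3 + 0.5) / (10 + 1) ≈ 0.318
```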
There’s a bit more discussion of Laplace’s Rule of Succession and the Jeffreys prior in an EA context in Toby Ord’s comment in response to Will MacAskill’s Are we living at the most influential time in history?

Finally, a bit of a cop-out, but I think worth mentioning: one of the answers to the stats.stackexchange question linked above suggests using imprecise credences. Select a range of priors and see how much the resulting conclusions converge. You might find that the choice of prior doesn’t matter much, and when it does matter, I expect this could be useful for identifying your largest uncertainties.
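As a rough sketch of that idea in the simplest case (hypothetical priors and data, in Python), you can compare posterior means under a handful of beta priors and see how far apart they end up:

```python
# Hypothetical priors and data; the point is just to see how much the
# posterior mean for a probability moves around as the prior changes.
priors = {
    "uniform beta(1,1)": (1.0, 1.0),
    "Jeffreys beta(1/2,1/2)": (0.5, 0.5),
    "pessimistic beta(1,9)": (1.0, 9.0),
    "optimistic beta(9,1)": (9.0, 1.0),
}

for data_label, (m, n) in [("3 successes in 10 trials", (3, 10)),
                           ("30 successes in 100 trials", (30, 100))]:
    means = {name: (m + a) / (n + a + b) for name, (a, b) in priors.items()}
    print(data_label)
    for name, mean in means.items():
        print(f"  {name}: {mean:.2f}")
    print(f"  spread: {max(means.values()) - min(means.values()):.2f}")
```

With the larger sample the spread shrinks from roughly 0.4 to under 0.1, i.e. the prior choice matters much less once there’s more data.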
I’m confused about the partition problem you linked to. Both examples in that post seem to be instances where in one partition available information is discarded.
Suppose you have a jar of blue, white, and black marbles, of unknown proportions. One is picked at random, and if it is blue, the light is turned on. If it is black or white, the light stays off (or is turned off). What is the probability the light is on?
There isn’t one single answer. In fact, there are several possible answers.
[1.] You might decide to assign a 1⁄2 probability to the light being on, because you’ve got no reason to assign any other odds. It’s either on (50%) or off (50%).
[2.] You could assign the blue marble a 1⁄3 probability of being selected (after all, you know that there are three colors). From this it would follow that you have a 1⁄3 chance of the light being on, and 2⁄3 chance of the light being off.
Answer 1. seems to simply discard information about the algorithm that produces the result, i.e. that it depends on the color of the marbles. The same holds for the other example in the blogpost, where the information about the number of possible planets is ignored in one partition.
Yeah, these aren’t great examples, because there’s a choice of partition which is better than the others; thanks for pointing this out. The problem is more salient if, instead, you suppose that you have no information about how many different coloured marbles there are and ask what the probability of picking a blue marble is. There are different ways of partitioning the possibilities, but no obviously privileged partition. This is how Hilary Greaves frames it here.
Another good example is van Fraassen’s cube factory, e.g. described here.
Thanks a lot for the pointers! Greaves’ example seems to suffer the same problem, though, doesn’t it?
Suppose, for instance, you know only that I am about to draw a book from my shelf, and that each book on my shelf has a single-coloured cover. Then POI seems to suggest that you are rationally required to have credence ½ that it will be red (Q1 = red, Q2 = not-red; and you have no evidence bearing on whether or not the book is red), but also that you are rationally required to have credence 1/n that it will be red, where n is the ‘number of possible colours’ (Qi = ith colour; and you have no evidence bearing on what colour the book is).
We have information about the set and distribution of colors, and assigning 50% credence to the color red does not use that information.
The cube factory problem does suffer less from this, cool!
A factory produces cubes with side-length between 0 and 1 foot; what is the probability that a randomly chosen cube has side-length between 0 and 1⁄2 a foot? The classical interpretation’s answer is apparently 1⁄2, as we imagine a process of production that is uniformly distributed over side-length. But the question could have been given an equivalent restatement: A factory produces cubes with face-area between 0 and 1 square-feet; what is the probability that a randomly chosen cube has face-area between 0 and 1⁄4 square-feet? Now the answer is apparently 1⁄4, as we imagine a process of production that is uniformly distributed over face-area.
I wonder if one should simply model this hierarchically, assigning equal credence to the idea that the relevant measure in cube production is side length or volume. For example, we might have information about cube bottle customers that want to fill their cubes with water. Because the customers vary in how much water they want to fit in their cube bottles, it seems to me that we should put more credence into partitioning it according to volume. Or if we’d have some information that people often want to glue the cubes under their shoes to appear taller, the relevant measure would be the side length. Currently, we have no information like this, so we should assign equal credence to both measures.
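Here’s a small sketch (in Python) of the three parameterisations from the quoted example and the equal-credence mixture over side length and volume suggested above:

```python
# P(side-length <= 1/2) when a cube's side-length L is in [0, 1] and the
# quantity named below is taken to be uniformly distributed on [0, 1].
answers = {
    "uniform in side-length": 0.5,    # P(L <= 1/2)
    "uniform in face-area":   0.25,   # P(L^2 <= 1/4)
    "uniform in volume":      0.125,  # P(L^3 <= 1/8)
}

for name, p in answers.items():
    print(f"{name}: {p}")

# Equal credence on side length and volume being the relevant measure:
print("hierarchical mix:", 0.5 * answers["uniform in side-length"]
      + 0.5 * answers["uniform in volume"])  # 0.3125
```

With equal credence on the two measures, the mixture lands at 0.3125, between the two pure answers.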
I don’t think Greaves’ example suffers the same problem actually—if we truly don’t know anything about what the possible colours are (just that each book has one colour), then there’s no reason to prefer {red, yellow, blue, other} over {red, yellow, blue, green, other}.
In the case of truly having no information, I think it makes sense to use the Jeffreys prior in the cube factory case, because it’s invariant to reparameterisation, so it doesn’t matter whether the problem is framed in terms of length, area, volume, or some other parameterisation. I’m not sure what that actually looks like in this case, though.
Hm, but if we don’t know anything about the possible colours, the natural prior seems to me to be one that gives all colours the same probability. It seems arbitrary to group a subset of colours under the label “other” and to pretend it should be treated as a hypothesis on equal footing with the others in your set, which are single colours.
Yeah, Jeffreys prior seems to make sense here.