The maximum entropy principle does give implausible results if applied carelessly but the above reasoning seems very strange to me. The normal way to model this kind of scenario with the maximum entropy prior would be via Laplace’s Rule of Succession, as in Max’s comment below. We start with a prior for the probability that a randomly drawn ball is red and can then update on 99 red balls. This gives a 100⁄101 chance that the final ball is red (about 99%!). Or am I missing your point here?
Somewhat more formally, we’re looking at a Bernoulli trial—for each ball, there’s a probability p that it’s red. We start with the maximum entropy prior for p, which is the uniform distribution on the interval [0,1] (= beta(1,1)). We update on 99 red balls, which gives a posterior for p of beta(100,1), which has mean 100⁄101 (this is a standard result, see e.g. conjugate priors - the beta distribution is a conjugate prior for a Bernoulli likelihood).
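For anyone who wants to check the arithmetic, here is a minimal sketch of the conjugate update described above. The function name `posterior_mean_red` and its parameters are my own labels for illustration, not anything standard; it assumes a Beta(alpha, beta) prior on p and simply applies the usual Beta-Bernoulli update.

```python
from fractions import Fraction


def posterior_mean_red(red_seen, blue_seen, alpha=1, beta=1):
    """Posterior mean of p after the observed draws.

    Assumes a Beta(alpha, beta) prior on p (alpha = beta = 1 is the
    uniform prior on [0, 1]) and i.i.d. Bernoulli draws. The posterior
    is Beta(alpha + red_seen, beta + blue_seen), whose mean is
    (alpha + red_seen) / (alpha + beta + red_seen + blue_seen).
    """
    return Fraction(alpha + red_seen, alpha + beta + red_seen + blue_seen)


# 99 red balls and no blue balls seen so far.
p_next_red = posterior_mean_red(99, 0)
print(p_next_red)  # 100/101
```

With no observations at all, the same function returns 1/2, which is just the mean of the uniform prior.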
The more common objection to the maximum entropy principle comes when we try to reparametrise. A nice but simple example is van Fraassen’s cube factory: a factory manufactures cubes up to 2x2x2 feet. What’s the probability that a randomly selected cube has side length less than 1 foot? If we apply the maximum entropy principle (MEP), we say 1⁄2 because each cube has side length between 0 and 2 feet and MEP implies that each length is equally likely. But we could have equivalently asked: what’s the probability that a randomly selected cube has face area less than 1 square foot? Face area ranges from 0 to 4 square feet, so MEP implies a probability of 1⁄4. All and only those cubes with side length less than 1 have face area less than 1, so these are precisely the same events but MEP gave us different answers for their probabilities! We could do the same in terms of volume and get a different answer again. This inconsistency is the kind of implausible result most commonly pointed to.
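A quick sketch of the cube factory numbers, in case it helps. The helper `prob_side_below` is a hypothetical name of mine; it assumes that "applying MEP to a quantity" means putting a uniform distribution on that quantity over its full range, and then asks for the probability of the same event (side length below 1 foot) each time.

```python
from fractions import Fraction

SIDE_MAX = 2  # maximum side length in feet, from the example


def prob_side_below(threshold, power):
    """P(side < threshold) when side**power is taken to be uniform.

    power=1 puts the uniform prior on side length, power=2 on face
    area, power=3 on volume. The event side < threshold is the same
    as side**power < threshold**power, and the quantity ranges over
    [0, SIDE_MAX**power].
    """
    return Fraction(threshold) ** power / Fraction(SIDE_MAX) ** power


for name, power in [("side length", 1), ("face area", 2), ("volume", 3)]:
    print(name, prob_side_below(1, power))
# side length 1/2
# face area 1/4
# volume 1/8
```

Same event, three different probabilities, depending only on which quantity we chose to be uniform over.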
I think I disagree that that is the right maximum entropy prior in my ball example.
You know that you are drawing balls without replacement from a bag containing 100 balls, which can only be coloured blue or red. The maximum entropy prior given this information is that every one of the 2^100 possible colourings {Ball 1, Ball 2, Ball 3, …} → {Red, Blue} is equally likely (i.e. from the start the probability that all balls are red is 1 over 2^100).
I think the model you describe is only the correct approach if you make an additional assumption that all balls were coloured using an identical procedure, and were assigned to red or blue with some unknown, but constant, probability p. But that is an additional assumption. The assumption that the unknown p is the same for each ball is actually a very strong assumption.
If you want to adopt the maximum entropy prior consistent with the information I gave in the set-up of the problem, you’d adopt a prior where each of the 2^100 possible colourings is equally likely.
I think this is the right way to think about it anyway.
The reparametrisation example is very nice though; I wasn’t aware of that before.
Thanks for the clarification—I see your concern more clearly now. You’re right, my model does assume that all balls were coloured using the same procedure, in some sense—I’m assuming they’re independently and identically distributed.
Your case is another reasonable way to apply the maximum entropy principle, and I think it points to another problem with the principle, but I’d frame it slightly differently. I don’t think that the maximum entropy principle is directly problematic in the case you describe. If we assume that all balls are coloured by completely different procedures (i.e. so that the colour of one ball doesn’t tell us anything about the colours of the other balls), then seeing 99 red balls doesn’t tell us anything about the final ball. In that case, I think it’s reasonable (even required!) to have a 50% credence that it’s red and unreasonable to have a 99% credence, if your prior was 50%. If you find that result counterintuitive, then I think that’s more of a challenge to the independence assumption (that learning the colour of some balls tells you nothing about the colour of the others) than a challenge to the maximum entropy principle. (I appreciate you want to assume nothing about the colouring processes rather than assume independence, but in setting up your model this way, I think you’re assuming it implicitly.)
Perhaps another way to see this: if you don’t follow the maximum entropy principle and instead have a prior of 30% that the final ball is red and then draw 99 red balls, in your scenario, you should maintain 30% credence (if you don’t, then you’ve assumed something about the colouring process that makes the balls not independent). If you find that counterintuitive, then the issue must be with the assumption that learning the colour of some balls tells you nothing about the colour of the others, because we haven’t used the principle of maximum entropy in that case.
I think this actually points to a different problem with the maximum entropy principle in practice: we rarely come from a position of complete ignorance (or complete ignorance besides a given mean, variance etc.), so it’s actually rarely applicable. Following the principle sometimes gives counterintuitive/unreasonable results because we actually know a lot more than we realise and we lose much of that information when we apply the maximum entropy principle.
I think I disagree with your claim that I’m implicitly assuming independence of the ball colourings.
I start by looking for the maximum entropy distribution within all possible probability distributions over the 2^100 possible colourings. Most of these probability distributions do not have the property that balls are coloured independently. For example, if the distribution was a 50% probability of all balls being red, and 50% probability of all balls being blue, then learning the colour of a single ball would immediately tell you the colour of all of the others.
But it just so happens that for the probability distribution which maximises the entropy, the ball colourings do turn out to be independent. If you adopt the maximum entropy distribution as your prior, then learning the colour of one tells you nothing about the others. This is an output of the calculation, rather than an assumption.
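Here is a rough sketch of that calculation on a smaller bag, since enumerating 2^100 colourings isn't feasible. It assumes N = 10 balls as a stand-in for 100, and compares the uniform distribution over colourings with the "all red or all blue" distribution from my example above. The names `entropy_bits` and `p_last_red_given_rest_red` are just illustrative helpers.

```python
from itertools import product
from math import log2

N = 10  # small stand-in for the 100 balls so all 2**N colourings can be listed

colourings = list(product("RB", repeat=N))

# Uniform distribution over colourings (the maximum entropy one).
uniform = {c: 1 / len(colourings) for c in colourings}

# Correlated example: all red or all blue, 50% each.
all_same = {c: 0.0 for c in colourings}
all_same[("R",) * N] = 0.5
all_same[("B",) * N] = 0.5


def entropy_bits(dist):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def p_last_red_given_rest_red(dist):
    """P(ball N is red | balls 1..N-1 are all red)."""
    prefix = ("R",) * (N - 1)
    p_prefix = sum(p for c, p in dist.items() if c[:-1] == prefix)
    return dist[("R",) * N] / p_prefix


for name, dist in [("uniform", uniform), ("all same", all_same)]:
    print(name, entropy_bits(dist), p_last_red_given_rest_red(dist))
```

The uniform distribution has the larger entropy (N bits versus 1 bit), and under it the conditional probability for the last ball stays at one half, whereas under the correlated distribution it jumps to 1. Independence was never put in by hand; it falls out of choosing the uniform distribution.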
I think I agree with your last paragraph, although there are some real problems here that I don’t know how to solve. Why should we expect any of our existing knowledge to be a good guide to what we will observe in future? It has been a good guide in the past, but so what? 99 red balls apparently doesn’t tell us that the 100th will likely be red, for certain seemingly reasonable choices of prior.
I guess what I was trying to say in my first comment is that the maximum entropy principle is not a solution to the problem of induction, or even an approximate solution. Ultimately, I don’t think anyone knows how to choose priors in a properly principled way. But I’d very much like to be corrected on this.
As a side-note, the maximum entropy principle would tell you to choose the maximum entropy prior given the information you have, and so if you intuit the information that the balls are likely to be produced by the same process, you’ll get a different prior than if you don’t have that information.
I.e., your disagreement might stem from the fact that the maximum entropy principle gives different answers conditional on different information.
I.e., you actually have information to differentiate between drawing n balls and flipping a fair coin n times.
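One way to make the difference between the two information states concrete is to compare the prior probability each assigns to all n balls being red. This is only a sketch under the two models already discussed in this thread (uniform over colourings, versus i.i.d. draws with an unknown p that is uniform on [0,1]); the notation is mine.

```latex
% Prior probability that all n balls are red under each information state
\begin{align*}
\text{uniform over colourings (fair-coin model):}\quad
  & P(\text{all } n \text{ red}) = 2^{-n}, \\
\text{i.i.d. with unknown } p \sim \mathrm{Uniform}[0,1]:\quad
  & P(\text{all } n \text{ red}) = \int_0^1 p^{n}\,dp = \frac{1}{n+1}.
\end{align*}
```

For n = 100 that is 1 over 2^100 versus 1⁄101, so the two priors disagree enormously about how surprising a run of red balls is, even though both can be described as maximum entropy given some body of information.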