It touches on a number of extremely interesting topics, including
- the asymptotic equivalence of Bayesian inference and maximum likelihood inference (§4.3, pp. 73–74)
- a Bayesian notion of "statistical test" (§3.5, pp. 54–56)
- priors based on theory complexity (§2.5, especially pp. 34–39)
- the convergence of predictive distributions to the truth under (very) nice conditions (§2.5)
- an inductive justification of the principle of induction through hierarchical models (§3.7, pp. 58–60)
- a critique of the frequency theory of probability (§9.21, pp. 193–197)
- a number of other philosophical issues surrounding induction and probability (§9)
Laplace's rule: Add a "virtual count" to each bucket before the parameter estimation.
The Generative Set-Up
Suppose that you have a population of n swans, r of which are white. We'll assume that r is uniformly distributed on 0, 1, 2, …, n. You now inspect a sample of m < n swans and find s white swans among them. It then turns out that the probability that the next swan is white is completely independent of n: whatever the size of the population, the probability of seeing one more white swan comes out to (s + 1)/(m + 2) once we integrate out the effect of r, which is exactly Laplace's rule of succession.
A population of n swans contains r white ones; in a sample of m swans, s are white.
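To make the claim concrete before the derivation, here is a minimal brute-force sketch; the population sizes 10, 50, and 1000 and the sample figures m = 7, s = 5 are arbitrary numbers of my own, not anything fixed by the set-up.

```python
from math import comb

def predictive_prob(n, m, s):
    """Pr(next swan is white | sample of m with s white), computed by
    enumerating r; the uniform prior 1/(n + 1) cancels in the ratio."""
    num = den = 0.0
    for r in range(n + 1):
        if r < s or n - r < m - s:
            continue  # impossible configurations contribute nothing
        p_s = comb(r, s) * comb(n - r, m - s) / comb(n, m)  # hypergeometric
        num += p_s * (r - s) / (n - m)   # ... and the next swan is white
        den += p_s
    return num / den

# The answer should not depend on n and should equal (s + 1)/(m + 2).
for n in (10, 50, 1000):
    print(n, predictive_prob(n, m=7, s=5), (5 + 1) / (7 + 2))
```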
Let me go into a little more detail. Given n, m, and r, the probability of finding s white swans in the sample follows a hypergeometric distribution; that is,
Pr( s | n, m, r ) == C(r, s) C(n – r, m – s) / C(n, m)
where C(a, b) is my one-dimensional notation for the binomial coefficient "a choose b" (a quick numerical check of this formula follows the list below). The argument for this formula is that
- C(r, s) is the number of ways of choosing s white swans out of a total of r white swans.
- C(n – r, m – s) is the number of ways of choosing the remaining m – s swans in the sample from the remaining n – r swans in the population.
- C(n, m) is the total number of ways to sample m swans from a population of n.
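Here is the promised numerical check. It compares the formula with a direct enumeration of all equally likely samples for one small, arbitrary configuration of my own (n = 7, r = 4, m = 3):

```python
from itertools import combinations
from math import comb

def hyper_pmf(s, n, m, r):
    """Pr( s | n, m, r ) as given by the formula above."""
    return comb(r, s) * comb(n - r, m - s) / comb(n, m)

n, r, m = 7, 4, 3                         # small, arbitrary example numbers
swans = [True] * r + [False] * (n - r)    # True marks a white swan
counts = {}
for sample in combinations(range(n), m):  # every sample is equally likely
    s = sum(swans[i] for i in sample)
    counts[s] = counts.get(s, 0) + 1

for s in sorted(counts):
    print(s, counts[s] / comb(n, m), hyper_pmf(s, n, m, r))  # the two columns agree
```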
Inverse Hypergeometric Probabilities
In general, binomial coefficients have the following two properties:
- C(a, b) (a – b) == C(a, b + 1) (b + 1)
- C(a + 1, b + 1) == C(a, b) (a + 1)/(b + 1)
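Both identities follow directly from the factorial definition; if you prefer to trust a computer, here is a tiny spot check over a handful of small (a, b) pairs:

```python
from math import comb

# Spot-check the two binomial-coefficient identities for small a and b.
for a in range(1, 10):
    for b in range(a):
        assert comb(a, b) * (a - b) == comb(a, b + 1) * (b + 1)
        assert comb(a + 1, b + 1) == comb(a, b) * (a + 1) // (b + 1)
print("both identities hold for all pairs tested")
```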
One consequence is that Bayes' rule takes on a particularly simple form in the hypergeometric case. Since the prior on r is uniform, Pr( r | n, m, s ) is proportional to Pr( s | n, m, r ), and because Σ_r C(r, s) C(n – r, m – s) == C(n + 1, m + 1), the second rule above gives the normalizing constant Σ_r Pr( s | n, m, r ) == (n + 1)/(m + 1); equivalently, the marginal Pr( s | n, m ) is uniform on 0, 1, …, m. We therefore have
- Pr( r | n, m, s ) == Pr( s | n, m, r ) (m + 1)/(n + 1)
- Pr( s | n, m, r ) == Pr( r | n, m, s ) (n + 1)/(m + 1)
- Pr( s + 1 | n, m + 1, r ) == Pr( r | n, m + 1, s + 1 ) (n + 1)/(m + 2), which is just the second identity with m and s replaced by m + 1 and s + 1
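As a sanity check on the first of these identities, the following sketch computes the posterior over r the long way (uniform prior, then normalize) and compares it with the shortcut; n = 12, m = 5, s = 3 are arbitrary example numbers of my own.

```python
from math import comb

def hyper_pmf(s, n, m, r):
    return comb(r, s) * comb(n - r, m - s) / comb(n, m)

n, m, s = 12, 5, 3  # arbitrary example configuration

# Posterior over r the long way: uniform prior, explicit normalization.
likelihoods = [hyper_pmf(s, n, m, r) for r in range(n + 1)]
posterior = [lik / sum(likelihoods) for lik in likelihoods]

# The shortcut from the first bullet above.
shortcut = [hyper_pmf(s, n, m, r) * (m + 1) / (n + 1) for r in range(n + 1)]

assert all(abs(p - q) < 1e-12 for p, q in zip(posterior, shortcut))
print(sum(shortcut))  # a proper posterior: sums to 1.0 (up to rounding)
```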
By using the first of the two rules for binomial coefficients (applied once to C(r, s) and once to C(n, m)), we can also show that
Pr( s | n, m, r ) (r – s)/(n – m) == Pr( s + 1 | n, m + 1, r ) (s + 1)/(m + 1)
According to the last fact about the inverse hypergeometric probabilities, this also means that
Pr( s | n, m, r ) (r – s)/(n – m) == Pr( r | n, m + 1, s + 1 ) (n + 1)(s + 1) / [(m + 1)(m + 2)]
The stray factor of (n + 1) will cancel in the final step. I will use this fact below.
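Since this is the one step where it is easy to drop a factor, here is a quick numerical confirmation of the last identity for one arbitrary configuration of my own (n = 9, m = 4, s = 2, r = 6):

```python
from math import comb

def hyper_pmf(s, n, m, r):
    return comb(r, s) * comb(n - r, m - s) / comb(n, m)

def posterior(r, n, m, s):
    """Pr( r | n, m, s ) via the shortcut Pr( s | n, m, r ) (m + 1)/(n + 1)."""
    return hyper_pmf(s, n, m, r) * (m + 1) / (n + 1)

n, m, s, r = 9, 4, 2, 6  # arbitrary example numbers

lhs = hyper_pmf(s, n, m, r) * (r - s) / (n - m)
rhs = posterior(r, n, m + 1, s + 1) * (n + 1) * (s + 1) / ((m + 1) * (m + 2))
print(lhs, rhs)  # the two sides agree
```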
Expanding the Predictive Probability
By assumption, we have inspected s out of the r white swans, so there are r – s white swans left. We have further inspected m out of the n swans, so there is a total of n – m swans left. The probability that the next swan will be white is thus (r – s)/(n – m). If we call this event q, then we have, by the sum rule of probability,
Pr( q | n, m, s ) == Σ_r Pr( q, r | n, m, s )
By the chain rule of probabilities, we further have
Pr( q | n, m, s ) == Σ_r Pr( q | n, m, s, r ) Pr( r | n, m, s )
As argued above, we have
- Pr( q | n, m, s, r ) == (r – s)/(n – m)
- Pr( r | n, m, s ) == Pr( s | n, m, r ) (m + 1)/(n + 1)
- Pr( s | n, m, r ) (r – s)/(n – m) == Pr( r | n, m + 1, s + 1 ) (n + 1)(s + 1) / [(m + 1)(m + 2)]
Pr( q | n, m, s ) == (s + 1)/(m + 2) Σ_r Pr( r | n, m + 1, s + 1 )
Here I have substituted the three facts above and pulled the constant factors out of the summation; the factors of (n + 1) and (m + 1) cancel, leaving (s + 1)/(m + 2). Notice further that the summation is a sum of probabilities for the possible values of r. It must consequently sum to 1. We thus have
Pr( q | n, m, s ) == (s + 1)/(m + 2)
as we wanted to prove.
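If you would rather see the result come out of a simulation than out of the algebra, here is a minimal Monte Carlo sketch of the whole set-up; the population sizes 20 and 200 and the sample figures m = 7, s = 5 are again arbitrary choices of mine.

```python
import random

def simulate(n, m, s_obs, trials=200_000):
    """Estimate Pr(next swan is white) among runs whose first m swans
    contain exactly s_obs white ones."""
    hits = matches = 0
    for _ in range(trials):
        r = random.randint(0, n)                 # uniform prior on r
        swans = [True] * r + [False] * (n - r)   # True marks a white swan
        random.shuffle(swans)
        if sum(swans[:m]) == s_obs:              # condition on the observed sample
            matches += 1
            hits += swans[m]                     # look at the (m + 1)-th swan
    return hits / matches

# Both estimates should be close to (s + 1)/(m + 2) = 6/9 ≈ 0.667,
# regardless of the population size.
print(simulate(n=20, m=7, s_obs=5))
print(simulate(n=200, m=7, s_obs=5))
```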