## Wednesday, April 2, 2014

### Jeffreys: Scientific Inference, third edition (1973)

This book, first published in 1931, covers much of the same ground as Jeffreys' Theory of Probability (1939), but it is shorter and easier to read.

It touches on a number of extremely interesting topics, including
• the asymptotic equivalence of Bayesian inference and maximum likelihood inference (§4.3, pp. 73–74)
• a Bayesian notion of "statistical test" (§3.5, pp. 54–56)
• priors based on theory complexity (§2.5, especially pp. 34–39)
• the convergence of predictive distributions to the truth under (very) nice conditions (§2.5)
• an inductive justification of the principle of induction through hierarchical models (§3.7, pp. 58–60)
• a critique of the frequency theory of probability (§9.21, pp. 193–197)
• a number of other philosophical issues surrounding induction and probability (§9)
I might write about some of these issues later, but for now I want to focus on a specific little detail that I liked. It's a combinatorial argument for Laplace's rule, which I have otherwise only seen justified through the use of Euler integrals.

 Laplace's rule: Add a "virtual count" to each bucket before the parameter estimation.
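For the two-bucket case (white versus non-white), the rule can be sketched as a one-line function. This is a minimal illustration, and the function name is mine:

```python
def laplace_estimate(s, m):
    """Laplace's rule of succession: probability that the next draw is
    white after observing s white draws out of m.

    One virtual count goes into each of the two buckets (white and
    non-white), so the numerator grows by 1 and the denominator by 2."""
    return (s + 1) / (m + 2)

# With no data at all, the estimate is the uniform 1/2:
print(laplace_estimate(0, 0))   # 0.5
# Observed counts pull the estimate toward the empirical frequency:
print(laplace_estimate(9, 10))  # 10/12 = 0.8333...
```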

### The Generative Set-Up

Suppose that you have a population of n swans, r of which are white. We'll assume that r is uniformly distributed on 0, 1, 2, …, n. You now inspect a sample of m < n swans and find s white swans among them.

It then turns out that the probability that the next swan is white is completely independent of n: whatever the size of the population is, the probability of seeing one more white swan turns out to be (s + 1)/(m + 2) when we sum out the effect of r.

 A population of n swans contains r white ones; in a sample of m swans, s are white.

Let me go into a little more detail. Given n, m, and r, the probability of finding s white swans in the sample follows a hypergeometric distribution; that is,
Pr( s | n, m, r )  ==  C(r, s) C(n − r, m − s) / C(n, m),
where C(a, b) is my one-dimensional notation for the binomial coefficient "a choose b." The argument for this formula is that
• C(r, s) is the number of ways of choosing s white swans out of a total of r white swans.
• C(n − r, m − s) is the number of ways of choosing the remaining m − s swans in the sample from the remaining n − r swans in the population.
• C(n, m) is the total number of ways to sample m swans from a population of n.
The numerator thus counts the number of ways to select the sample so that it respects the constraint set by the number s, while the denominator counts the number of samples with or without this constraint.
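The counting argument is easy to check numerically. Here is a minimal sketch using Python's `math.comb`; the function name is mine:

```python
from math import comb

def hypergeom_pmf(s, n, m, r):
    """Probability of s white swans in a sample of m, drawn without
    replacement from a population of n swans of which r are white."""
    return comb(r, s) * comb(n - r, m - s) / comb(n, m)

# The probabilities over all possible sample counts s sum to 1,
# which is the Vandermonde identity in disguise:
n, m, r = 20, 7, 12
total = sum(hypergeom_pmf(s, n, m, r) for s in range(m + 1))
print(total)  # ≈ 1.0 (up to floating-point rounding)
```

Note that `math.comb(a, b)` returns 0 when b > a, which conveniently zeroes out the impossible configurations (for instance, more white swans in the sample than in the population).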

### Inverse Hypergeometric Probabilities

In general, binomial coefficients have the following two properties:
• C(a, b) (a − b)  ==  C(a, b + 1) (b + 1)
• C(a + 1, b + 1)  ==  C(a, b) (a + 1)/(b + 1)
We'll need both of these facts below. They can be shown directly by cancelling out factors in the factorial expression for the binomial coefficients.
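Both facts are also easy to spot-check numerically. This is a small sanity check, not part of the proof; the second identity is stated in the multiplied-out integer form to avoid division:

```python
from math import comb

# Identity 1: C(a, b) * (a - b) == C(a, b + 1) * (b + 1)
# Identity 2: C(a + 1, b + 1) * (b + 1) == C(a, b) * (a + 1)
for a in range(1, 12):
    for b in range(a):
        assert comb(a, b) * (a - b) == comb(a, b + 1) * (b + 1)
        assert comb(a + 1, b + 1) * (b + 1) == comb(a, b) * (a + 1)
print("both identities hold")
```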

One consequence is that Bayes' rule takes on a particularly simple form in the hypergeometric case:
• Pr( r | n, m, s )  ==  Pr( s | n, m, r ) (m + 1)/(n + 1)
• Pr( s | n, m, r )  ==  Pr( r | n, m, s ) (n + 1)/(m + 1)
• Pr( s + 1 | n, m + 1, r )  ==  Pr( r | n, m + 1, s + 1 ) (n + 1)/(m + 2)
These equalities all express the same underlying relation (the third is just the second with m and s replaced by m + 1 and s + 1), but I state all three forms because they will all come up.
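The first of these relations can be confirmed by computing the posterior directly from the uniform prior on r. This is a brute-force check; the helper names are mine:

```python
from math import comb

def likelihood(s, n, m, r):
    # Hypergeometric probability Pr(s | n, m, r).
    return comb(r, s) * comb(n - r, m - s) / comb(n, m)

def posterior(r, n, m, s):
    # Bayes' rule with a uniform prior on r in {0, ..., n}:
    # the prior 1/(n + 1) cancels in numerator and denominator.
    evidence = sum(likelihood(s, n, m, rr) for rr in range(n + 1))
    return likelihood(s, n, m, r) / evidence

n, m, s = 15, 6, 4
for r in range(n + 1):
    direct = posterior(r, n, m, s)
    shortcut = likelihood(s, n, m, r) * (m + 1) / (n + 1)
    assert abs(direct - shortcut) < 1e-12
print("Pr(r | n, m, s) == Pr(s | n, m, r) * (m + 1)/(n + 1)")
```

The shortcut works because the evidence term sums to (n + 1)/(m + 1), which is the identity behind all three forms above.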

By using the first of the two rules for binomial coefficients (applied once to the numerator and once to the denominator), we can also show that
Pr( s | n, m, r ) (r − s)/(n − m)  ==  Pr( s + 1 | n, m + 1, r ) (s + 1)/(m + 1)
According to the last fact about the inverse hypergeometric probabilities, this also means that
Pr( s | n, m, r ) (r − s)/(n − m)  ==  Pr( r | n, m + 1, s + 1 ) (n + 1)(s + 1)/((m + 1)(m + 2))
The leftover factor of (n + 1) may look like a blemish, but it will cancel neatly in the final derivation. I will use this fact below.
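This combined identity can also be checked numerically, with the posterior computed by brute force as an independent reference. A sketch, with helper names of my own:

```python
from math import comb

def likelihood(s, n, m, r):
    # Hypergeometric probability Pr(s | n, m, r).
    return comb(r, s) * comb(n - r, m - s) / comb(n, m)

def posterior(r, n, m, s):
    # Posterior over r under a uniform prior on {0, ..., n}.
    evidence = sum(likelihood(s, n, m, rr) for rr in range(n + 1))
    return likelihood(s, n, m, r) / evidence

n, m, s = 12, 5, 3
for r in range(n + 1):
    lhs = likelihood(s, n, m, r) * (r - s) / (n - m)
    rhs = (posterior(r, n, m + 1, s + 1)
           * (n + 1) * (s + 1) / ((m + 1) * (m + 2)))
    assert abs(lhs - rhs) < 1e-12
print("identity holds for every r")
```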

### Expanding the Predictive Probability

By assumption, we have inspected s out of the r white swans, so there are r − s white swans left. We have further inspected m out of the n swans, so there are n − m swans left in total. The probability that the next swan will be white is thus (r − s)/(n − m).

If we call this event q, then we have, by the sum rule of probability,
Pr( q | n, m, s )  ==   Σr Pr( q, r | n, m, s )
By the chain rule of probabilities, we further have
Pr( q | n, m, s )  ==   Σr Pr( q | n, m, s, r ) Pr( r | n, m, s )
As argued above, we have
• Pr( q | n, m, s, r ) = (r − s)/(n − m)
• Pr( r | n, m, s ) = Pr( s | n, m, r ) (m + 1)/(n + 1)
• Pr( s | n, m, r ) (r − s)/(n − m)  ==  Pr( r | n, m + 1, s + 1 ) (n + 1)(s + 1)/((m + 1)(m + 2))
Putting these facts together and cancelling the factors (m + 1) and (n + 1), we get
Pr( q | n, m, s )  ==   (s + 1)/(m + 2) Σr Pr( r | n, m + 1, s + 1 )
I have pulled the constant factors out of the summation here. Notice further that the summation is a sum of posterior probabilities over all the possible values of r. It must consequently sum to 1. We thus have
Pr( q | n, m, s )  ==   (s + 1) / (m + 2)
as we wanted to prove. This is exactly Laplace's rule: one virtual count in each of the two buckets.
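As a final sanity check, the whole argument can be verified by brute force: for fixed m and s, the predictive probability comes out the same for every population size n. A sketch, with helper names of my own:

```python
from math import comb

def predictive(n, m, s):
    """Pr(next swan is white | n, m, s), summing over the uniform
    prior on the unknown number r of white swans."""
    def likelihood(r):
        return comb(r, s) * comb(n - r, m - s) / comb(n, m)
    evidence = sum(likelihood(r) for r in range(n + 1))
    # Weight (r - s)/(n - m) by the posterior probability of each r.
    return sum(likelihood(r) / evidence * (r - s) / (n - m)
               for r in range(n + 1))

m, s = 10, 7
for n in (11, 20, 50, 200):
    assert abs(predictive(n, m, s) - (s + 1) / (m + 2)) < 1e-9
print((s + 1) / (m + 2))  # 8/12 ≈ 0.667, independently of n
```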