Thursday, June 12, 2014

Kullback and Leibler: "On Information and Sufficiency" (1951)

This paper is mostly known for introducing the concept of divergence into statistics. It's notation-heavy, but actually quite straightforward if you've already solved exercise 3.9 in Cover and Thomas' textbook. Everything is about log-likelihood ratios that fly off into infinity.

The paper is concerned to a large extent with the question of sufficient statistics. An interesting illustration in the last paragraph (§6) concerns the problem of how to choose the best non-sufficient statistic when we must use one (for some extrinsic reason).

More specifically, the set-up is the following: you are observing samples from a categorical distribution with four categories, and you want to distinguish between the following two hypotheses:
x            1     2     3     4
Pr(x | A)    .25   .25   .25   .25
Pr(x | B)    .50   .30   .10   .10

However, you are not actually allowed to see the original data set; instead, you have to lump the categories into two groups, and you will then be told which group the observation fell into. (This corresponds to using an indicator function as your statistic.)

In order to solve this exercise, they quantify the informativeness of the statistic in terms of the symmetrized divergence between the two hypotheses:
J(A, B) = D(A || B) + D(B || A).
I guess a more cautious measure would have been the minimum of the two, since this would provide a lower bound on the growth rate of the likelihood ratio whichever of the two hypotheses were true.

Using base 10 logarithms, the divergence from the uniform to the skewed distribution (A to B) is .1039, and the divergence from the skewed to the uniform (B to A) is .0947. The symmetrized divergence J(A, B) is thus about .1986, and the minimum is .0947.
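
To make these figures easy to check, here is a minimal Python sketch (my own, not from the paper) that reproduces them:

    # Base-10 KL divergences between the uniform hypothesis A and the skewed
    # hypothesis B from the table above.
    from math import log10

    A = [.25, .25, .25, .25]
    B = [.50, .30, .10, .10]

    def D(p, q):
        """Kullback-Leibler divergence D(p || q), using base-10 logarithms."""
        return sum(pi * log10(pi / qi) for pi, qi in zip(p, q))

    print(D(A, B))                 # ~0.1039
    print(D(B, A))                 # ~0.0947
    print(D(A, B) + D(B, A))       # J(A, B), ~0.1986
    print(min(D(A, B), D(B, A)))   # ~0.0947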

Grouping the categories in various ways, we can also find the following divergences:

Grouping            D(A || B)   D(B || A)   J(A, B)   min
{}, {1, 2, 3, 4}    0           0           0         0
{1}, {2, 3, 4}      .5068       .0635       .1193     .0635
{2}, {1, 3, 4}      .0027       .0028       .0055     .0027
{3}, {1, 2, 4}      .0401       .0315       .0716     .0315
{4}, {1, 2, 3}      .0401       .0315       .0716     .0315
{1, 2}, {3, 4}      .0969       .0837       .1806     .0837
{1, 3}, {2, 4}      .0089       .0087       .0176     .0087
{1, 4}, {2, 3}      .0089       .0087       .0176     .0087
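
The rows of this table can be reproduced by collapsing the four categories into two bins; here, as a sketch of my own, is the computation for the grouping {1, 2}, {3, 4}:

    # Lump the four categories into a bin and its complement, then compare the
    # induced two-point distributions under the two hypotheses.
    from math import log10

    A = {1: .25, 2: .25, 3: .25, 4: .25}
    B = {1: .50, 2: .30, 3: .10, 4: .10}

    def lump(p, group):
        """Probabilities of falling inside and outside the group under p."""
        inside = sum(p[x] for x in group)
        return [inside, 1 - inside]

    def D(p, q):
        return sum(pi * log10(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    pA, pB = lump(A, {1, 2}), lump(B, {1, 2})
    print(D(pA, pB), D(pB, pA))    # ~0.0969, ~0.0837
    print(D(pA, pB) + D(pB, pA))   # J, ~0.1806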

In the paper, Kullback and Leibler report that the best grouping is {{1}, {2, 3, 4}}, which indicates that they might have used D(A || B) as their objective function. This is a bit strange, since everything they say in the text indicates that they are thinking exclusively in terms of the symmetrized divergence J. I don't know what to conclude from that.

Monday, June 2, 2014

Fisher: Statistical Methods and Scientific Inference (1956), chapter 5.7

In chapter V of his book on statistical inference, Fisher compares various likelihood-based approaches to coming up with a predictive distribution. Some of the details are quite obscure, and I have problems following his argument in several places.

Section 2 is dedicated to the Bayesian solution for a series of coin flips. It contains, among other things, a strange reference to coefficients other than the binomials, suggesting that these can be replaced freely with any other suitable polynomial (p. 112). I am not quite sure what he means or how he imagines this should be justified.

Section 7 is dedicated to a particular type of frequentist prediction. The set-up is that we have observed the counts a and b (heads and tails) and that we would like to find the likelihood of a continuation having counts c and d.

In order to find this, he suggests that we compute the likelihood ratio
Pr(a, b | f) Pr(c, d | f) / Pr(a + c, b + d | f).
The logic behind this computation is that if a/b is close to c/d, then the joint probability of those two ratios is approximately the same as the probability of the pooled ratio (a + c)/(b + d). If, on the other hand, they both deviate strongly from the maximum likelihood estimate (in different directions), then the joint probability will be lower than the pooled reference value. However, the actual value of the coin-flip parameter cancels out of the fraction and thus plays no direct role in the computation.
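
As a quick numerical check of that cancellation, here is a sketch (my own, using ordinary binomial sampling probabilities) showing that the ratio does not depend on the parameter and reduces to a ratio of binomial coefficients:

    # Fisher's ratio Pr(a, b | f) Pr(c, d | f) / Pr(a + c, b + d | f), with
    # binomial probabilities; the parameter f cancels out.
    from math import comb

    def binom_pmf(heads, tails, f):
        return comb(heads + tails, heads) * f**heads * (1 - f)**tails

    def fisher_ratio(a, b, c, d, f):
        return binom_pmf(a, b, f) * binom_pmf(c, d, f) / binom_pmf(a + c, b + d, f)

    a, b, c, d = 3, 16, 14, 7
    for f in (0.2, 0.5, 0.8):
        print(fisher_ratio(a, b, c, d, f))   # same value (~0.00127) for every f

    # The parameter-free form: a ratio of binomial coefficients.
    print(comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c))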

Fisher goes through an example in which a = 3, b = 16, c = 14, d = 7. In order to compute the many factorials for this example, he uses the (very rough) approximation
x! ≈ x^x.
Or at least that's what I think he does — he comments that this x^x is "among others having the same margins," whatever that is supposed to mean (p. 129).

[Figure: N! (green) and the approximation N^N (red).]
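
The approximation is very crude on its own, as the figure suggests; a tiny sketch makes the point:

    # N! against the very rough overestimate N^N.
    from math import factorial

    for n in (1, 2, 5, 10, 19):
        print(n, factorial(n), n ** n)

Since the same kind of overestimate appears in both the numerator and the denominator of the likelihood ratio, though, much of the error cancels, which is presumably why the shortcut is tolerable here.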

At any rate, we can redo his computation using both his own approximate method and the exact formula, getting a likelihood of about .004 (Fisher's result) or .001 (the exact result).
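
Both numbers are easy to reproduce; here is a sketch (mine) that evaluates the parameter-free ratio once with the x^x shortcut and once with exact factorials:

    # Compare Fisher's x! ≈ x^x shortcut with exact factorials for the example
    # a = 3, b = 16, c = 14, d = 7.
    from math import factorial

    a, b, c, d = 3, 16, 14, 7

    def ratio(fact):
        """(a+b)! (c+d)! (a+c)! (b+d)! / (a! b! c! d! (a+b+c+d)!)."""
        num = fact(a + b) * fact(c + d) * fact(a + c) * fact(b + d)
        den = fact(a) * fact(b) * fact(c) * fact(d) * fact(a + b + c + d)
        return num / den

    print(ratio(lambda x: float(x) ** x))   # ~0.0036, Fisher's ≈ .004
    print(ratio(factorial))                 # ~0.0013, the exact ≈ .001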

As he suggests on page 130, we can also compute the likelihoods for the other values of c and d adding up to 21, and collect all of these results in a table (indexed by c, with d = 21 - c):

c     Fisher   Exact   Bayes
0     .094     .098    .048
1     .499     .223    .109
2     .836     .309    .151
3     .991     .336    .164
4     .964     .311    .152
5     .817     .256    .125
6     .621     .192    .094
7     .432     .133    .065
8     .277     .085    .042
9     .164     .051    .025
10    .090     .028    .014
11    .046     .015    .007
12    .022     .007    .003
13    .009     .003    .002
14    .004     .001    .001
15    .001     .000    .000
16    .000     .000    .000
Sum   5.867    2.050   1.000

As the table shows, the exact likelihoods coincide with the Bayes estimates if we normalize them. I think this is only the case because the numbers are large enough, due to an issue with the normalizing constants of the Dirichlet distribution, but I don't have time to check the details now.
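
To double-check this, here is a final sketch (my own reconstruction) that regenerates the columns of the table; I'm assuming that the "Bayes" column is the predictive distribution under a uniform prior (i.e., a beta-binomial with parameters a + 1 and b + 1), which matches the printed values. Dividing the exact column by its sum does indeed give back the Bayes column.

    # Regenerate the Fisher, exact, and Bayes columns for continuations of
    # length 21, and check that the normalized exact column equals Bayes.
    # (The uniform-prior reading of the Bayes column is my assumption.)
    from math import comb, factorial

    a, b, n = 3, 16, 21

    def fisher_approx(c, d, fact=lambda x: float(x) ** x if x else 1.0):
        num = fact(a + b) * fact(c + d) * fact(a + c) * fact(b + d)
        den = fact(a) * fact(b) * fact(c) * fact(d) * fact(a + b + c + d)
        return num / den

    def exact(c, d):
        return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)

    def bayes(c, d):
        # Beta-binomial predictive with a uniform prior on the coin parameter.
        beta = lambda x, y: factorial(x - 1) * factorial(y - 1) / factorial(x + y - 1)
        return comb(n, c) * beta(a + c + 1, b + d + 1) / beta(a + 1, b + 1)

    total = sum(exact(c, n - c) for c in range(n + 1))   # ~2.05
    for c in (3, 14):
        d = n - c
        print(c, fisher_approx(c, d), exact(c, d), exact(c, d) / total, bayes(c, d))
        # e.g. c = 3:  ~0.991, ~0.336, ~0.164, ~0.164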