Thursday, June 12, 2014

Kullback and Leibler: "On Information and Sufficiency" (1951)

This paper is mostly known for introducing the concept of divergence into statistics. It's notation-heavy but actually quite straightforward if you've already solved exercise 3.9 in Cover and Thomas' textbook: everything is about log-likelihood ratios that fly off into infinity.

Much of the paper is concerned with the question of sufficient statistics. An interesting illustration in the final section (§6) concerns the problem of how to choose the best non-sufficient statistic when, for some extrinsic reason, we must use one.

More specifically, the setup is the following: You are observing samples from a categorical distribution with four different categories, and you want to distinguish between the following two hypotheses:
 x          1     2     3     4
 Pr(x | A)  0.25  0.25  0.25  0.25
 Pr(x | B)  0.5   0.3   0.1   0.1

However, you are not actually allowed to see the original data set; instead, you have to lump the categories into two groups, and you will then be told which group each observation fell into. (This corresponds to using an indicator function as your statistic.)

In order to solve this exercise, they quantify the informativeness of the statistic in terms of the symmetrized divergence between the two hypotheses:
J(A, B)  =  D(A || B) + D(B || A).
I guess a more cautious measure would have been the minimum of the two, since this would provide a lower bound on the growth rate of the likelihood ratio whichever of the two hypotheses were true.
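That lower-bound interpretation is easy to check empirically. Here is a quick simulation (my own sketch, not from the paper) showing that when hypothesis B is true, the average per-sample log-likelihood ratio against A concentrates around D(B || A):

```python
import math
import random

random.seed(0)

# The two hypotheses from the example in the post
A = [0.25, 0.25, 0.25, 0.25]
B = [0.5, 0.3, 0.1, 0.1]

n = 200_000
xs = random.choices(range(4), weights=B, k=n)  # sample under hypothesis B

# Average per-sample log-likelihood ratio log10 Pr(x | B) / Pr(x | A)
llr = sum(math.log10(B[x] / A[x]) for x in xs) / n
print(llr)  # concentrates around D(B || A), about .0947 in base 10
```

Sampling under A instead would make the average tend to -D(A || B), so the likelihood ratio in favor of the true hypothesis grows at rate at least min{D(A || B), D(B || A)} either way.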

Using base 10 logarithms, the divergence from the uniform to the skewed distribution (A to B) is .1039, and the divergence from the skewed to the uniform (B to A) is .0947. The symmetrized divergence J(A, B) is thus about .1986, and the minimum is .0947.
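Those numbers are easy to verify; a minimal sketch in Python (A and B are just the two hypotheses from the table above):

```python
import math

A = [0.25, 0.25, 0.25, 0.25]  # uniform hypothesis
B = [0.5, 0.3, 0.1, 0.1]      # skewed hypothesis

def D(p, q):
    """Kullback-Leibler divergence D(p || q), in base-10 logarithms."""
    return sum(pi * math.log10(pi / qi) for pi, qi in zip(p, q))

print(round(D(A, B), 4))            # 0.1039
print(round(D(B, A), 4))            # 0.0947
print(round(D(A, B) + D(B, A), 4))  # 0.1986
```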

Grouping the categories in various ways, we can also find the following divergences:

 Grouping          D(A || B)   D(B || A)   J(A, B)   min
 {}, {1, 2, 3, 4}    0           0           0         0
 {1}, {2, 3, 4}      .0568       .0625       .1193     .0568
 {2}, {1, 3, 4}      .0027       .0028       .0055     .0027
 {3}, {1, 2, 4}      .0401       .0315       .0716     .0315
 {4}, {1, 2, 3}      .0401       .0315       .0716     .0315
 {1, 2}, {3, 4}      .0969       .0837       .1806     .0837
 {1, 3}, {2, 4}      .0089       .0087       .0176     .0087
 {1, 4}, {2, 3}      .0089       .0087       .0176     .0087
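For reference, here is a short script (my own sketch, not from the paper) that recomputes these divergences by collapsing each hypothesis onto the two-group statistic:

```python
import math

A = [0.25, 0.25, 0.25, 0.25]
B = [0.5, 0.3, 0.1, 0.1]

def D(p, q):
    """KL divergence D(p || q) in base-10 logarithms."""
    return sum(pi * math.log10(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def grouped(dist, group):
    """Collapse a distribution over {1, 2, 3, 4} to (P(group), P(rest))."""
    p = sum(dist[i - 1] for i in group)
    return [p, 1 - p]

for g in [{1}, {2}, {3}, {4}, {1, 2}, {1, 3}, {1, 4}]:
    a, b = grouped(A, g), grouped(B, g)
    print(sorted(g), round(D(a, b), 4), round(D(b, a), 4),
          round(D(a, b) + D(b, a), 4), round(min(D(a, b), D(b, a)), 4))
```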

In the paper, Kullback and Leibler report that the best grouping is {{1}, {2, 3, 4}}. This is a bit strange: everything they say in the text indicates that they are thinking in terms of the symmetrized divergence J, yet by that criterion the grouping {{1, 2}, {3, 4}}, with J about .1806, comes out ahead of {{1}, {2, 3, 4}}, with J about .1193. I don't know what to conclude from that.