
Wednesday, April 2, 2014

Jeffreys: Scientific Inference, third edition (1973)

This book, first published in 1931, covers much of the same ground as Jeffreys' Theory of Probability (1939), but it's shorter and easier to read.

It touches on a number of extremely interesting topics, including
  • the asymptotic equivalence of Bayesian inference and maximum likelihood inference (§4.3, pp. 73–74)
  • a Bayesian notion of "statistical test" (§3.5, pp. 54–56)
  • priors based on theory complexity (§2.5, especially pp. 34–39)
  • the convergence of predictive distributions to the truth under (very) nice conditions (§2.5)
  • an inductive justification of the principle of induction through hierarchical models (§3.7, pp. 58–60)
  • a critique of the frequency theory of probability (§9.21, pp. 193–197)
  • a number of other philosophical issues surrounding induction and probability (§9)
I might write about some of these issues later, but for now I want to focus on a specific little detail that I liked. It's a combinatorial argument for Laplace's rule of succession, which I have otherwise only seen justified through the use of Euler integrals.


Laplace's rule: Add a "virtual count" to each bucket before the parameter estimation.

The Generative Set-Up

Suppose that you have a population of n swans, r of which are white. We'll assume that r is uniformly distributed on 0, 1, 2, …, n. You now inspect a sample of m < n swans and find s white swans among them.

It then turns out that the probability that the next swan is white is completely independent of n: whatever the size of the population, the probability of seeing one more white swan comes out to (s + 1)/(m + 2) once we sum out the unknown count r.

A population of n swans contains r white ones;
in a sample of m swans, s are white.

Let me go into a little more detail. Given n, m, and r, the probability of finding s white swans in the sample follows a hypergeometric distribution; that is,
Pr( s | n, m, r )  ==  C(r, s) C(n − r, m − s) / C(n, m),
where C(a, b) is my one-dimensional notation for the binomial coefficient "a choose b." The argument for this formula is that
  • C(r, s) is the number of ways of choosing s white swans out of a total of r white swans.
  • C(n − r, m − s) is the number of ways of choosing the remaining m − s swans in the sample from the remaining n − r swans in the population.
  • C(n, m) is the total number of ways to sample m swans from a population of n.
The numerator thus counts the number of ways to select the sample so that it respects the constraint set by the number s, while the denominator counts the number of samples with or without this constraint.
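
To see the formula in action, here is a minimal Python sketch; the function name hyper and the numbers n = 10, r = 4, m = 5 are my own, purely illustrative choices. It computes Pr( s | n, m, r ) directly from binomial coefficients and checks that the probabilities sum to one over s:

    from fractions import Fraction
    from math import comb

    def hyper(s, n, m, r):
        # Pr(s | n, m, r): chance of s white swans in a sample of m, drawn
        # from a population of n swans of which r are white.
        return Fraction(comb(r, s) * comb(n - r, m - s), comb(n, m))

    n, r, m = 10, 4, 5   # illustrative numbers, not from the text
    assert sum(hyper(s, n, m, r) for s in range(m + 1)) == 1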

Inverse Hypergeometric Probabilities

In general, binomial coefficients have the following two properties:
  • C(a, b) (a − b)  ==  C(a, b + 1) (b + 1)
  • C(a + 1, b + 1)  ==  C(a, b) (a + 1)/(b + 1)
We'll need both of these facts below. They can be shown directly by cancelling out factors in the factorial expression for the binomial coefficients.
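
Both identities are easy to confirm by brute force; the small check below runs over an arbitrary range of values of a and b:

    from math import comb

    # Brute-force check of the two binomial-coefficient identities.
    for a in range(12):
        for b in range(a + 1):
            assert comb(a, b) * (a - b) == comb(a, b + 1) * (b + 1)
            assert comb(a + 1, b + 1) * (b + 1) == comb(a, b) * (a + 1)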

One consequence is that Bayes' rule takes on a particularly simple form in the hypergeometric case. Since the prior on r is uniform, the normalizing constant in Bayes' rule is Σr Pr( s | n, m, r ); by the identity Σr C(r, s) C(n − r, m − s)  ==  C(n + 1, m + 1) and the second rule above, this constant equals (n + 1)/(m + 1). Consequently,
  • Pr( r | n, m, s )  ==  Pr( s | n, m, r ) (m + 1)/(n + 1)
  • Pr( s | n, m, r )  ==  Pr( r | n, m, s ) (n + 1)/(m + 1)
  • Pr( s + 1 | n, m + 1, r )  ==  Pr( r | n, m + 1, s + 1 ) (n + 1)/(m + 2)
The first two equalities are, of course, the same statement rearranged; the third is that statement applied to a sample of m + 1 swans containing s + 1 white ones. All three will come up below.
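
The first of these identities can be checked numerically: under a uniform prior on r, the sum of Pr( s | n, m, r ) over r should come out to (n + 1)/(m + 1). The sketch below does this for one illustrative choice of n, m, and s (again my own numbers, reusing the hyper function from the sketch above):

    from fractions import Fraction
    from math import comb

    def hyper(s, n, m, r):
        return Fraction(comb(r, s) * comb(n - r, m - s), comb(n, m))

    # Under a uniform prior on r, the normalizing constant of Bayes' rule
    # is the sum of Pr(s | n, m, r) over r, which should be (n + 1)/(m + 1).
    n, m, s = 10, 5, 2   # illustrative values, not from the text
    assert sum(hyper(s, n, m, r) for r in range(n + 1)) == Fraction(n + 1, m + 1)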

By using the first of the two rules for binomial coefficients (once on C(r, s) (r − s) and once on C(n, m) (n − m)), we can also show that
Pr( s | n, m, r ) (r − s)/(n − m)  ==  Pr( s + 1 | n, m + 1, r ) (s + 1)/(m + 1)
Combining this with the last of the three Bayes'-rule identities above, we also get
Pr( s | n, m, r ) (r − s)/(n − m)  ==  Pr( r | n, m + 1, s + 1 ) (n + 1)(s + 1) / ((m + 1)(m + 2))
The factors of (n + 1) and (m + 1) will cancel against the constant from Bayes' rule in the final step below. I will use this fact there.
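
Here is a quick numerical check of both of these last two identities, again for arbitrary illustrative values of n, m, s, and r (the helper names hyper and posterior are mine):

    from fractions import Fraction
    from math import comb

    def hyper(s, n, m, r):
        return Fraction(comb(r, s) * comb(n - r, m - s), comb(n, m))

    def posterior(r, n, m, s):
        # Pr(r | n, m, s) under a uniform prior on r, as derived above.
        return hyper(s, n, m, r) * Fraction(m + 1, n + 1)

    n, m, s, r = 10, 5, 2, 6   # illustrative values, not from the text
    lhs = hyper(s, n, m, r) * Fraction(r - s, n - m)
    assert lhs == hyper(s + 1, n, m + 1, r) * Fraction(s + 1, m + 1)
    assert lhs == posterior(r, n, m + 1, s + 1) * Fraction((n + 1) * (s + 1), (m + 1) * (m + 2))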

Expanding the Predictive Probability

By assumption, we have inspected s out of the r white swans, so there are r − s white swans left. We have further inspected m out of the n swans, so there is a total of n − m swans left. The probability that the next swan will be white is thus (r − s)/(n − m).

If we call this event q, then we have, by the sum rule of probability,
Pr( q | n, m, s )  ==   Σr Pr( q, r | n, m, s )
By the chain rule of probabilities, we further have
Pr( q | n, m, s )  ==   Σr Pr( q | n, m, s, r ) Pr( r | n, m, s )
As argued above, we have
  • Pr( q | n, m, s, r )  ==  (r − s)/(n − m)
  • Pr( r | n, m, s )  ==  Pr( s | n, m, r ) (m + 1)/(n + 1)
  • Pr( s | n, m, r ) (r − s)/(n − m)  ==  Pr( r | n, m + 1, s + 1 ) (n + 1)(s + 1) / ((m + 1)(m + 2))
Putting these facts together and cancelling the factors of (n + 1) and (m + 1), we get
Pr( q | n, m, s )  ==   (s + 1)/(m + 2) Σr Pr( r | n, m + 1, s + 1 )
I have pulled the constant factors out of the summation here. Notice further that the summation is a sum of probabilities over the possible values of r. It must consequently sum to 1. We thus have
Pr( q | n, m, s )  ==   (s + 1) / (m + 2)
as we wanted to prove.
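
As a final sanity check, the following sketch computes the predictive probability exactly by summing over r and confirms that it equals (s + 1)/(m + 2) for several population sizes n. The sample numbers and the helper names are, again, just my own illustrative choices:

    from fractions import Fraction
    from math import comb

    def hyper(s, n, m, r):
        return Fraction(comb(r, s) * comb(n - r, m - s), comb(n, m))

    def predictive(n, m, s):
        # Pr(next swan is white | n, m, s): weight (r - s)/(n - m) by the
        # posterior Pr(r | n, m, s) == Pr(s | n, m, r) (m + 1)/(n + 1), sum over r.
        return sum(Fraction(r - s, n - m) * hyper(s, n, m, r) * Fraction(m + 1, n + 1)
                   for r in range(n + 1))

    m, s = 5, 2                    # illustrative sample, not from the text
    for n in (6, 10, 50, 200):     # the answer should not depend on n
        assert predictive(n, m, s) == Fraction(s + 1, m + 2)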

Wednesday, June 20, 2012

Haun and Call: "Great apes’ capacities to recognize relational similarity" (2009)

In an attempt to map the evolutionary history of analogical reasoning, this paper compares the performance of children and apes on a selection task. The task is supposed to measure the ability to recognize "relational similarity," i.e., to identify a cup by its position relative to other cups.

The Chimpanzee Experiment

The set-up is this: An experimenter hides an object in one of three cups at his end of the table; the participant then has to find an identical object in one of three cups at the other end. However, the three cups are closer together at the experimenter's end of the table, and they are also aligned so that proximity clues conflict with relative-position clues:


There are two additional conditions that I am not considering here: one in which the three cups are connected with plastic tubes, suggesting that the object can roll from one cup to another, and another in which they are connected by strips of gray tape, suggesting a structural parallel.

In the task with no clues, chimpanzees seem to choose the "right" cup above chance levels (p. 155). This is taken as evidence of an ability to recognize relational similarity.

A Bias Reading

I personally feel a little reluctant to count the cup on the far left as the only "correct" choice, although it will certainly come to appear more and more so as the system gets established through repeated trials.

However, the proximity argument suggesting the "wrong" cup is not necessarily a wrong argument in all real-life situations. What the experiment shows is thus, I think, that chimpanzees have a taxonomic bias that orangutans do not, i.e., chimps prefer one-to-one mappings even when these conflict with proximity clues.

Children, on the other hand, show a marked bias towards the middle cup wherever the experimenter hides the target object. None of the apes seem to have this symmetry bias.

A Possible Process Analysis

A hierarchical Bayesian model might peel these biases apart by identifying the following levels of modeling:
  1. Pr(a), the absolute probability of finding the target object in cup a, regardless of where the experimenter hid it.
  2. Pr(a|x), the conditional probability of finding the object in cup a, given that it was hidden in cup x.
  3. Pr(a|x,m), the conditional probability of finding the object in cup a, given that it was hidden in cup x and that the experimenter is using a mapping m.
All of these layers can in principle be informed by prior knowledge. For instance, (1) might exhibit a symmetry bias, (2) a proximity bias, and (3) a taxonomic bias.
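
To make the decomposition a bit more tangible, here is a minimal sketch of how the three levels could be wired together. The two candidate mappings, the prior weights, and all numbers are made up for illustration; they are not estimates from the paper.

    # Three levels of modeling, with made-up numbers: a prior over mappings m,
    # the conditional Pr(a | x, m), and the marginals Pr(a | x) and Pr(a).

    CUPS = ["left", "middle", "right"]

    # Two hypothetical mappings from the experimenter's cups to the participant's
    # cups: a one-to-one ("taxonomic") mapping and a proximity-based mapping.
    MAPPINGS = {
        "one-to-one": {"left": "left", "middle": "middle", "right": "right"},
        "proximity":  {"left": "left", "middle": "left",   "right": "middle"},
    }

    prior_m = {"one-to-one": 0.7, "proximity": 0.3}   # taxonomic bias: extra weight on one-to-one
    prior_x = {"left": 1/3, "middle": 1/3, "right": 1/3}

    def p_a_given_x_m(a, x, m):
        # Level 3: Pr(a | x, m) -- here simply deterministic given the mapping.
        return 1.0 if MAPPINGS[m][x] == a else 0.0

    def p_a_given_x(a, x):
        # Level 2: Pr(a | x), with the mapping summed out.
        return sum(p_a_given_x_m(a, x, m) * pm for m, pm in prior_m.items())

    def p_a(a):
        # Level 1: Pr(a), with the hiding place summed out.  A symmetry bias
        # would show up as extra weight on the middle cup at this level.
        return sum(p_a_given_x(a, x) * px for x, px in prior_x.items())

    assert abs(sum(p_a(a) for a in CUPS) - 1.0) < 1e-9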

Monday, June 11, 2012

Perfors, Tenenbaum, Griffiths, Xu: "A tutorial introduction to Bayesian models of cognitive development" (2011)

This is an easily readable introduction to the idea behind Bayesian learning models, especially hierarchical Bayesian models. The mathematical details are left out, but the paper "Word Learning As Bayesian Inference" (2007) is cited as a more detailed account.

I remember Noah Goodman giving a tutorial on Bayesian models at ESSLLI 2010. The toolbox for that course was centered on the special-purpose probabilistic programming language Church. I can see now that a number of video lectures by Goodman as well as Josh Tenenbaum and others are available at the website of a 2011 UCLA summer school.

The most interesting parts of the paper are, for my purposes, sections 2, 3, and 4. These are the sections most directly devoted to giving the reader intuitions about the ins and outs of hierarchical Bayesian models.

There, the basic idea is nicely explained with a bit of bean-bag statistics borrowed from philosopher Nelson Goodman's Fact, Fiction and Forecast (1955):
Suppose we have many bags of colored marbles and discover by drawing samples that some bags seem to have black marbles, others have white marbles, and still others have red or green marbles. Every bag is uniform in color; no bag contains marbles of more than one color. If we draw a single marble from a new bag in this population and observe a color never seen before – say, purple – it seems reasonable to expect that other draws from this same bag will also be purple. Before we started drawing from any of these bags, we had much less reason to expect that such a generalization would hold. The assumption that color is uniform within bags is a learned overhypothesis, an acquired inductive constraint. (p. 308 in the published version)
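
To see how such an overhypothesis could be acquired, here is a toy calculation in the same spirit. It is much cruder than anything in the paper, and all of the numbers and hypothesis names are made up: two overhypotheses about how bags are filled are weighed against each other after a few uniformly colored bags have been seen.

    from fractions import Fraction

    # Two overhypotheses about how bags are filled: UNIFORM (each bag holds a
    # single color) and MIXED (each marble's color is independent and uniform
    # over C colors).  Start from a 50/50 prior; the numbers are made up.
    C = 10     # hypothetical number of possible colors
    d = 4      # draws taken from each previously inspected bag

    def posterior_after(k):
        # Posterior over the overhypotheses after k bags have each turned out
        # to be uniform in color across their d draws.
        likelihood = {
            "UNIFORM": Fraction(1),                      # uniform bags always look uniform
            "MIXED":   Fraction(1, C) ** ((d - 1) * k),  # chance of k monochrome bags by accident
        }
        weights = {h: Fraction(1, 2) * likelihood[h] for h in likelihood}
        z = sum(weights.values())
        return {h: w / z for h, w in weights.items()}

    def match_probability(k):
        # Probability that a second draw from a brand-new bag matches the first.
        post = posterior_after(k)
        return post["UNIFORM"] * 1 + post["MIXED"] * Fraction(1, C)

    print(float(match_probability(0)))   # 0.55  -- before seeing any bags
    print(float(match_probability(5)))   # ~1.0  -- after five uniformly colored bags
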
The paper appeared in a special issue of Cognition dedicated to probabilistic models of cognition. There are a number of other papers in the same issue that seem very interesting.