Monday, July 28, 2014

Baker, Saxe, and Tenenbaum: "Bayesian Theory of Mind" (2011)

I was referred to this paper by one of the reviewers of my own paper on multi-agent statistics, since both seemed to be about reasoning about other people's beliefs. Their paper is concerned with inferring one other person's belief-desire state from observed behavior, not with reasoning of arbitrarily high order (like I know that you know that I know…). This means that there are some issues in general multi-agent statistics that they don't have to worry about.

At any rate, the set-up they consider is the following:
  • The subject observes a little stylized scene comparable to a simple video game interface.
  • This scene features a little cartoonish character (a circle with two eyes).
  • Depending on where the character is standing, different parts of the scene will be blocked from view.
  • The scene contains two parking lots in which food trucks can park.
  • There are three different kinds of food truck that may or may not be present in the scene.
  • In each condition, the subject sees the little character move around in a certain predetermined way.
After watching this scene, the subject is asked to assess the little character's beliefs and food preferences. For instance, if the character could initially see a truck with Korean food but still walked over to check the other parking lot, this must be because the character was wondering whether a more preferred kind of food was available.

The character sees the Korean truck, yet walks up and checks what else is available.

In the model, these videos were discretized and treated as a hidden Markov model with the desires (food preference ordering) and beliefs (about which trucks there might be in the parking lots) as hidden states.

At each time step, previous actions inform new states of the world.

Since there was actually quite a small space of possible routes and plans, the model could in fact have been simplified immensely in this case, although at the expense of generalizability.
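
Just to make the inference step concrete, here is a minimal Python sketch of how such a posterior over desires could be computed by brute-force Bayes when the plan space is this small. The utilities, the walking cost, the softmax temperature, and the belief about the hidden lot are all hypothetical stand-ins of mine, not the paper's actual model:

    from itertools import permutations
    from math import exp

    TRUCKS = ["korean", "lebanese", "mexican"]
    SPOT_B = ["lebanese", "mexican", "empty"]  # possible contents of the hidden lot

    def utility(plan, prefs, spot_b):
        """Payoff of a plan given a preference ordering and the hidden lot's content.
        Rank utilities: favorite = 2, second = 1, least = 0; the detour costs 0.5."""
        rank = {truck: 2 - i for i, truck in enumerate(prefs)}
        if plan == "go_to_korean":
            return rank["korean"]
        # "check_other_lot": walk over, then eat whatever turns out to be best
        return max(rank.get(spot_b, 0), rank["korean"]) - 0.5

    def policy(plan, prefs, temp=1.0):
        """Softmax-rational plan choice, averaging over beliefs about the hidden lot."""
        def expected_u(p):
            return sum(utility(p, prefs, b) for b in SPOT_B) / len(SPOT_B)
        us = {p: expected_u(p) for p in ("go_to_korean", "check_other_lot")}
        z = sum(exp(u / temp) for u in us.values())
        return exp(us[plan] / temp) / z

    # Posterior over preference orderings after seeing the character check the
    # other lot (the uniform prior over orderings cancels in the normalization):
    desires = list(permutations(TRUCKS))
    post = {d: policy("check_other_lot", d) for d in desires}
    z = sum(post.values())
    for d, p in sorted(post.items(), key=lambda kv: -kv[1]):
        print(" > ".join(d), round(p / z, 3))

As expected, the orderings that rank Korean food last soak up most of the posterior mass: only a character hoping for something better has any reason to check the other lot.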

Saturday, July 26, 2014

Wald: Sequential Analysis (1947)

I've always thought that Shannon's insights in the 1940s papers seemed like pure magic, but this book suggests that parts of his way of thinking were already in the air at the time: from his own frequentist perspective, Wald comes oddly close to defining a version of information theory.

The central question of the book is:
How can we devise decision procedures that map observations into {Accept, Reject, Continue} in such a way that (1) the probability of wrongly choosing Accept or Reject is low, and (2) the expected number of Continue decisions is low?
Throughout the book, the answer that he proposes is to use likelihood ratio tests. This puts him strangely close to the Bayesian tradition, including for instance Chapter 5 of Jeffreys' Theory of Probability.

A sequential binomial test ending in rejection (p. 94)

In particular, the coin flipping example that Jeffreys considers in his Chapter 5.1 is very close in spirit to the sequential binomial test that Wald considers in Chapter 5 of Sequential Analysis. There are some differences, however:
  • Wald compares two given parameter values p0 and p1, while Jeffreys compares a model with a free parameter to one with a fixed value for that parameter.
  • Jeffreys assigns prior probabilities to everything, while Wald only uses the likelihoods given the two parameter values. From Wald's perspective, the statistical test will thus have different characteristics depending on what the underlying situation is, and he performs no averaging over these cases.
This last point also means that Wald is barred from actually inventing the notion of mutual information, although he comes very close. Since he cannot take a single average over the log-likelihood ratio, he cannot compute any single summary statistic, but always has to bet on two horses simultaneously.
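
For concreteness, here is a minimal Python sketch of such a sequential binomial test, mapping observations into {Accept, Reject, Continue} with Wald's approximate thresholds A = (1 – β)/α and B = β/(1 – α); the particular parameter values are just for illustration:

    from math import log

    def sprt(observations, p0=0.4, p1=0.6, alpha=0.05, beta=0.05):
        """Wald's sequential probability ratio test for a Bernoulli parameter:
        accumulate the log-likelihood ratio log[P(x | p1) / P(x | p0)] and stop
        as soon as it leaves the interval (log B, log A)."""
        upper = log((1 - beta) / alpha)   # log A: accept p1 above this
        lower = log(beta / (1 - alpha))   # log B: accept p0 below this
        llr = 0.0
        for n, x in enumerate(observations, start=1):
            llr += log(p1 / p0) if x else log((1 - p1) / (1 - p0))
            if llr >= upper:
                return "Reject", n    # reject H0: p = p0 in favor of p1
            if llr <= lower:
                return "Accept", n    # accept H0: p = p0
        return "Continue", len(observations)

    # A run with many successes drifts upward and ends in rejection:
    print(sprt([1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1]))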

Thursday, July 24, 2014

Wolpert: "The Lack of A Priori Distinctions Between Learning Algorithms" (1996)

I think this might be one of the most oversold math papers I've ever read. The core idea of the paper is a tiny little nugget of common sense buried under a mountain of notation. This not only makes the paper hard to read, but also camouflages the heavy-handed assumptions that are necessary to make the argument work.

No Lunch At All

The learning scenario that Wolpert studies in the paper is one in which we want to guess the values of a function f: X → Y when evaluated on input values that did not occur in our training data. We do so by selecting an estimate h: X → Y, hoping that h(x) will agree with f(x) on the points x we haven't yet seen.

If there are no restrictions on how f can be chosen, and all of the exponentially many underlying functions are possible, then no such generalization is possible. Whatever h we choose, Wolpert can always choose f so as to disagree with every single choice that our h makes outside the training set, or (if he should so desire) so as to agree with all of them.

In other words, if everything is possible, then there is no statistical dependency between the past and the future. Obviously, this means that learning is impossible.

Extremely Large and Incredibly Bold

Let me just elaborate on this last reformulation a bit more:

Suppose your data consists of a binary sequence of length m, and your goal is to predict the next n binary digits in the sequence. If all of the 2^(n+m) possible sequences are in your hypothesis space, then the mutual information between the test and training set is
I(train; test)  =  H(test)  –  H(test | train)  =  log₂ 2^n  –  log₂(2^(n+m)/2^m)  =  n  –  n  =  0.
In such a case, you might as well guess randomly. However, this only follows because the set of possible sequences has an entropy rate of 1, and because you only care about exactly correct guesses (otherwise the sequence 1/2, 1/2, 1/2, … might outperform random 0/1 guessing).
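
The claim is easy to check numerically. Here is a small Python verification that brute-forces the entropies over all 2^(n+m) equally likely sequences (the values of m and n are arbitrary):

    from collections import Counter
    from itertools import product
    from math import log2

    def mutual_information(m, n):
        """I(train; test) when all 2^(m+n) binary sequences are equally likely."""
        seqs = list(product([0, 1], repeat=m + n))
        p = 1 / len(seqs)                               # uniform joint probability
        joint = Counter((s[:m], s[m:]) for s in seqs)
        train = Counter(s[:m] for s in seqs)
        test = Counter(s[m:] for s in seqs)
        return sum(c * p * log2(c * p / (train[a] * p * test[b] * p))
                   for (a, b), c in joint.items())

    print(mutual_information(m=3, n=2))   # 0.0 -- the past tells you nothing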

If It's Independent, It's Independent

To say this in a more Wolpert way, let's make the following definitions:
  • A loss measure L is "homogeneous" if the sum of the loss probabilities given the correct answers, that is,
    Σy Pr{L = c | y},
    is independent of our choice of the estimate h.
  • For finite |X| and |Y|, we can define a learning problem as uniform if all of the |Y|^|X| possible functions from X to Y are equally probable.
  • When we use a homogeneous loss function in a uniform learning problem, then
    Pr{L = c}  =  Σy Pr{L = c | y} Pr{y}  =  1/|Y| Σy Pr{L = c | y}
    is independent of h (by homogeneity), and therefore E[L | h] = E[L].
Combining homogeneity and uniformity is thus the same as assuming that the loss is independent of the learning algorithm.

The Don't-Get-Cute Condition

The zero-one loss function (which gives you one point for each correct answer) is "homogeneous" in this sense, but it is pretty much the only one. For most other loss functions, it will often make sense to use some central value as your guess for f(x), since extreme guesses are penalized more heavily under a uniform selection scheme.

As an example, consider the loss function L(f(x), h(x)) = |f(x) – h(x)|, and suppose we are trying to guess an unknown number in the set {1, 2, 3, 4, 5}. Then the probability that L = 4, for instance, depends heavily on your estimator h, since an estimate that never returns the values 1 or 5 can never achieve this loss. Absolute deviation is therefore not "homogeneous," and neither is ordinary squared deviation.
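
A quick enumeration makes the contrast vivid. In the sketch below, the unknown value is drawn uniformly from {1, 2, 3, 4, 5}, and the estimators always guess a fixed value (purely for illustration); the zero-one loss then has the same distribution for every estimator, while the absolute loss does not:

    from fractions import Fraction

    Y = [1, 2, 3, 4, 5]

    def loss_distribution(loss, h):
        """Pr{L = c} for each c when y is uniform on Y and the estimate is h."""
        dist = {}
        for y in Y:
            c = loss(y, h)
            dist[c] = dist.get(c, 0) + Fraction(1, len(Y))
        return dist

    def zero_one(y, h):
        return int(y != h)

    def absolute(y, h):
        return abs(y - h)

    for h in (1, 3):
        print("h =", h, "zero-one:", loss_distribution(zero_one, h))
        print("h =", h, "absolute:", loss_distribution(absolute, h))

    # Zero-one loss: Pr{L = 0} = 1/5 and Pr{L = 1} = 4/5, no matter what h is.
    # Absolute loss: guessing h = 3 can never incur a loss of 4, but h = 1 can.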

The Maximally Expensive Lunch

Naturally, Wolpert wants to discuss the relation between his own paper and the uniform law of large numbers, proven by Vapnik and Chervonenkis in the late '60s.

He thus states that if we are using a homogeneous loss function in a uniform learning problem, the empirical loss on the training data is statistically independent of the loss on the test data. This should be obvious, since he is effectively assuming that the loss has a fixed distribution independent of the learning method.

Unfortunately, he goes on to state that this has "strong implications for the uniform convergence formalism for studying supervised learning" (p. 1369), although he is in fact just giving a finite-sized example of that very theory. He elaborates:
Assume that our learning algorithm has a very low VC dimension. Since [the empirical error] is low and [the sample size] is large, we might hope that our generalization error will be low, independent of any assumptions concerning the target. (This is one common way people try to interpret the VC theorems.) 
However according to the results presented above, low [empirical error], large [sample size], and low VC dimension, by themselves, provide no such assurances concerning the [off-the-training-set] error (unless one can somehow a priori rule out a uniform [prior over the possible generalizations]—not to mention rule out any other prior having even more dire implications for generalization performance). (p. 1369)
But Wolpert is confusing two criteria of success here. There is a difference between identifying a long-run frequency — which is the goal in both Bernoulli's theorem and the VC theorem — and predicting the exact outcomes of a series of experiments.

The various laws of large numbers give worst-case guarantees against misidentifying a random process, but they don't say that you'll eventually be able to predict it. Even if you know with certainty that some process is a series of independent coin flips, you have no chance of guessing the next outcome above chance level.

This is not because low empirical error on the frequency estimate mischaracterizes your error on the test data. Your empirical error is just going to be uniformly high when the input sequence is a series of coin flips (assuming that all your hypotheses model the data-generating process as i.i.d.). So unless you introduce unwarranted dependence assumptions, the data will quickly reveal that there is no structure to be discovered, and thus that no hypothesis will ever do well at exact prediction.
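
A tiny simulation illustrates the gap between the two criteria of success: with a fair coin, the frequency estimate converges, while any prediction rule (here a majority-vote guesser, chosen arbitrarily) stays at chance level:

    import random

    random.seed(0)
    flips = [random.randint(0, 1) for _ in range(100_000)]

    heads = 0
    correct = 0
    for seen, x in enumerate(flips):
        guess = 1 if 2 * heads > seen else 0   # predict the majority side so far
        correct += (guess == x)
        heads += x

    print("frequency estimate: ", heads / len(flips))    # ~0.5: identification succeeds
    print("prediction accuracy:", correct / len(flips))  # ~0.5: prediction is at chance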

Tuesday, July 22, 2014

Jaffe and Quinn: "Theoretical Mathematics" (1993)

Shaking the moralistic finger at sloppy mathematics, this paper suggests that the field should be subjected to certain "prescriptions" in order to avoid errors, priority disputes, and confusions about the reliability of proofs.

As a means to this end, Jaffe and Quinn suggest that we understand the activity of writing mathematical proofs according to the analogy
theoretical physics : experimental physics = speculative math : "rigorous" math
This leads them to adopt the odd and misleading term "theoretical mathematics" for what is essentially mathematics without detailed proofs. They suggest that some time in the future, "theoretical mathematics" might branch off from rigorous mathematics and become a discipline of its own, like theoretical physics.

This seems to misconstrue the motivations people have for writing heuristic or intuitive proofs in mathematics.

Most professional mathematicians prefer to give detailed and formal proofs whenever they can, and they only deviate from that norm when they absolutely have to. For a theoretical physicist, on the other hand, there is no shame in having no experimental expertise. The analogy thus seems quite misleading, and trying to drive a wedge between the practice of conjecturing and the practice of proving mathematical claims consequently misconstrues the ethos of mathematics.

Another thing which is never dwelled upon in the paper (but mentioned in Jeremy Gray's response to the paper) is that even "rigorous" proofs can contain mistakes.

This can be the case even for computer-verified proofs, since there is a very realistic chance of bugs, typos, or even misinterpretations of the code (e.g., subtle differences between an axiom system and the geometric domain it is intended to formalize).

When Jaffe and Quinn assume that we can recognize a "rigorous" proof when we see it, and that rigorous proofs are always correct, they are thus thinking in terms of a perfect-world mathematics, not actual-world mathematics. As a consequence, they imagine that the whole enterprise of mathematics could be "corrected" into a steadily advancing literature of absolutely certain theorems piled on top of each other.

Of course, everybody wants to give the best arguments they can, and honesty about the quality of the argument would be laudable. But in a world where quality is gradable, it would seem that any step taken toward a realization of Jaffe and Quinn's dream would kill off the whole range of mutant forms in between complete guesswork and completely "rigorous" math (however we measure rigor). Since some of these mutants would have gone on to become the standard of future mathematics, this would almost certainly be a scientific loss, in addition to being very boring.