
Thursday, July 24, 2014

Wolpert: "The Lack of A Priori Distinctions Between Learning Algorithms" (1996)

I think this might be one of the most oversold math papers I've ever read. The core idea of the paper is a tiny little nugget of common sense buried under a mountain of notation. This not only makes the paper hard to read, but also camouflages the heavy-handed assumptions that are necessary to make the argument work.

No Lunch At All

The learning scenario that Wolpert studies in the paper is one in which we want to guess the values of a function f: X → Y when evaluated on input values that did not occur in our training data. We do so by selecting an estimate h: X → Y, hoping that h(x) will agree with f(x) on the points x we haven't yet seen.

If there are no restrictions on how f can be chosen, and all of the exponentially many underlying functions are possible, then no such generalization is possible. Whatever h we choose, Wolpert can always choose f so as to disagree with every single choice that our h makes outside the training set, or (if he should so desire) so as to agree with all of them.

In other words, if everything is possible, then there is no statistical dependency between the past and the future. Obviously, this means that learning is impossible.

Extremely Large and Incredibly Bold

Let me elaborate on this last reformulation a bit more:

Suppose your data consists of a binary sequence of length m, and your goal is to predict the next n binary digits in the sequence. If all of the 2^(n+m) possible sequences are in your hypothesis space and equally probable, then the mutual information between the test and training set is
I(train; test)  =  H(test)  –  H(test | train)  =  n  –  n  =  0,
since each of the 2^n possible continuations remains equally likely no matter which m digits you have already seen.
In such a case, you might as well guess randomly. However, this only follows because the set of possible sequences has an entropy rate of 1, and because you only care about exactly correct guesses (otherwise the constant guess 1/2, 1/2, 1/2, … might outperform random 0/1 guessing).
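
To make this concrete, here is a tiny Python sketch with toy values of m and n (the numbers are mine, not Wolpert's): under a uniform prior over all binary sequences, the entropy of the test bits is the same whether or not we condition on the training bits.

```python
import itertools
import math
from collections import Counter

# Toy illustration: with a uniform prior over all 2**(m+n) binary sequences,
# the first m bits carry no information about the last n bits.
m, n = 3, 2  # hypothetical training and test lengths

sequences = list(itertools.product([0, 1], repeat=m + n))
p = 1.0 / len(sequences)  # uniform prior over sequences

# H(test): entropy of the last n bits.
test_marginal = Counter(seq[m:] for seq in sequences)
H_test = -sum(c * p * math.log2(c * p) for c in test_marginal.values())

# H(test | train): expected entropy of the last n bits given the first m bits.
H_test_given_train = 0.0
train_marginal = Counter(seq[:m] for seq in sequences)
for prefix, c_prefix in train_marginal.items():
    cond = Counter(seq[m:] for seq in sequences if seq[:m] == prefix)
    H_cond = -sum((c / c_prefix) * math.log2(c / c_prefix) for c in cond.values())
    H_test_given_train += c_prefix * p * H_cond

print(H_test, H_test_given_train, H_test - H_test_given_train)  # 2.0 2.0 0.0
```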

If It's Independent, It's Independent

To say this in a more Wolpert way, let's make the following definitions:
  • A loss measure L is "homogeneous" if the sum of the loss probabilities over the possible correct answers,
    Σy Pr{L = c | y},
    does not depend on our choice of the estimate h.
  • For finite |X| and |Y|, we can define a learning problem as uniform if all of the |Y|^|X| possible functions from X to Y are equally probable.
  • When we use a homogeneous loss function in a uniform learning problem, then
    Pr{L = c}  =  Σy Pr{L = c | y} Pr{y}  =  1/|Y| Σy Pr{L = c | y}
    is independent of h, and therefore E[L | h] = E[L].
Combining homogeneity and uniformity is thus the same as assuming that the loss is independent of the learning algorithm.
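
A minimal enumeration (with a made-up X and Y of my own) shows what this combination amounts to: under a uniform prior over all possible target functions, the zero-one loss at an unseen point has the same distribution whatever we guess.

```python
import itertools

# Enumerate all |Y|^|X| target functions and check that the probability of a
# correct guess at an off-training-set point does not depend on the guess.
X = [0, 1, 2]   # suppose x = 2 never occurred in the training data
Y = [0, 1, 2]
functions = list(itertools.product(Y, repeat=len(X)))  # all targets f: X -> Y

def pr_correct(guess, x_star=2):
    """Pr{f(x_star) = guess} under a uniform prior over the target functions."""
    return sum(f[x_star] == guess for f in functions) / len(functions)

print([pr_correct(g) for g in Y])  # [0.333..., 0.333..., 0.333...]
# The expected zero-one loss is the same for every estimator: E[L | h] = E[L].
```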

The Don't-Get-Cute Condition

The zero-one loss function (which charges you one point for each incorrect answer) is "homogeneous" in this sense, but it is pretty much the only one. For most other loss functions, it will often make sense to use some central value as your guess for f(x), since extreme guesses are penalized more heavily under a uniform selection scheme.

As an example, consider the loss function L(f(x), h(x)) = |f(x) – h(x)|, and suppose we are trying to guess an unknown number in the set {1, 2, 3, 4, 5}. Then the probability that L = 4, for instance, depends heavily on your estimator h, since an estimate that never returns the values 1 or 5 can never achieve this loss. Numerical deviation is therefore not "homogeneous," and neither is the ordinary squared deviation.
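
The same kind of enumeration shows the contrast (again with my own toy numbers): under a uniform choice of f, the distribution of the absolute deviation does depend on the estimator.

```python
# Pr{|f - h| = c} when f is drawn uniformly from {1, ..., 5}.
values = [1, 2, 3, 4, 5]

def pr_loss(h, c):
    return sum(abs(f - h) == c for f in values) / len(values)

print(pr_loss(3, 4))  # 0.0 -- a middle guess can never be off by four
print(pr_loss(1, 4))  # 0.2 -- an extreme guess can
# The loss distribution shifts with h, so absolute deviation is not homogeneous.
```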

The Maximally Expensive Lunch

Naturally, Wolpert wants to discuss the relation between his own paper and the uniform law of large numbers, proven by Vapnik and Chervonenkis in the late '60s.

He thus states that if we are using a homogeneous loss function in a uniform learning problem, the empirical loss on the training data is statistically independent of the loss on the test data. This should be obvious, since he is effectively assuming that the loss has a fixed distribution independent of the learning method.

Unfortunately, he goes on to state that this has "strong implications for the uniform convergence formalism for studying supervised learning" (p. 1369), although he is in fact just giving a finite-sized example of that very theory. He elaborates:
Assume that our learning algorithm has a very low VC dimension. Since [the empirical error] is low and [the sample size] is large, we might hope that our generalization error will be low, independent of any assumptions concerning the target. (This is one common way people try to interpret the VC theorems.) 
However according to the results presented above, low [empirical error], large [sample size], and low VC dimension, by themselves, provide no such assurances concerning the [off-the-training-set] error (unless one can somehow a priori rule out a uniform [prior over the possible generalizations]—not to mention rule out any other prior having even more dire implications for generalization performance). (p. 1369)
But Wolpert is confusing two criteria of success here. There is a difference between identifying a long-run frequency — which is the goal in both Bernoulli's theorem and the VC theorem — and predicting the exact outcomes of a series of experiments.

The various laws of large numbers give worst-case guarantees against misidentifying a random process, but they don't say that you'll eventually be able to predict it. Even if you know with certainty that some process is a series of independent coin flips, you have no way of guessing the next outcome at better than chance level.

This is not because a low empirical error on the frequency estimate mischaracterizes your error on the test data. Your empirical error is simply going to be uniformly high when the input sequence is a series of coin flips (assuming that all your hypotheses model the data-generating process as i.i.d.). So unless you introduce unwarranted dependence assumptions, the data will quickly reveal that there is no structure to be discovered, and thus that no hypothesis will ever do well on exact prediction.
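
A quick simulation (with parameters of my own choosing) illustrates the gap between the two criteria of success: the long-run frequency of a fair coin can be identified very precisely while exact prediction stays at chance.

```python
import random

random.seed(0)
flips = [random.randint(0, 1) for _ in range(10_000)]

# Frequency identification: the estimate converges on 0.5.
freq_estimate = sum(flips) / len(flips)

# Exact prediction: always guess the majority outcome.
guess = 1 if freq_estimate > 0.5 else 0
accuracy = sum(guess == f for f in flips) / len(flips)

print(abs(freq_estimate - 0.5))  # small: the process is identified correctly
print(accuracy)                  # around 0.5: prediction stays at chance
```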

Tuesday, June 12, 2012

Tenenbaum and Xu: "Word Learning as Bayesian Inference" (2007)

Joshua Tenenbaum and Fei Xu report some experimental findings on concept learning and simulate them in a computational model based on Bayesian inference. The paper refers to a 1999 paper by Tenenbaum for the mathematical background.

Elements of the Model

The idea behind the model is that the idealized learner picks a hypothesis (a concept extension, a set of objects) based on a finite set of examples. In their experiments, the training sets always consist of either one or three examples. There are 45 objects in the "world" in which the learner lives: some vegetables, some cars, and some dogs.

As far as I understand, the prior probabilities fed into the computational model were based on human similarity judgments. This is quite problematic, as similarity can reasonably be seen as a dual of categories (with being-similar corresponding to being-in-the-same-category). So if I've gotten this right, then the answer is to some extent already built into the question.

Variations

A number of tweaks are further applied to the model:
  • The priors of the "basic-level" concepts (dog, car, and vegetable) can be manually increased to introduce a bias towards this level. This increases the fit immensely.
  • The priors of groups with high internal similarity (relative to the nearest neighbor) can be increased to introduce a bias towards coherent and separated categories. Tenenbaum and Xu call this the "size principle."
  • When applying what it has learned, the learner can either average the predictions of all hypotheses, using the model posteriors as weights, or simply pick the most likely model and forget about the rest. The latter corresponds to crisp rule-learning, and it gives suboptimal results in the one-example cases (see the sketch below).
I still have some methodological problems with the idea of a "basic level" in our conceptual system. Here as elsewhere, I find it question-begging to assume a bias towards this level of categorization.
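
To see why the rule-learning reading struggles with single examples, here is a stripped-down sketch of the averaging-versus-picking contrast. The hypotheses, objects, and priors are invented for illustration; this is not Tenenbaum and Xu's actual model.

```python
# Three nested hypotheses about the extension of a new word.
hypotheses = {
    "dalmatians": {"dal1", "dal2", "dal3"},
    "dogs":       {"dal1", "dal2", "dal3", "poodle", "terrier"},
    "animals":    {"dal1", "dal2", "dal3", "poodle", "terrier", "cat", "pig"},
}
prior = {h: 1 / len(hypotheses) for h in hypotheses}

def posterior(examples):
    # Size principle: examples are treated as uniform samples from the
    # extension, so smaller hypotheses get a likelihood advantage.
    scores = {h: prior[h] * (1 / len(ext)) ** len(examples)
              if all(x in ext for x in examples) else 0.0
              for h, ext in hypotheses.items()}
    z = sum(scores.values())
    return {h: s / z for h, s in scores.items()}

def generalize(examples, new_object, rule_based=False):
    post = posterior(examples)
    if rule_based:  # pick the single most probable hypothesis
        best = max(post, key=post.get)
        return float(new_object in hypotheses[best])
    # otherwise average over all hypotheses, weighted by their posteriors
    return sum(p for h, p in post.items() if new_object in hypotheses[h])

print(generalize(["dal1"], "poodle"))                   # graded answer, ~0.5
print(generalize(["dal1"], "poodle", rule_based=True))  # all-or-nothing: 0.0
```

After a single dalmatian example, the averaging learner still assigns some probability to "poodle" falling under the word, whereas the rule-learner commits to the smallest hypothesis and rules it out entirely.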

Questions

I wonder how the model could be changed so as to
  • not have concept learning rely on preexisting similarity judgments;
  • take into account that similarity judgments vary with context.
Imagine a model that picked the dimensions of difference that were most likely to matter given a finite set of examples. Dimensions of difference are hierarchically ordered (e.g., European > Western European > Scandinavian), so it seems likely that something like the size principle could govern this learning method.

Monday, June 11, 2012

Perfors, Tenenbaum, Griffiths, Xu: "A tutorial introduction to Bayesian models of cognitive development" (2011)

This is an easily readable introduction to the idea behind Bayesian learning models, especially hierarchical Bayesian models. The mathematical details are left out, but the paper "Word Learning As Bayesian Inference" (2007) is cited as a more detailed account.

I remember Noah Goodman giving a tutorial on Bayesian models at ESSLLI 2010. The toolbox for that course centrally included the special-purpose programming language Church. I can see now that a number of video lectures by Goodman as well as Josh Tenenbaum and others are available at the website of a 2011 UCLA summer school.

The most interesting parts of the paper are, for my purposes, sections 2, 3, and 4. These are the ones most directly devoted to giving the reader intuitions about the ins and outs of hierarchical Bayesian models.

There, the basic idea is nicely explained with a bit of bean-bag statistics borrowed from philosopher Nelson Goodman's Fact, Fiction and Forecast (1955):
Suppose we have many bags of colored marbles and discover by drawing samples that some bags seem to have black marbles, others have white marbles, and still others have red or green marbles. Every bag is uniform in color; no bag contains marbles of more than one color. If we draw a single marble from a new bag in this population and observe a color never seen before – say, purple – it seems reasonable to expect that other draws from this same bag will also be purple. Before we started drawing from any of these bags, we had much less reason to expect that such a generalization would hold. The assumption that color is uniform within bags is a learned overhypothesis, an acquired inductive constraint. (p. 308 in the published version)
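
Here is a back-of-the-envelope version of that example in code. The model (a single concentration parameter controlling how uniform bags tend to be, inferred over a small grid) and all the numbers are my own; it is only meant to show the shape of overhypothesis learning, not to reproduce the tutorial's models.

```python
import math

K = 10                                    # possible marble colors
alphas = [0.1, 1.0, 10.0, 100.0]          # candidate concentration values
prior = {a: 1 / len(alphas) for a in alphas}

def log_marginal(counts, alpha):
    """Dirichlet-multinomial log-likelihood of one bag's color counts."""
    n = sum(counts)
    out = math.lgamma(alpha) - math.lgamma(alpha + n)
    for c in counts:
        out += math.lgamma(alpha / K + c) - math.lgamma(alpha / K)
    return out

# Twenty observed bags, each containing ten marbles of a single color.
bags = [[10 if k == i % K else 0 for k in range(K)] for i in range(20)]

log_post = {a: math.log(prior[a]) + sum(log_marginal(b, a) for b in bags)
            for a in alphas}
shift = max(log_post.values())
post = {a: math.exp(lp - shift) for a, lp in log_post.items()}
norm = sum(post.values())
post = {a: p / norm for a, p in post.items()}
print(post)  # nearly all of the mass ends up on the smallest alpha

# New bag, one purple marble seen: probability that the next marble is purple.
pred = sum(post[a] * (a / K + 1) / (a + 1) for a in alphas)
print(pred)  # close to 1 -- the learned overhypothesis does the work
```

Small values of alpha say "each bag is nearly single-colored," so after seeing many uniform bags, a single purple marble from a fresh bag is enough to expect more purple.
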
The paper appeared in a special issue of Cognition dedicated to probabilistic models of cognition. There are a number of other papers in the same issue that seem very interesting.

Tuesday, May 29, 2012

Staudacher: Use Theories of Meaning (2010)

By and large, Marc Staudacher endorses the use of evolutionary perspectives on signaling games in his dissertation.

However, in section 7.2.3, he reiterates his concern about a fact that he observed on his blog two years ago: Although there are models of signaling games with infinitely many signals, no model plausibly explains how syntactically structured signals can be aligned with semantically compositional meanings.

"But it seems to be more of a technical problem that will eventually be solved," he adds (pp. 214-15). That seems to be a sound enough intuition. The semantics of a language with finitely many atoms and finitely many relations is learnable in finite time, even if the syntactic span of the language is infinite.

A temporal difference approach could for instance do the trick. Given a signal f(x,y) = b, where b is 0 or 1, an agent could record all of the sentences f(0,0) = b, f(0,1) = b, f(1,0) = b, and f(1,1) = b with fractional observation counts determined by the subjective probabilities of x = 0, x = 1, y = 0, and y = 1.
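
A rough sketch of this bookkeeping (the target function, the belief values, and everything else below are made up for illustration):

```python
import random

random.seed(1)

def target(x, y):
    return x & y  # hypothetical compositional meaning: logical AND

# counts[(x, y)][b] accumulates fractional evidence that f(x, y) = b.
counts = {(x, y): [0.0, 0.0] for x in (0, 1) for y in (0, 1)}

for _ in range(2000):
    x, y = random.randint(0, 1), random.randint(0, 1)
    b = target(x, y)
    # The agent is only probabilistically sure which arguments were intended
    # (an assumed 80/20 split), so it spreads the observation over all cells.
    p_x = {x: 0.8, 1 - x: 0.2}
    p_y = {y: 0.8, 1 - y: 0.2}
    for xi in (0, 1):
        for yi in (0, 1):
            counts[(xi, yi)][b] += p_x[xi] * p_y[yi]

# The learned truth table: the value with the larger accumulated count wins.
learned = {cell: int(c[1] > c[0]) for cell, c in counts.items()}
print(learned)  # {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
```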

Friday, February 17, 2012

Gigerenzer and Brighton: "Homo Heuristicus" (2009)

Somewhat counterintuitively, Gerd Gigerenzer and Henry Brighton argue in this article that quick-and-dirty decision algorithms can actually have higher predictive accuracy than more informed methods. They explain this as an effect of the so-called "bias-variance dilemma."

The Risk of Overfitting
Put slightly differently, the issue is that a fitting process with many degrees of freedom is more likely to pick up random noise along with the genuine patterns than a fitting process with less freedom. A large model space will thus produce a better fit to the data, but possibly at the cost of responding to accidents.

They illustrate this with an example, fitting polynomials of increasing degrees to a set of data points recording the temperature in London in the year 2000. As expected, they achieve a better fit with higher degrees; however, the fitted curves also acquire more idiosyncrasies and less predictive power as the degree increases (Gigerenzer and Brighton 2009: 118):

[Figure: polynomial fits of increasing degree to the London 2000 temperature data; Gigerenzer and Brighton (2009), p. 118.]

As they note, the predictive power of the best-fit model essentially flows from the environment, not from an inherent quality of the model space itself. In a different environment, a polynomial of degree 4 could thus easily have been wildly inadequate.
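
The flavor of the illustration is easy to re-create. Since I don't have the London temperature series, a noisy sine curve stands in for it below, and the degrees and noise level are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 12)          # a year's worth of (sparse) samples
x_test = np.linspace(0, 1, 200)          # fresh points to predict

def climate(x):
    return 10 + 8 * np.sin(2 * np.pi * x)   # the "true" seasonal pattern

y_train = climate(x_train) + rng.normal(0, 2, x_train.shape)
y_test = climate(x_test) + rng.normal(0, 2, x_test.shape)

for degree in (2, 4, 11):
    poly = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    train_mse = float(np.mean((poly(x_train) - y_train) ** 2))
    test_mse = float(np.mean((poly(x_test) - y_test) ** 2))
    print(degree, round(train_mse, 1), round(test_mse, 1))
# The fit to the training points improves monotonically with the degree (the
# degree-11 polynomial interpolates all twelve of them), but the error on the
# fresh points blows up for the high degrees.
```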

Ecological Rationality
The cognitive consequence that the authors draw from these considerations is that the kind of heuristics people use to make judgments may in fact be highly adapted to the kind of environment they live in, even though they may seem statistically crude.

They conceptualize the different learning strategies that an organism can have using two concepts from machine learning, bias and variance (p. 117). As far as I can tell from their informal description, the difference between these two is the summation order:

Bias is the distance between the average best-fit function and the underlying data-generating function (averaged over all data sets). Variance is the average distance between the individual best-fit functions and the average best-fit function (again, averaged over all data sets).

I have here assumed that function distance is a matter of summing a number of squared differences.
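
For squared distances, the relation between these quantities can be checked numerically. The sketch below (my own toy setup: many simulated data sets, a fixed-degree polynomial fit to each, everything evaluated at a single query point) confirms that the average fit-to-truth distance decomposes exactly into the squared bias plus the spread of the fits around their own average.

```python
import numpy as np

rng = np.random.default_rng(0)

def truth(x):
    return np.sin(2 * np.pi * x)

x = np.linspace(0, 1, 20)
x_query, noise_sd, degree = 0.3, 0.3, 3

fits = []
for _ in range(2000):                      # one best-fit function per data set
    y = truth(x) + rng.normal(0, noise_sd, x.shape)
    poly = np.polynomial.Polynomial.fit(x, y, degree)
    fits.append(poly(x_query))
fits = np.array(fits)

bias_sq = (fits.mean() - truth(x_query)) ** 2       # distance of the average fit
variance = fits.var()                               # spread around the average fit
error = np.mean((fits - truth(x_query)) ** 2)       # average distance to the truth
print(round(float(bias_sq), 4), round(float(variance), 4), round(float(error), 4))
# error == bias_sq + variance: whether you average before or after measuring
# the distance to the truth is exactly what separates bias from total error,
# and the variance is the gap between the two.
```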

Tallying and Take-the-best Algorithms
As an illustration of the power of heuristics, Gigerenzer and Brighton consider an algorithm that I think is quite interesting in its own right. The idea behind the learning algorithm is to look for the single most telling difference between two objects and use that as a predictor of some quality of interest.

In the concrete example they discuss (pp. 113-116) the task was to teach a program to pick the larger of two cities based on a number of cues.

The cues were binary variables "such as whether the city had a soccer team in the major league" (p. 113). A city was thus described by a vector of binary numbers, and the training set consisted of a number of comparisons between such binary strings of the type "(0, 1, ..., 0) > (1, 1, ..., 0)".

Based on the data set, one can order the cues in decreasing order of "validity," where validity is defined as the relative frequency with which a certain difference in cues (one city has a university, the other does not) goes together with the corresponding judgment being correct (the city with the university really is the larger one).

Confronted with a new example, the program then runs down the sorted list of cues, looks for the first one that discriminates between the pair, and bases its judgment solely on that single cue. (Note that cues are frequently highly dependent, so Bayesian methods cannot be employed directly.)
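
Here is how I read the procedure in code. The cue vectors and comparisons below are invented, not the actual city data.

```python
def cue_validities(comparisons, n_cues):
    """comparisons: list of (winner, loser) cue vectors from the training set."""
    validities = []
    for i in range(n_cues):
        right = wrong = 0
        for winner, loser in comparisons:
            if winner[i] != loser[i]:            # the cue discriminates
                if winner[i] > loser[i]:
                    right += 1
                else:
                    wrong += 1
        validities.append(right / (right + wrong) if right + wrong else 0.0)
    return validities

def take_the_best(a, b, validities):
    """Decide whether a beats b using only the first discriminating cue."""
    for i in sorted(range(len(validities)), key=lambda i: -validities[i]):
        if a[i] != b[i]:
            return a[i] > b[i]
    return None                                  # no cue discriminates: guess

# Hypothetical training comparisons: (larger city, smaller city).
train = [
    ((1, 1, 0), (0, 1, 0)),
    ((1, 0, 1), (0, 0, 1)),
    ((1, 1, 1), (1, 0, 1)),
    ((0, 1, 1), (0, 0, 1)),
    ((1, 0, 0), (0, 1, 0)),
]
v = cue_validities(train, 3)
print(v)                                         # cue 0 comes out as most valid
print(take_the_best((1, 0, 0), (0, 1, 1), v))    # True, decided by cue 0 alone
```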

The results are quite impressive: As the training set approaches 50 objects, the accuracy of the program approaches 75%. I don't know, however, whether "50 objects" means 50 comparisons or 50 x 49 = 2450 comparisons. If the latter is the case, then the growth rate of the accuracy levels looks a little less impressive.