Somewhat counterintuitively, Gerd Gigerenzer and Henry Brighton argue in this article that quick-and-dirty decision algorithms can actually have higher predictive accuracy than more informed methods. They explain this as an effect of the so-called "bias-variance dilemma."
The Risk of Overfitting
Put slightly differently, the issue is that a fitting process with many degrees of freedom is more likely to respond to random noise as well as to genuine patterns than a fitting process with fewer degrees of freedom. A large model space will thus produce a better fit to the data, but possibly at the cost of responding to accidents.
They illustrate this by fitting polynomials of increasing degree to a set of data points recording the temperature in London in the year 2000. As expected, they achieve a better fit with higher degrees; however, the fitted curves also acquire more idiosyncrasies and less predictive power as the degree increases (Gigerenzer and Brighton 2009: 118).
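To make the pattern concrete, here is a minimal sketch of that kind of experiment. The data are a synthetic stand-in for the London temperature series (which I don't have), and the split into training and held-out days is my own choice; only the qualitative effect matters.

```python
import numpy as np

# A minimal sketch of the degree-of-fit experiment, using made-up noisy
# "temperature" data instead of the actual London 2000 series. The point is
# only the qualitative pattern: higher-degree polynomials fit the training
# sample better but predict the held-out points worse.

rng = np.random.default_rng(0)
days = np.linspace(0, 1, 60)                      # normalized day of year
true_curve = 10 + 8 * np.sin(2 * np.pi * days)    # invented seasonal pattern
temps = true_curve + rng.normal(0, 2, size=days.size)

# Split into a training half and a held-out half (every other day)
train, test = np.arange(0, 60, 2), np.arange(1, 60, 2)

for degree in (1, 4, 12):
    coeffs = np.polyfit(days[train], temps[train], degree)
    fit = np.polyval(coeffs, days)
    train_err = np.mean((fit[train] - temps[train]) ** 2)
    test_err = np.mean((fit[test] - temps[test]) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err:5.2f}, test MSE {test_err:5.2f}")
```

Running this, the training error keeps shrinking as the degree goes up, while the error on the held-out days eventually grows again, which is the overfitting effect the authors describe.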
As they note, the predictive power of the best-fit model essentially flows from the environment, not from an inherent quality of the model space itself. In a different environment, a polynomial of degree 4 could thus easily have been wildly inadequate.
Ecological Rationality
The cognitive consequence that the authors draw from these considerations is that the kind of heuristics that people use to make judgments may in fact be highly adapted to the kind of environment they live in, even though they may seem statistically crude.
They conceptualize the different learning strategies that an organism can have using two concepts from machine learning, bias and variance (p. 117). As far as I can tell from their informal description, the difference between these two is the summation order:
Bias is the distance between the average best-fit function and the underlying data-generating function (averaged over all data sets). Variance is the average distance between the individual best-fit functions and their own average (again, averaged over all data sets).
I have here assumed that function distance is a matter of summing squared differences over the data points.
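For what it's worth, the standard squared-error decomposition makes this reading precise. In the notation below (mine, not the article's), \hat{f}_D is the function fitted to data set D, f is the data-generating function, and the expectation runs over data sets:

```latex
\mathbb{E}_D\!\left[\big(\hat{f}_D(x) - f(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{bias}^2}
  \;+\;
  \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\right]}_{\text{variance}}
```

If the observations are themselves noisy, an irreducible noise term is added on the right; the left-hand side is the expected prediction error at x, so bias and variance really do differ in whether you average before or after taking the squared difference.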
Tallying and Take-the-best Algorithms
As an illustration of the power of heuristics, Gigerenzer and Brighton consider an algorithm that I think is quite interesting in its own right. The idea behind the learning algorithm is to look for the single most telling difference between two objects and use that as a predictor of some quality of interest.
In the concrete example they discuss (pp. 113-116) the task was to teach a program to pick the larger of two cities based on a number of cues.
The cues were binary variables "such as whether the city had a soccer team in the major league" (p. 113). A city was thus described by a vector of binary numbers, and the training set consisted of a number of comparisons between such binary strings of the type "(0, 1, ..., 0) > (1, 1, ..., 0)".
Based on the data set, one can order the cues in decreasing order of "validity," where validity is defined as the relative frequency with which a certain difference in cues (one city has a university, the other does not) coincides with one of the two possible judgments (the city with a university is the larger).
Confronted with a new example, the program should then run down the sorted list of cues, looking for the first one that discriminates between the pair, and base its judgment solely on that single cue. (Note that cues are frequently highly dependent, so Bayesian methods cannot be employed directly.)
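Here is a minimal sketch of that procedure as I read it. The cue names, the toy comparisons, and the tie-breaking rule are mine, not Gigerenzer and Brighton's; the point is just the two steps of sorting cues by validity and deciding on the first discriminating one.

```python
from typing import Dict, List, Tuple

# A city is a binary cue vector, e.g. {"soccer_team": 1, "university": 0}.
City = Dict[str, int]

def cue_validities(pairs: List[Tuple[City, City]], cues: List[str]) -> Dict[str, float]:
    """Estimate each cue's validity: among the pairs where the cue discriminates,
    the fraction in which the city with the positive cue value is the larger one.
    Pairs are given as (larger_city, smaller_city)."""
    validities = {}
    for cue in cues:
        right = wrong = 0
        for larger, smaller in pairs:
            if larger[cue] != smaller[cue]:
                if larger[cue] > smaller[cue]:
                    right += 1
                else:
                    wrong += 1
        validities[cue] = right / (right + wrong) if right + wrong else 0.5
    return validities

def take_the_best(a: City, b: City, validities: Dict[str, float]) -> City:
    """Run down the cues in decreasing order of validity and decide on the
    first cue that discriminates between the two cities; guess if none does."""
    for cue in sorted(validities, key=validities.get, reverse=True):
        if a[cue] != b[cue]:
            return a if a[cue] > b[cue] else b
    return a  # arbitrary tie-break

# Toy usage with invented cues and comparisons:
cues = ["soccer_team", "university", "state_capital"]
training_pairs = [
    ({"soccer_team": 1, "university": 1, "state_capital": 0},
     {"soccer_team": 0, "university": 1, "state_capital": 1}),
    ({"soccer_team": 1, "university": 0, "state_capital": 1},
     {"soccer_team": 0, "university": 0, "state_capital": 0}),
]
validities = cue_validities(training_pairs, cues)
print(validities)
print(take_the_best(training_pairs[0][0], training_pairs[0][1], validities))
```

Note that the decision uses only the first discriminating cue; all the lower-ranked cues are simply ignored, which is exactly what makes the heuristic so frugal.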
The results are quite impressive: as the training set approaches 50 objects, the accuracy of the program approaches 75%. I don't know, however, whether "50 objects" means 50 comparisons or 50 × 49 = 2450 comparisons. If the latter is the case, the growth rate of the accuracy levels looks a little less impressive.