Wednesday, June 20, 2012

Haun and Call: "Great apes’ capacities to recognize relational similarity" (2009)

In an attempt to map the evolutionary history of analogical reasoning, this paper compares the performance of children with that of apes on a selection task. The task is supposed to measure the ability to recognize "relational similarity," i.e., to identify a cup by its position relative to other cups.

The Chimpanzee Experiment

The set-up is this: An experimenter hides an object in one of three cups at his end of the table; the participant then has to find an identical object in one of three cups at the other end. However, the three cups are closer together at the experimenter's end of the table, and also aligned so that proximity clues conflict with relative-position clues.


There are two additional conditions that I am not considering here: One in which the three cups are connected with plastic tubes to suggest that the object can roll from one cup to another, and another condition in which they are connected by strips of gray tape, suggesting a structural parallel.

In the task with no clues, chimpanzees seem to choose the "right" cup above chance levels (p. 155). This is taken as evidence of an ability to recognize relational similarity.

A Bias Reading

I personally feel a little reluctant about counting the cup on the far left as the only "correct" choice, although it will certainly begin to appear more and more so as the system gets established through repeated trials.

However, the proximity argument suggesting the "wrong" cup is not necessarily a wrong argument in all real-life situations. What the experiment shows is thus, I think, that chimpanzees have a taxonomic bias that orangutans do not, i.e., chimps prefer one-to-one mappings even when they conflict with proximity clues.

Children, on the other hand, show a marked bias towards the middle cup wherever the experimenter hides the target object. None of the apes seem to have this symmetry bias.

A Possible Process Analysis

A hierarchical Bayesian model might peel these biases apart by identifying the following levels of modeling:
  1. Pr(a), the absolute probability of finding the target object in cup a, regardless of where the experimenter hid it.
  2. Pr(a|x), the conditional probability of finding the object in cup a, given that it was hidden in cup x.
  3. Pr(a|x,m), the conditional probability of finding the object in cup a, given that it was hidden in cup x and that the experimenter is using a mapping m.
All of these layers can in principle be informed by prior knowledge. For instance, (1) might exhibit a symmetry bias, (2) a proximity bias, and (3) a taxonomic bias.
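To make the hierarchy concrete, here is a toy sketch in Python (all numbers and mixing weights are invented, not estimated from the paper) of how three bias-driven strategies could be combined into a single choice distribution over the three cups:

```python
def choice_distribution(x, weights=(0.2, 0.4, 0.4)):
    """Mix three bias-driven strategies into one distribution over
    cups 0..2, given that the experimenter hid the object in cup x.
    All numbers are illustrative assumptions."""
    # Level 1: symmetry bias -- favor the middle cup regardless of x.
    symmetry = [0.1, 0.8, 0.1]
    # Level 2: proximity bias -- in this set-up the physically nearest
    # cup is assumed to be offset by one from the matching cup.
    nearest = min(x + 1, 2)
    proximity = [0.05, 0.05, 0.05]
    proximity[nearest] = 0.9
    # Level 3: taxonomic bias -- a one-to-one mapping m picks cup x.
    mapping = [0.05, 0.05, 0.05]
    mapping[x] = 0.9
    w_sym, w_prox, w_map = weights
    mixed = [w_sym * s + w_prox * p + w_map * m
             for s, p, m in zip(symmetry, proximity, mapping)]
    total = sum(mixed)
    return [v / total for v in mixed]
```

With these made-up weights, hiding the object in cup 0 yields a distribution dominated by the proximity-plus-symmetry pull toward the middle cup, with the mapping-consistent cup 0 in second place.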

Monday, June 18, 2012

Sattah and Tversky: "Additive Similarity Trees" (1977)

In this paper, Shmuel Sattah and Amos Tversky describe a FORTRAN program that can produce visual representations of similarity data. It constructs a tree in which similar things are grouped together, and branch heights show differences. The algorithm is applied to example data sets, and its faithfulness as a representation of the original similarity matrix is compared to other visualization techniques.

The paper owes a lot to the mathematical legwork done by Peter Buneman. I'll try here to reconstruct the algorithm that Sattah and Tversky propose as clearly as possible and without much commentary.

Translations Between Distances and Graphs

All distance measures can be represented by a graph. We can always simply draw a line between every pair of objects in our domain and assign the right "length" to the line.

However, not all kinds of graphs can represent all distance measures on a domain, and not all graphs can be drawn on paper in a way that maps arc "lengths" to distances on the paper. It is thus a substantial task to identify the classes of distance measures that can be represented by certain kinds of drawings.

In order to investigate this question, a number of graph classes can be defined, and arithmetical conditions can then be given which select the exact class of distance matrices that each graph type can represent. The most interesting cases are tree representations (cycle-free graphs) and totally connected graphs (corresponding to embedding the domain in 2D space).

Additive Trees and Ultrametric Trees

There are two types of tree that can represent certain classes of distance measures well: ultrametric trees and additive trees.

An ultrametric tree is a rooted binary tree in which the distance between the root and any particular leaf is always the same. It can thus be drawn with the root on top and the leaves all at the same level.

An additive tree is an unrooted binary tree with arbitrary arc lengths.


As can be seen from the definitions, the set of ultrametric trees is a subset of the set of additive trees (given that we forget about the roots). The best additive tree will thus always fit a distance measure at least as well as the best ultrametric tree.

Arithmetic Conditions

There are two inequalities which accurately describe the sets of distance matrices that can be translated into trees of the two kinds. These are the ultrametric inequality and the additive inequality.

A distance matrix satisfies the ultrametric inequality if the following holds for all A, B, and C:
d(A,B) ≤ max{d(A,C), d(B,C)}
It fulfills the tree inequality (or additive inequality, or four-point condition) if the following holds for all A, B, C, and D:
d(A,B) + d(C,D) ≤ max{d(A,C) + d(B,D), d(A,D) + d(B,C)}
For completeness, compare these two conditions with the familiar triangle inequality:
d(A,B) ≤ d(A,C) + d(B,C)
Logically, the relations between these conditions are:
  • The ultrametric inequality implies the tree inequality.
  • The tree inequality implies the triangle inequality.
The ultrametric inequality is thus the most restrictive of the three, and the triangle inequality the least restrictive. (Four distinct points on the number line will satisfy the tree inequality, but not the ultrametric inequality. The four code words 00, 01, 10, and 11 will, using Hamming distance, satisfy the triangle inequality, but not the tree inequality.)
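These three conditions are easy to check mechanically. The following Python snippet tests them on the two examples just mentioned (four collinear points and the two-bit code words under Hamming distance):

```python
from itertools import permutations

def is_ultrametric(points, d):
    """d(A,B) <= max(d(A,C), d(B,C)) for all triples."""
    return all(d(a, b) <= max(d(a, c), d(b, c))
               for a, b, c in permutations(points, 3))

def is_additive(points, d):
    """The four-point condition for all quadruples."""
    return all(d(a, b) + d(c, w) <= max(d(a, c) + d(b, w),
                                        d(a, w) + d(b, c))
               for a, b, c, w in permutations(points, 4))

def is_metric(points, d):
    """The triangle inequality for all triples."""
    return all(d(a, b) <= d(a, c) + d(b, c)
               for a, b, c in permutations(points, 3))

# Four points on the number line: additive, but not ultrametric.
line = [0, 1, 2, 3]
line_d = lambda x, y: abs(x - y)

# Two-bit code words under Hamming distance: metric, but not additive.
codes = ["00", "01", "10", "11"]
ham = lambda x, y: sum(p != q for p, q in zip(x, y))
```

Running the checks confirms both counterexamples from the parenthesis above.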

Representation

We can then state the two representation theorems that are the real point of interest:
  • A distance matrix can be represented as an ultrametric tree if and only if it fulfills the ultrametric inequality.
  • A distance matrix can be represented as an additive tree if and only if it fulfills the tree inequality.
Additive trees are thus the more flexible kind of representation, as also noted above. (See Peter Buneman's 1974 paper on the topic.)

As Sattah and Tversky note, any constructive proof of the additive representation theorem will, by definition, also be a little machine that can be used to translate distance matrices into trees. However, they are interested in a method that will construct a tree from any matrix and do well when the additive inequality is satisfied or almost satisfied. I am not sure whether their algorithm produces a perfect representation when the tree inequality is perfectly satisfied.

The Construction Algorithm

Sattah and Tversky's algorithm for creating an additive tree from a matrix of distance data contains two parts, a construction method and an estimation method. The construction method chooses the shape of the tree, and the estimation method chooses the arc lengths.

The construction method can be seen as a bottom-up lumping procedure, like, for instance, Huffman's algorithm for constructing code trees. If we think of a single data point as a tree with only one node, then the method can further be thought of as a recursive algorithm for joining more and more subtrees together until everything is connected.

The idea is roughly this:
  1. A set {x,y,z,w} of four points can be partitioned into two pairs in three different ways. The one that gives the lowest sum of within-pair distances is the "best." We thus run through all quadruples and give 1 point to a pair {x,y} every time it occurs in a "best" partition.
  2. Now sort the pairs according to the number of points they have gotten and start by picking the pair {x,y} with the highest score. From this pair, you produce the new subtree z = ½(x + y), and then remove all other pairs that contain x or y. Continue creating new trees this way until your ordered list of pairs is empty.
  3. You have now produced a new list of trees, roughly half the size of the original domain. Repeat the aggregation procedure above with these new subtrees as inputs until there are fewer than four subtrees left on the list (in which case, you can just join these, and you're done).
I presume that the distance between two trees z = ½(x + y) and c = ½(b + a) should be computed as the average of the distances between one point in {x, y} and one point in {a,b}; at least, the distance between a point a and a tree z = ½(x + y) is the average distance between the point a and a point in {x,y}.

James Corter, in his PASCAL implementation of the algorithm, uses a variation in which the best of the three partitions gets 2 points, and the second-best gets 1 point. I am not sure what the differences are between this method and the original.

An Example

Consider this (pretty arbitrarily constructed) matrix of distances between five objects:

   A  B  C  D  E
A  0  1  2  3  4
B  1  0  4  3  2
C  2  4  0  5  4
D  3  3  5  0  1
E  4  2  4  1  0

Since there are 5 elements in this domain, we have to consider 5 different quadruples and in each case choose the "best" partition. For instance, in considering the quadruple {A,B,D,E} we have to compare the following three statistics:
  1. d(A,B) + d(D,E)  =  1 + 1  =  2
  2. d(A,D) + d(B,E)  =  3 + 2  =  5
  3. d(A,E) + d(B,D)  =  4 + 3  =  7
Since the first of these numbers is the smallest one, the partition of {A,B,D,E} into {A,B} and {D,E} is the "best" one.

Going through all quadruples this way yields the following list of optimal partitions:

Quadruple   Pair 1   Pair 2
A,B,C,D     A,C      B,D
A,B,C,E     A,C      B,E
A,B,D,E     A,B      D,E
A,C,D,E     A,C      D,E
B,C,D,E     B,C      D,E

The two pairs {A,C} and {D,E} thus both receive a score of 3 points, while the four other pairs mentioned in the table each receive a score of 1 point. The remaining four pairs receive a score of 0 points.

The list of candidate pairs thus begins:
{A,C}, {D,E}, {A,B}, {B,C}, {B,D}, {B,E}, {C,D}, {C,E}, ...
Following the algorithm, we then aggregate A and C into the tree F = ½(A + C). This leaves us with the list
{D,E}, {B,D}, {B,E}
We consequently create the subtree G = ½(D + E).

This exhausts the list of pairs. We can then feed the three subtrees F, G, and B into the algorithm and start over, but since there are fewer than four points left, we simply join them into a tree, and we are done.
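The scoring step of the construction method (step 1 above) can be sketched in a few lines of Python; running it on the example matrix reproduces the scores just computed:

```python
from itertools import combinations

# The distance matrix from the example above.
labels = "ABCDE"
D = {
    ("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 3, ("A", "E"): 4,
    ("B", "C"): 4, ("B", "D"): 3, ("B", "E"): 2,
    ("C", "D"): 5, ("C", "E"): 4,
    ("D", "E"): 1,
}

def d(x, y):
    return 0 if x == y else D[tuple(sorted((x, y)))]

def score_pairs(labels):
    """For each quadruple, find the partition into two pairs with the
    smallest sum of within-pair distances, and give each of its two
    pairs one point."""
    scores = {}
    for x, y, z, w in combinations(labels, 4):
        partitions = [((x, y), (z, w)), ((x, z), (y, w)), ((x, w), (y, z))]
        best = min(partitions, key=lambda p: d(*p[0]) + d(*p[1]))
        for pair in best:
            key = tuple(sorted(pair))
            scores[key] = scores.get(key, 0) + 1
    return scores
```

The result gives {A,C} and {D,E} three points each and the four pairs containing B one point each; pairs that never appear in a best partition get no entry at all.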



The Estimation Algorithm

It is not entirely clear to me how Sattah and Tversky assign lengths to the arcs of the additive tree once it has been constructed.

It is clear that all distances depend linearly on the arc lengths, so for data sets that satisfy the tree inequality, we just have to solve some linear equations, which is doable even if cumbersome. However, if the data set cannot be fitted exactly onto a tree, it is less clear how to choose the tree that minimizes the mean squared error (which is the statistic Sattah and Tversky use as a success criterion). Still, a set of Lagrange multipliers ought to do the trick.
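As a sketch of the brute-force linear-algebra route (not Sattah and Tversky's own method): assuming the tree shape from the example above, each leaf-to-leaf distance is the sum of the arc lengths on the connecting path, so a least-squares fit is a single call to a linear solver. Nothing here enforces nonnegative arc lengths, so the fitted values can in principle come out negative.

```python
import numpy as np

# Arcs of the tree built above: A-F, C-F, D-G, E-G, B-H, F-H, G-H,
# where F joins A and C, G joins D and E, and H is the interior node
# connecting F, G, and B.  Each row of M marks the arcs on the path
# between a pair of leaves, in the order AB, AC, AD, AE, BC, BD, BE,
# CD, CE, DE.
M = np.array([
    [1, 0, 0, 0, 1, 1, 0],  # A-B
    [1, 1, 0, 0, 0, 0, 0],  # A-C
    [1, 0, 1, 0, 0, 1, 1],  # A-D
    [1, 0, 0, 1, 0, 1, 1],  # A-E
    [0, 1, 0, 0, 1, 1, 0],  # B-C
    [0, 0, 1, 0, 1, 0, 1],  # B-D
    [0, 0, 0, 1, 1, 0, 1],  # B-E
    [0, 1, 1, 0, 0, 1, 1],  # C-D
    [0, 1, 0, 1, 0, 1, 1],  # C-E
    [0, 0, 1, 1, 0, 0, 0],  # D-E
], dtype=float)
y = np.array([1, 2, 3, 4, 4, 3, 2, 5, 4, 1], dtype=float)

# Least-squares arc lengths and the resulting mean squared error.
lengths, *_ = np.linalg.lstsq(M, y, rcond=None)
mse = np.mean((M @ lengths - y) ** 2)
```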

Sattah and Tversky, however, seem to have some other method in mind (p. 328). They claim that individual arc lengths of the least-squares estimate can be computed from (i) the number of points at each end of the arc and (ii) the average distance between a point at one end of the arc and a point at the other. However, they don't describe the method that falls out of this observation any further.

Vladimir Makarenkov and Pierre Legendre (2001), though, seem to spell out a method in detail. Their algorithm is based, as far as I could see, on adjusting heights from the bottom of the tree first, moving upwards.

Barsalou: "The instability of graded structure" (1987)

Lawrence Barsalou summarizes the findings from a large number of categorization experiments performed by him and his colleagues between 1980 and 1986. The upshot is that prototypicality effects are highly unreliable and context-sensitive.

The Instability of Prototype Effects

Barsalou describes the following types of variability between judgments of prototypicality:
  1. Pages 108-109: Contrary to what Rosch claims, the between-subject reliability of similarity judgments is generally low, with an average correlation of around 50%. Rosch (1975) and Armstrong, Gleitman, and Gleitman (1983) only achieve their 90% correlation coefficients by comparing group averages rather than individuals.
  2. Pages 109-111: People differ in their similarity judgments -- students, for instance, differ from professors. Yet, people are surprisingly good at simulating the similarity estimates of other people when asked to make similarity judgments from their point of view.
  3. Pages 111-112: People are not very stable in their judgments. Their self-correlation is about 92% after one hour, 87% after one day, and around 80% after one week or more.
I would imagine that the task about "taking the point of view" of someone else is highly sensitive to tiny details in the test materials. Barsalou does not discuss the methodology in detail, but I can see that the 1984 research report that describes the experiment is available at his website.

The Alternative Theory

In order to fix the theoretical problems that result from this instability of prototype effects, Barsalou suggests that concepts should be seen as "temporary constructs in working memory that are tailored to current situations" (p. 120). This does indeed allow for a more anarchistic type of conceptual system, but it is also a very weak theory (i.e., it is consistent with almost all types of behavior).

My suspicion is here -- as often when "representations" start to disintegrate into a disorderly pile of highly unstructured improvisations -- that the whole set-up somehow captures the wrong phenomenon. This can be true both of wild cognition (how do you decide, in conversation, whether or not to call something, say, a tool?) and in the experimental situation (how do subjects construe the questions they get, and what do they think that the experimenter wants them to do?).

The context-sensitivity seems, I think, to suggest that categorization is a fairly "high-level" type of cognition, in spite of the claims of Rosch and others. People might just use quite intelligent, deliberate, and context-sensitive strategies for picking words, finding or ignoring similarities, and the like. But I realize this is pretty weak, too.

Thursday, June 14, 2012

Lakoff: "Cognitive models and prototype theory" (1987)

Lakoff argues that the prototype effects exist because concepts are defined by several partly overlapping "cognitive models." A possible example of a mother might then fit some of the models (the "genetic mother") and not others ("the wife of the father"). This may lead to graded membership judgments.

Structure of the Paper

The paper is a little difficult to navigate, as it contains a large number of sections which are all on the same level of organization.

The sections differ in content and length. Their headings are:
  1. Untitled introduction
  2. Interactional properties
  3. Cognitive models
  4. Graded models
  5. The idealized character of cognitive models
  6. Cognitive models versus feature bundles
  7. Mother
  8. Metonymic models
  9. Metonymic sources of prototype effects
  10. The housewife stereotype
  11. Working mothers
  12. Radial structures
  13. Some kinds of metonymic models
  14. Social stereotypes
  15. Typical examples
  16. Ideals
  17. Paragons
  18. Generators
  19. Submodels
  20. Salient Examples
  21. Radial categories
  22. Japanese hon
  23. Categories of mind, or mere words
  24. What is prototype theory
  25. The core + identification proposal
  26. Osherson and Smith
  27. Armstrong, Gleitman, and Gleitman
  28. Conclusion

Cognitive Clusters

His own theory of cognitive models is most clearly explained in sections 6 and 7 (pp. 66-70). His idea is that concepts are defined in terms of a cluster of competing models -- for instance, a salient example, an idealized picture, and a positive paradigm.

The effect seems oddly close to the weighted lists of attributes used by, e.g., Linda Coleman and Paul Kay (1981). But he insists that bundles of cognitive models are empirically distinguishable from bundles of features. He refers to Eve Sweetser (1987) for support of this claim, but does not discuss the evidence.

Sections 14 through 20 (and perhaps section 21?) are intended to give examples of how "cognitive models" can look. They can look like a lot of different things, it appears.

An Alternative Theory

Sections 25, 26, and 27 criticize a "reactionary" counterproposal.

According to this theory, category judgments can be made in two ways, by deliberation or by quick-and-dirty heuristics. Prototype effects are then, as far as I understand, only present when the heuristic method is used.

Lakoff identifies a comment from his own 1972 paper on hedges (journal version, 1973) as the inspiration for this new theory. He finds this "ironic."

The arguments he proposes against this two-method theory oddly resemble arguments in favor of it (e.g., p. 92). His main concern, if I understand it correctly, is the metaphysical assumptions of the theory rather than any specific empirical problem.

Rosch: "Principles of Categorization" (1978)

Eleanor Rosch's contribution to Cognition and Categorization really consists of two independent parts: an overview of her experiments investigating basic-level categories (pp. 30-35), and an overview of her experiments with prototype effects (pp. 35-41). I will only deal with the first part here.

Cue Validity

The most central concept in Rosch's discussion of basic-level categories is the notion of cue validity. This is defined for a category such as "bird," which is more or less reliably identified by cues such as "wings." She explains:
The cue validity of an entire category may be defined as the summation of the cue validities for that category of each of the attributes of the category. (pp. 30-31)
This immediately raises two questions:
  1. Do all cues count in the summation with equal weight? There are infinitely many possible cues and only a few highly valid ones. This suggests that more explicit assumptions about "salience" are needed.
  2. With what weight do the various members of a category contribute to the average? Equally? Weighted by the frequency of the linguistic label? Weighted by the frequency of the thing?
While these questions may seem like technical remarks, they do in fact relate to some deeper issues that I will mention below.
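To make question 1 concrete, here is a toy computation of cue validities from invented co-occurrence counts, with every cue counting with equal weight in the summation (exactly the assumption the question challenges):

```python
# Invented toy observations of (category, attribute) pairs.
observations = [
    ("bird", "wings"), ("bird", "wings"), ("bird", "beak"),
    ("bird", "legs"),
    ("dog", "legs"), ("dog", "fur"), ("dog", "legs"),
]

def cue_validity(category, cue, obs):
    """P(category | cue), estimated from co-occurrence counts."""
    with_cue = [cat for cat, c in obs if c == cue]
    return with_cue.count(category) / len(with_cue)

def category_cue_validity(category, obs):
    """Rosch's summation: the sum of cue validities, over the
    category's cues, all weighted equally."""
    cues = {c for cat, c in obs if cat == category}
    return sum(cue_validity(category, c, obs) for c in cues)
```

On these made-up counts, "wings" is a perfectly valid cue for "bird" (validity 1.0), while "legs" is weak (1/3), and the category's total is just the unweighted sum of the two plus "beak."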

The Ambiguity of "Basic"

There are two competing characterizations of "basic" in Rosch's work, an ostensive one and a perceptual one. It's not always clear which one she is taking as definitive, and this sometimes introduces problems.

Both definitions apply to concept trees and are meant to pick out a particular depth in such a tree. They do so by locating the level of abstraction at which either
  1. the categories "car," "chair," "tomato," and "hammer" are found; or
  2. average cue validity is maximized.
My worry is that her cross-cultural, developmental, and evolutionary claims may turn out to be tautologies when we look closer at the ups and downs of her theory.

For instance, if the "basic" means "maximal cue validity," then of course children learn names from this level first. On the other hand, if Rosch gets to pick what counts as "basic" in each branch of the English category system ("chair," "car," "tomato," ...), then she can obviously just pick the level that fulfills the second definition.

Learned Perception

The fact that she might unknowingly be making the tautological point that "normal things are normal" is hinted at when she comments that English speakers tend to be less able to distinguish between plants than the ostensive definition suggests.

This observation was echoed more recently by Jerome Feldman:
For many city dwellers, tree is a basic category—we interact the same way with all trees. But for the professional gardener, tree is definitely a superordinate category (Feldman 2006: 186)
With those kinds of qualifications, basic-level categories are certainly guaranteed to have all of the properties that Rosch claims. But any claim about their universal centrality will also become an empty verbalism.

Note how this also ties in with the sticky issue of trained perception:
One influence on how attributes will be defined by humans is clearly the category system already existent in the culture at a given time. Thus our segmentation of a bird's body such that there is an attribute called "wings" may be influenced not by perceptual factors [...] but also by the fact that at present we already have a cultural and linguistic category called "birds." (p. 29)
She is apparently aware of this problem, but not willing to face the implication that complex cues like plumage are themselves categories that are open-ended and ambiguous.

Mutual Dependence and Iterated Learning

She does note, however, that attributes might be extracted from categories just as well as categories might be based on attributes. However:
Unfortunately, to state the matter in such a way is to provide no clear place at which we can enter the system as analytical scientists. What is the unit with which to start our analysis? (p. 42)
To me, this suggests a game-theoretical analysis. A category system is invented by people, but also has to be transmitted; fixed points in such an iterated learning process will be the systems that trade off difficulty of acquisition for pragmatic necessity, I guess.

This process could probably be modeled relatively easily in a multi-agent system with a set of Bayesian learners. However, such a model will probably be highly sensitive to the assumptions made about the environment of learning (e.g., the frequency of birds and the frequency of winged-ness).
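A minimal sketch of such a transmission chain, assuming (my choice, not anything in Rosch) Beta-Bernoulli learners who sample a hypothesis from their posterior and pass data on to the next generation:

```python
import random

def iterated_learning(generations=50, n=10, a=2.0, b=2.0, seed=0):
    """A transmission chain of Beta-Bernoulli learners: each generation
    sees n binary observations produced by its predecessor, samples a
    hypothesis theta from its posterior Beta(a + k, b + n - k), and
    produces data for the next generation."""
    rng = random.Random(seed)
    theta = 0.9  # the first teacher's hypothesis
    history = []
    for _ in range(generations):
        k = sum(rng.random() < theta for _ in range(n))  # observed data
        theta = rng.betavariate(a + k, b + n - k)        # posterior sample
        history.append(theta)
    return history
```

Under these sampling assumptions, the distribution of hypotheses along the chain is known to converge to the learners' prior, which is one way of cashing out the "fixed points" mentioned above; the sensitivity to the data-generating environment enters through n and the teacher's initial hypothesis.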

Tuesday, June 12, 2012

Berlin: "Ethnobiological Classification" (1978)

Brent Berlin considers data from two "prescientific" cultures and concludes that their category systems are based on appearance above the basic level, and based on utility below.

Since appearances "cry out to be named" (p. 11) not all plant names will reflect practical concerns:
This finding would seem to controvert the view that preliterate man names and classifies only those organisms in the environment that have some immediate functional significance for survival. More than one-third of the named plants in both Tzeltal and Aguaruna, for example, lack any cultural utility, and these are not pestiferous plants that must be avoided due to poisonous properties or the like. (p. 11)
A couple of times in the paper (e.g., p. 20), he raises the issue of simple versus compound names for categories. He seems to think that the basic level should, normally and on average, be the lowest level that has simple names (tree, pine, etc.), but he doesn't discuss the topic specifically and in detail; perhaps he didn't have enough quantitative data for a meaningful claim.

His conclusion is that
folk biological classification is based on a recognition of natural discontinuities in the biological world that are considered to be similar or different because of gross, readily perceivable characteristics of form and behavior. (p. 24)

Rosch and Lloyd: Cognition and Categorization (1978)

Read this book, and you will understand everything about where cognitive semantics comes from, and how it sees itself. I feel like quoting the whole thing word for word.

All the seeds of future greatness and future crises are visible here – as well as a firm rooting in the AI and cognitive psychology of the golden age of frog neurons and cat retinas in the 1950s.

Take a look, for instance, at the names of some of the contributors: Brent Berlin, Eleanor Rosch, Amos Tversky, George Miller – quite a cast. Naomi Quinn and Dan Slobin, too, are involved in the background, as members of the Social Science Research Council's Committee on Cognitive Research – the Council being an institution that "the book reflects the aims of" (p. vii).

Revolution!

The trumpets are already out in the blurb on the flap, with the promise of "a conceptual revolution overtaking the study of language and cognition." A more detailed narrative is unfolded in the preface:
In the spring of 1976, a small group of psychologists, linguists, and anthropologists met at Lake Arrowhead, California, in a conference sponsored by the Social Science Research Council to discuss the nature and principles of category formation. Participants coming from the East Coast talked about Roger Brown's memorial lecture for Eric Lenneberg given a few days earlier. (p. vii)
Note the literary voice – "Four score and seven years ago ..." And then an indirect reference to Eric Lenneberg, just to put some distance to the "recalcitrant cultural relativists" that Berlin grumbles about (p. 12).

The preface continues:
Roger Brown had chosen to speak about the new paradigm of reference using research in the domain of color. But research in fields such as ethnoscience, perception, and developmental psychology was beginning to appear and might also have been cited to support the claim that categorization, rather than being arbitrary, may be predicted and explained. (p. vii)
The steady stride of scientific progress, in other words. No corner of the life of "man" will evade the searchlight of scientific attention.

"Scientists now realize…"

The boogieman is also largely the same as in 1960 – behaviorism, empiricism, relativism. The Introduction thus states:
In the stimulus–response learning paradigm that dominated American psychology in the first half of the twentieth century, both the stimulus and the response were dealth [sic] with as arbitrary systems; the focus was primarily on the connection between them. In developmental psychology, children were considered beings born into a culture in which categories and stimuli were already determined by the adult world. Anthropology, which might have sought universal principles of human experience, under the influence of Boazian [sic] cultural relativism, concentrated on cultural diversity and the arbitrary nature of the definition of categories. (p. 2)
However, somewhat confusingly, the over-rationalizing "Aristotelian" picture of knowledge is also wrong:
If other thought processes such as imagery, ostensive definition, reasoning by analogy to particular instances, or the use of metaphors were considered at all, they were usually relegated to lesser beings such as women, children, primitive people, or even to nonhumans. (p. 2)
While all of these are true observations, to me they look like a motivation for something other than a research program hailing "universal principles" and biological reductionism.

Rosch's Afterthought

But maybe this should just be seen as one of the germs of contradiction in "second-generation" cognitive science. Rosch's abrupt change of attention in the very last part of her paper certainly seems to say so.

There, in the section "The Role of Objects in Events," she falls into an almost Heideggerian mode of thought, contemplating the "events of daily human life" and the "flow of experience" (p. 43).

Not for long, though. Soon, she gets the idea of treating everyday life events according to the same principles as she had applied to chairs and cars and vegetables (p. 44). So there we are, back in familiar territory.

Tversky: "Features of Similarity" (1977)

Tversky argues that objects should be represented as feature bundles, and that the similarity of the feature bundles equals the measure of their overlap minus the measures of the two disjoint parts.

He provides large amounts of evidence that this more versatile (and vague) scheme is a necessary corrective to the metric conception of similarity.
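The contrast model is simple enough to state in code. In this sketch I take the measure f to be plain set size and pick the weights myself (Tversky allows any additive measure and free parameters); with alpha > beta, the variant-to-prototype direction comes out more similar, as in the portrait example discussed below:

```python
def contrast_similarity(a, b, theta=1.0, alpha=0.8, beta=0.2):
    """Tversky's contrast model: s(a,b) = theta*f(A and B)
    - alpha*f(A minus B) - beta*f(B minus A), with f taken here to be
    set size (a simplifying assumption)."""
    a, b = set(a), set(b)
    return theta * len(a & b) - alpha * len(a - b) - beta * len(b - a)

# Invented feature sets: the prototype has everything the variant has,
# plus some distinctive features of its own.
variant = {"face", "eyes", "smile"}
prototype = {"face", "eyes", "smile", "frame", "famous"}
```

Here contrast_similarity(variant, prototype) exceeds contrast_similarity(prototype, variant), so the "subject resembles referent" asymmetry falls out of the weighting alone.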

Finding the Relevant Dimensions

The general idea is "feature matching," a pragmatic process relying on a background notion of relevance:
When faced with a particular task (e.g., identification or similarity assessment) we extract and compile from our data base a limited list of relevant features on the basis of which we perform the required task. (p. 329)
This process can violate metric properties because the basis for the similarity may be different in different cases:
Jamaica is similar to Cuba (because of geographical proximity); Cuba is similar to Russia (because of their political affinity); but Jamaica and Russia are not similar at all. (p. 329)
This sounds like Wittgenstein, and in the last section of the paper, he does in fact get a citation, during a discussion of Eleanor Rosch's work (p. 348).

Symmetry and Reversibility

Symmetry, too, is problematic. Some prototypical examples of certain categories seem to make the central features of the category shine brightly and thus attract attention; this produces higher similarity judgements:
We tend to select the more salient stimulus, or the prototype, as a referent, and the less salient stimulus, or the variant, as a subject. We say "the portrait resembles the person" rather than "the person resembles the portrait." We say "the son resembles the father" rather than "the father resembles the son." (p. 328)
However, in certain cases, both objects may have the stereotypical character of a paradigm case:
Sometimes both directions are used but they carry different meanings. "A man is like a tree" implies that man has roots; "a tree is like a man" implies that the tree has a life history. "Life is like a play" says that people play roles. "A play is like life" says that a play can capture the essential elements of human life. (p. 328)
Whether or not Tversky selects the right features here is doubtful. But his point about feature selection is in general true, I suppose.

Context-Dependence

A large part of Tversky's paper is dedicated to compiling evidence against the symmetry of similarity judgments, and to showing prototype effects. This part of the paper is slightly dated, especially since he does not reprint any of his data, only the test statistics.

However, his examples of context-dependent similarity (pp. 340-344) are more interesting from a contemporary perspective. These include for instance the experiments in which he asked subjects to split a set of four objects into two pairs. This indirectly pointed to the context-sensitivity of feature selection.

One way he did this was by asking people to pick a cartoon drawing of a face according to similarity. So his subjects would get a neutral face and a set of three frowning or smiling faces with an instruction to pick the face most similar to the neutral one:


As the numbers indicate, the members of the reference set mattered hugely for the judgment of the leftmost and rightmost face, even though these were held constant across the two conditions.

This seems to suggest that merely having two frowney or smiley faces in the reference set implicitly tells the subjects that frowns or smiles are essential, stable features, rather than facial expressions drawn from a random distribution. A neutral face will then have a much smaller likelihood of coming from that "category."

The other "category," however, only contains a single example and thus yields higher likelihood levels, as it suggests that more variance within the category might have been possible.

Reformulations

If we set <a,b,c> = <neutral, frown, smile> and <d,e> = <dot-eye, circle-eye>, then the data set can be rephrased as follows:
Which of the following three pairs is most similar to <a,d>?
Condition 1: <b,d>, <c,e>, or <c,d>?
Condition 2: <b,d>, <b,e>, or <c,d>?
Notice that we get a symmetry here: If we swap the names b and c and rearrange the items, the two sets turn out in fact to be the same. Yet, we don't see symmetric choices.
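The swap argument can be checked mechanically. Here is a tiny sketch (my own, using the letter names a–e exactly as defined above) showing that renaming b to c and vice versa turns the condition 1 reference set into the condition 2 set:

```python
# Each option is a (face, eye) pair; a = neutral, b = frown, c = smile,
# d = dot-eye, e = circle-eye, as in the text above.
cond1 = {("b", "d"), ("c", "e"), ("c", "d")}
cond2 = {("b", "d"), ("b", "e"), ("c", "d")}

swap = {"b": "c", "c": "b"}

def rename(pairs):
    """Swap the names b and c in the face slot of every pair."""
    return {(swap.get(face, face), eye) for (face, eye) in pairs}

print(rename(cond1) == cond2)  # the two sets are identical up to renaming
```

So the two reference sets really are the same structure under a relabeling, which is what makes the asymmetric choices interesting.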

I wonder how abstractly this prompt could be presented to a subject and produce results like those Tversky got. Imagine for instance the following formulation:
Which of the following is most similar to a white mouse?
Condition 1: a black mouse, a brown rat, or a brown mouse?
Condition 2: a black mouse, a black rat, or a brown mouse?
Whatever the answer is, there is a problem with this design: it puts too much weight on the forced partitioning of the four faces. A better method is used in the experiment reported on page 344. This can essentially be thought of as the following three conditions:
Condition 1:
How similar is Chile to Venezuela?
How similar is Guatemala to Uruguay?
(etc.)
Condition 2:
How similar is Sweden to Norway?
How similar is Finland to Denmark?
(etc.)
Condition 3:
How similar is Chile to Venezuela?
How similar is Sweden to Norway?
(etc.)
With this set-up, Tversky reports having found a higher average similarity in (what corresponds to) condition 3. He explains this by the fact that the context foregrounds geographical region as a cue in that condition, but not in the other two.

Tenenbaum and Xu: "Word Learning as Bayesian Inference" (2007)

Joshua Tenenbaum and Fei Xu report some experimental findings on concept learning and simulate them in a computational model based on Bayesian inference. The paper refers to a 1999 paper by Tenenbaum for mathematical background.

Elements of the Model

The idea behind the model is that the idealized learner picks a hypothesis (a concept extension, a set of objects) based on a finite set of examples. In their experiments, the training sets always consist of either one or three examples. There are 45 objects in the "world" in which the learner lives: some vegetables, some cars, and some dogs.

As far as I understand, the prior probabilities fed into the computational model were based on human similarity judgments. This is quite problematic, as similarity can reasonably be seen as a dual of categories (with being-similar corresponding to being-in-the-same-category). So if I've gotten this right, then the answer is to some extent already built into the question.

Variations

A number of tweaks are further applied to the model:
  • The priors of the "basic-level" concepts (dog, car, and vegetable) can be manually increased to introduce a bias towards this level. This increases the fit immensely.
  • The priors of hypotheses with high internal similarity (relative to the nearest neighbor) can be increased to introduce a bias towards coherent and separated categories.
  • The likelihood favors smaller extensions: each example is assumed to be sampled uniformly from the hypothesis, so a hypothesis of size |h| assigns each example probability 1/|h|. Tenenbaum and Xu call this the "size principle."
  • Applying the learned posteriors, the learner can either use a weighted average of probabilities, using the model posteriors as weights, or simply pick the most likely model and forget about the rest. The latter corresponds to crisp rule-learning, and it gives suboptimal results in the one-example cases.
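A toy version of this machinery may make the bullet points concrete. The following sketch is my own construction, not the authors' code: the hypothesis space, object names, and prior values are invented for illustration. It uses the size-principle likelihood (1/|h|)^n and contrasts the weighted-average rule with picking only the most likely hypothesis:

```python
# Toy Bayesian concept learning: hypotheses are sets of objects (extensions).
# Names and numbers below are illustrative, not taken from the paper.
hypotheses = {
    "dalmatians": {"dal1", "dal2", "dal3"},
    "dogs":       {"dal1", "dal2", "dal3", "terrier", "poodle"},
    "animals":    {"dal1", "dal2", "dal3", "terrier", "poodle", "cat", "pig"},
}
prior = {"dalmatians": 0.2, "dogs": 0.5, "animals": 0.3}  # basic-level bias on "dogs"

def posterior(examples):
    """p(h | X) ∝ p(h) * (1/|h|)^n for every hypothesis containing all examples."""
    scores = {}
    for h, ext in hypotheses.items():
        if all(x in ext for x in examples):
            scores[h] = prior[h] * (1.0 / len(ext)) ** len(examples)
    z = sum(scores.values())
    return {h: s / z for h, s in scores.items()}

def p_in_concept(obj, examples, rule="average"):
    """Probability that obj falls under the concept, by averaging or by MAP."""
    post = posterior(examples)
    if rule == "map":  # crisp rule-learning: keep only the best hypothesis
        best = max(post, key=post.get)
        return float(obj in hypotheses[best])
    return sum(p for h, p in post.items() if obj in hypotheses[h])

# Three narrow examples sharpen the posterior toward the narrow hypothesis:
print(p_in_concept("terrier", ["dal1"]))
print(p_in_concept("terrier", ["dal1", "dal2", "dal3"]))
```

With one example, the averaging rule still gives the terrier a substantial probability of falling under the concept; with three dalmatian examples, the size principle concentrates the posterior on the narrow hypothesis, and the MAP rule excludes the terrier outright.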
I still have some methodological problems with the idea of a "basic level" in our conceptual system. Here as elsewhere, I find it question-begging to assume a bias towards this level of categorization.

Questions

I wonder how the model could be changed so as to
  • not have concept learning rely on preexisting similarity judgments;
  • take into account that similarity judgments vary with context.
Imagine a model that picked the dimensions of difference that were most likely to matter given a finite set of examples. Dimensions of difference are hierarchically ordered (e.g., European > Western European > Scandinavian), so it seems likely that something like the size principle could govern this learning method.

Monday, June 11, 2012

Perfors, Tenenbaum, Griffiths, Xu: "A tutorial introduction to Bayesian models of cognitive development" (2011)

This is an easily readable introduction to the idea behind Bayesian learning models, especially hierarchical Bayesian models. The mathematical details are left out, but the paper "Word Learning As Bayesian Inference" (2007) is cited as a more detailed account.

I remember Noah Goodman giving a tutorial on Bayesian models at ESSLLI 2010. The toolbox for that course centrally included the special-purpose programming language Church. I can see now that a number of video lectures by Goodman as well as Josh Tenenbaum and others are available at the website of a 2011 UCLA summer school.

The most interesting parts of the paper are, for my purposes, sections 2, 3, and 4. These are the ones most directly devoted to giving the reader intuitions about the ins and outs of hierarchical Bayesian models.

There, the basic idea is nicely explained with a bit of bean-bag statistics borrowed from philosopher Nelson Goodman's Fact, Fiction and Forecast (1955):
Suppose we have many bags of colored marbles and discover by drawing samples that some bags seem to have black marbles, others have white marbles, and still others have red or green marbles. Every bag is uniform in color; no bag contains marbles of more than one color. If we draw a single marble from a new bag in this population and observe a color never seen before – say, purple – it seems reasonable to expect that other draws from this same bag will also be purple. Before we started drawing from any of these bags, we had much less reason to expect that such a generalization would hold. The assumption that color is uniform within bags is a learned overhypothesis, an acquired inductive constraint. (p. 308 in the published version)
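The force of the overhypothesis can be seen in a toy calculation. The model below is my own construction, not from the paper: it compares just two hypotheses (bags are color-uniform vs. marbles are colored independently), with an invented palette of eight colors and invented draw counts:

```python
# Toy version of Goodman's marble overhypothesis: is color uniform within bags?
# The number of colors and the data are invented for illustration.
NUM_COLORS = 8

def likelihood(bags, uniform):
    """p(data | hypothesis) for a list of bags, each a list of drawn colors.
    Under 'uniform', a bag's first draw fixes its color (probability 1/NUM_COLORS)
    and every later draw must match; under 'mixed', each draw is iid uniform."""
    p = 1.0
    for draws in bags:
        if uniform:
            p *= (1.0 / NUM_COLORS) if len(set(draws)) == 1 else 0.0
        else:
            p *= (1.0 / NUM_COLORS) ** len(draws)
    return p

def p_uniform(bags, prior=0.5):
    """Posterior probability of the 'uniform within bags' overhypothesis."""
    pu = prior * likelihood(bags, True)
    pm = (1.0 - prior) * likelihood(bags, False)
    return pu / (pu + pm)

# Five bags of four same-colored draws make the overhypothesis near-certain,
# so a single purple draw from a new bag strongly predicts more purple:
bags = [["black"] * 4, ["white"] * 4, ["red"] * 4, ["green"] * 4, ["blue"] * 4]
pu = p_uniform(bags)
p_next_purple = pu * 1.0 + (1.0 - pu) * (1.0 / NUM_COLORS)
print(pu, p_next_purple)
```

The point of the quote falls out directly: before seeing any bags, a single purple draw licenses almost nothing, but after a handful of uniform bags, the learned constraint makes the generalization from one marble nearly certain.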
The paper appeared in a special issue of Cognition dedicated to probabilistic models of cognition. There are a number of other papers in the same issue that seem very interesting.

Wednesday, June 6, 2012

Sweetser: From etymology to pragmatics (1990), ch. 3

Chapter 3 of Eve Sweetser's insightful 1990 book contains an analysis of the English modal verbs may, must, shall, will, can, etc. She claims that their epistemic meanings are derived from a set of more basic meanings, which includes the deontic ones.

The Order of Domains

It is not yet entirely clear to me whether she thinks that the deontic meanings (e.g., You may kiss the bride) can be learned without any metaphorical scaffolding. She draws a lot of inspiration from Leonard Talmy (1988), who thinks that both deontic and epistemic thought are fueled by analogies to physical barriers, blocks, and forces. This seems to suggest the following ordering:

Physical obstacles  >  Social obstacles  >  Logical obstacles

This works pretty well for a word like may, which is historically derived from an Old English verb referring to physical strength and capability. Sweetser further cites some evidence that children learn the deontic senses before the epistemic ones (p. 50).

Messy Domain Orderings: let

But it is the first link, physical > social, which is causing me some problems. Sweetser herself gives some examples (p. 52) that potentially subvert her theory:
  • The crack in the stone let the water flow through.
  • I begged Mary to let me have another cookie.
My intuition about these examples is that there is a kind of personification going on in the first sentence: We picture the stone as an agent which may or may not give the water permission to flow. If that intuition holds, the physical meaning of let is an extension of the deontic.

As far as I know, no ancestor of the English language (that we have any written records of) contains only one of these senses. To be fair, let is, speculatively, hypothesized to derive from a word meaning roughly "loosen." But a historical case for the derivation from physical to deontic would probably be relatively weak.

Messy Domain Orderings: have to

Another set of examples with similar problems concerns the verb have (p. 53):
  • I have to stay home, or Mom will get mad at me.
  • I have to stay home tonight to study for the test.
In at least Ronald Langacker's equally "cognitive" analysis, the verb have primarily refers to possession. Again, this sense seems to follow the Indo-European languages as far back as we have written sources (despite a speculative derivation from a word meaning "grasp").

How do children learn the concept of possession – always through relating it to holding something in the hand? And do they make this connection when they learn the word have? Certainly, I can have my keys in my hand, but is this more basic than I have some cash?

And more specifically, is the deontic sense of have to more basic than the sense of have as "possess"? Historically, have to seems to have been a quite late derivation which initially drew its meaning out of the idea that one can have (= possess) a duty.

If this historical trajectory has any parallel in present-day adult cognition, it would suggest that we understand I have to go by translating it into a duty, then into a possession, and then into holding something in the hand. The question is whether that operation yields anything more intelligible than the concept of necessity that we began with.