
Tuesday, May 12, 2015

Miller: "The Magical Number Seven, Plus or Minus Two" (1956)

This classic paper was based on a lecture George Miller gave in 1955. It explains his thoughts on the limits of human information processing and the mnemonic techniques we can use to overcome them.

There are several .html and .doc transcriptions of the text available online. Scans of the original text as it appeared in the Psychological Review can also be found here, here, here, here, here, and here.

The Rubbery Transmission Rate

Miller opens the paper by considering a number of experiments that suggest that people can distinguish about four to ten objects when they only vary on a single dimension. He reports, for instance, on experiments with sounds that differ in volume, chips that differ in color, and glasses of water that differ in salt content.

Miller's Figures 1, 2, 3, and 4 (pp. 83, 83, 85, and 85)

These results suggest a deeply rooted human trait, he submits:
There seems to be some limitation built into us either by learning or by the design of our nervous systems, a limit that keeps our channel capacities in this general range. On the basis of the present evidence it seems safe to say that we possess a finite and rather small capacity for making such unidimensional judgments and that this capacity does not vary a great deal from one simple sensory attribute to another. (p. 86)
The problem with this claim is that the actual information content differs hugely depending on what kinds of items you remember: Five English words add up to 50 bits, seven digits amount to 23 bits, and, naturally, eight binary digits are 8 bits. So something is clearly going wrong with this hypothesis:
For example, decimal digits are worth 3.3 bits apiece. We can recall about seven of them, for a total of 23 bits of information. Isolated English words are worth about 10 bits apiece. If the total amount of information is to remain constant at 23 bits, then we should be able to remember only two or three words chosen at random. (p. 91)
But this uniformity is of course not what we observe.

Dits and Dots

To work around the aporia, Miller introduces an ad hoc concept:
In order to capture this distinction in somewhat picturesque terms, I have fallen into the custom of distinguishing between bits of information and chunks of information. Then I can say that the number of bits of information is constant for absolute judgment and the number of chunks of information is constant for immediate memory. The span of immediate memory seems to be almost independent of the number of bits per chunk, at least over the range that has been examined to date. (pp. 92–93)
For example:
A man just beginning to learn radio-telegraphic code hears each dit and dah as a separate chunk. Soon he is able to organize these sounds into letters and then he can deal with the letters as chunks. Then the letters organize themselves as words, which are still larger chunks, and he begins to hear whole phrases. … In the terms I am proposing to use, the operator learns to increase the bits per chunk. (p. 93)

Something for Nothing

Miller goes on to report on an experiment carried out by someone named Sidney Smith (not included in the bibliography). By teaching himself to translate binary sequences into integers, Smith managed to push his memorization span up to about 40 binary digits. Miller comments:
It is a little dramatic to watch a person get 40 binary digits in a row and then repeat them back without error. However, if you think of this merely as a mnemonic trick for extending the memory span, you will miss the more important point that is implicit in nearly all such mnemonic devices. The point is that recoding is an extremely powerful weapon for increasing the amount of information that we can deal with. (pp. 94–95)
That's clearly true, but Miller gives no hint as to what the relationship is between this chunking technique and the information-theoretical concepts with which he started.
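
To see how powerful the trick is, here is a toy version of the kind of recoding Smith used (the group size of three and the Python rendering are my own choices for illustration, not Miller's or Smith's):

```python
import random

# Recode a random binary string into octal chunks: the number of items to hold
# in memory drops by a factor of three, while the bits per chunk triple.
random.seed(0)
bits = "".join(random.choice("01") for _ in range(21))
chunks = [bits[i:i + 3] for i in range(0, len(bits), 3)]
octal = "".join(str(int(chunk, 2)) for chunk in chunks)

print(bits)   # 21 binary symbols to memorize
print(octal)  # the same information as 7 octal symbols
```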

Mathematically speaking, an encoding corresponds to a probability distribution (via its codeword lengths), and a recoding is just a different probability distribution. Learning something does not magically increase your probability budget; the only way you can make some codewords shorter is to make others longer. Simply mapping random binary expansions to equally random decimal expansions should not make a difference. So it's hard to see what the connection between bits and chunks should be.

Monday, March 2, 2015

Fano: The Transmission of Information (1949)

There are three different coding methods associated with Claude Shannon's name:
  1. Shannon coding;
  2. Shannon-Fano coding;
  3. Shannon-Fano-Elias coding.
These are not the same. Even more confusingly, Shannon-Fano-Elias coding is in fact much closer in kind to Shannon coding than it is to Shannon-Fano coding. Moreover, neither Fano nor Elias published any conventional research articles on the topic, and Elias may not even have invented Elias coding. I'll try to sort out these issues now.

Downwards Interval Approximations

Shannon coding is the name for the data compression algorithm that Shannon exhibited as his second and more constructive proof of the source coding theorem (his Theorem 9, page 17 in the Bell Labs transcription). It derives the code words from binary approximations to cumulative probabilities.

The process is the following:
  1. Sort the event probabilities so that $p_1 \geq p_2 \geq \ldots \geq p_n$;
  2. Let $F_m = \sum_{i=1}^{m-1}p_i$ be the cumulative probability of the most probable events (up to, but not including, the $m$th).
  3. Let $K_m=\lceil{-\log_2{p_m}}\rceil$.
  4. Let the code for event number $m$ be the first $K_m$ binary digits of $F_m$ (i.e., round $F_m$ down to the nearest multiple of $2^{-K_m}$).
As an example, with the probabilities
$$
p = (.4, .3, .2, .1),
$$
Shannon coding produces the codewords
$$
(00, 01, 101, 1110).
$$
Shannon coding only produces a prefix-free code when the probabilities are in fact sorted. If we think of the probabilities $p_1,p_2,\ldots,p_n$ as a stack of intervals filling out the space between 0 and 1, then the rounding-off corresponds to approximating $[F_m,F_{m+1}]$ by an interval whose left-hand boundary is a binary number no larger than $F_m$, and whose right-hand boundary is a binary number of the same precision which lies below $F_{m+1}$. If wider intervals follow narrower ones, the corresponding increase in coarseness can cause overlaps between the approximations.
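
A minimal sketch of the procedure in Python (my own rendering, not Shannon's) reproduces these codewords:

```python
from math import ceil, log2

def shannon_code(probs):
    """Shannon coding: codeword m is the first ceil(-log2 p_m) binary digits
    of the cumulative probability F_m (probabilities sorted in descending order)."""
    codes = []
    F = 0.0
    for p in probs:
        K = ceil(-log2(p))
        bits = int(F * 2 ** K)              # round F down to a multiple of 2^-K
        codes.append(format(bits, "b").zfill(K))
        F += p
    return codes

print(shannon_code([0.4, 0.3, 0.2, 0.1]))   # ['00', '01', '101', '1110']
```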

Inner Interval Approximations

Shannon-Fano-Elias coding is a variation on this theme which does not require the probabilities to be sorted. According to Cover and Thomas' textbook (Chapter 5), it was developed by Peter Elias but never published. However, another textbook claims that Elias denied this, and that the idea perhaps came from a talk Shannon gave at MIT.

At any rate, the idea behind the method is the following:
  1. Let $F^*_m = \sum_{i=1}^{m-1}p_i+\frac{1}{2}p_m$ be the point halfway between $F_m$ and $F_{m+1}$.
  2. Let $K_m=\lceil{-\log_2{p_m}}\rceil + 1$.
  3. Let the code for event number $m$ be the first $K_m$ binary digits of $F^*_m$.
Since these codewords are one bit longer than the ones used in Shannon coding, we can dispense with the requirement of using sorted probabilities. Instead, we use a binary fraction that is guaranteed to lie completely inside the interval between $F_m$ and $F_{m+1}$, even if the binary expansion continues with $1111\ldots$ after our selected truncation point.

With the example above, $(.4, .3, .2, .1)$, this gives the codewords
$$
(001, 100, 1100, 11110).
$$
These are exactly one bit longer than the Shannon codewords.
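
Relative to the Shannon-coding sketch above, the change is tiny: one extra bit per codeword, and the interval midpoint in place of its left endpoint.

```python
from math import ceil, log2

def sfe_code(probs):
    """Shannon-Fano-Elias coding: take the first ceil(-log2 p_m) + 1 binary
    digits of the midpoint F*_m = F_m + p_m/2; no sorting is required."""
    codes = []
    F = 0.0
    for p in probs:
        K = ceil(-log2(p)) + 1
        bits = int((F + p / 2) * 2 ** K)    # truncate the midpoint to K digits
        codes.append(format(bits, "b").zfill(K))
        F += p
    return codes

print(sfe_code([0.4, 0.3, 0.2, 0.1]))       # ['001', '100', '1100', '11110']
```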

Recursive Batch Splitting

Shannon-Fano coding is a different approach to the problem. Instead of using binary approximation, it uses a recursive, top-down method to construct a binary tree:
  1. If there is only one event on the list, return the empty codeword;
  2. Otherwise, sort the event probabilities so that $p_1 \geq p_2 \geq \ldots \geq p_n$;
  3. Of the $n-1$ possible ways of splitting this list up into a head and a tail, select the one that makes the two lists most equal in probability.
  4. Recursively use Shannon-Fano coding to assign codewords to each of those two lists;
  5. Prepend a 0 to all the codewords in the first list and a 1 to the codewords on the second.
For the example above, this gives
$$
(0, 10, 110, 111).
$$
Notice that the codeword lengths here differ from the ones produced by the Shannon code.
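
A recursive sketch of the splitting procedure (again my own rendering) reproduces these codewords:

```python
def shannon_fano(probs):
    """Shannon-Fano coding: recursively split the sorted probability list into
    two parts of nearly equal total probability, prefixing 0 and 1."""
    probs = sorted(probs, reverse=True)

    def build(items):
        if len(items) == 1:
            return [""]
        total = sum(items)
        # choose the split point that makes the two halves most nearly equal
        split = min(range(1, len(items)),
                    key=lambda k: abs(2 * sum(items[:k]) - total))
        return (["0" + c for c in build(items[:split])] +
                ["1" + c for c in build(items[split:])])

    return build(probs)

print(shannon_fano([0.4, 0.3, 0.2, 0.1]))   # ['0', '10', '110', '111']
```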

Shannon-Fano coding was invented by Robert Fano at some point in the 1940s and described in his 1949 technical report "Transmission of Information" (published by MIT). This report is hard to come by, but an awkwardly transcribed .pdf without pictures is available here.

A number of excerpts and scanned figures from the report are available in these teaching materials. I'll just paste the three relevant figures here so that they don't get lost in the desert of the internet:


Saturday, July 26, 2014

Wald: Sequential Analysis (1947)

I've always thought that Shannon's insights in the 1940s papers seemed like pure magic, but this book suggests that parts of his way of thinking were already in the air at the time: From his own frequentist perspective, Wald comes oddly close to defining a version of information theory.

The central question of the book is:
How can we devise decision procedures that map observations into {Accept, Reject, Continue} in such a way that (1) the probability of wrongly choosing Accept or Reject is low, and (2) the expected number of Continue decisions is low?
Throughout the book, the answer that he proposes is to use likelihood ratio tests. This puts him strangely close to the Bayesian tradition, including for instance Chapter 5 of Jeffreys' Theory of Probability.
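
For the binomial case, a minimal sketch of such a procedure, in the spirit of Wald's sequential probability ratio test and with the standard threshold approximations, might look as follows (the numbers are illustrative, not Wald's):

```python
import random
from math import log

def sequential_binomial_test(observations, p0, p1, alpha=0.05, beta=0.05):
    """Accumulate the log-likelihood ratio of H1 (p = p1) against H0 (p = p0)
    and stop as soon as it crosses one of the two thresholds."""
    lower = log(beta / (1 - alpha))      # accept H0 below this
    upper = log((1 - beta) / alpha)      # reject H0 above this
    llr = 0.0
    n = 0
    for n, success in enumerate(observations, start=1):
        llr += log(p1 / p0) if success else log((1 - p1) / (1 - p0))
        if llr <= lower:
            return "Accept H0", n
        if llr >= upper:
            return "Reject H0", n
    return "Continue", n

random.seed(0)
coin_flips = (random.random() < 0.3 for _ in range(10_000))   # true p is 0.3
print(sequential_binomial_test(coin_flips, p0=0.5, p1=0.3))
```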

A sequential binomial test ending in rejection (p. 94)

In particular, the coin flipping example that Jeffreys considers in his Chapter 5.1 is very close in spirit to the sequential binomial test that Wald considers in Chapter 5 of Sequential Analysis. However, some differences are:
  • Wald compares two given parameter values p0 and p1, while Jeffreys compares a model with a free parameter to one with a fixed value for that parameter.
  • Jeffreys assigns prior probabilities to everything; but Wald only uses the likelihoods given the two parameters. From his perspective, the statistical test will thus have different characteristics depending on what the underlying situation is, and he performs no averaging over these values.
This last point also means that Wald is barred from actually inventing the notion of mutual information, although he comes very close. Since he cannot take a single average over the log-likelihood ratio, he cannot compute any single statistic, but always has to bet on two horses simultaneously.

Thursday, June 12, 2014

Kullback and Leibler: "On Information and Sufficiency" (1951)

This paper is mostly known for introducing the concept of divergence into statistics. It's quite notation-heavy but actually quite straightforward if you've already solved exercise 3.9 in Cover and Thomas' textbook. Everything is about log-likelihood ratios that fly off into infinity.

A good deal of the paper is concerned with the question of sufficient statistics. An interesting illustration in the last section (§6) concerns the problem of how to choose the best non-sufficient statistic when we must use one (for some extrinsic reason).

More specifically, the set up is the following: You are observing samples from a categorical distribution with four different categories, and you want to distinguish the two following hypotheses:
x            1     2     3     4
Pr(x | A)   .25   .25   .25   .25
Pr(x | B)   .50   .30   .10   .10

However, you are not actually allowed to see the original data set; instead, you have to lump the categories into two groups, and you will then be told which group the observation fell into. (This corresponds to using an indicator function as your statistic.)

In order to solve this exercise, they quantify the informativeness of the statistic in terms of the symmetrized divergence between the two hypotheses:
J(A, B)  ==  D(A || B) + D(B || A).
I guess a more cautious measure would have been the minimum of the two, since this would provide a lower bound on the growth rate of the likelihood ratio whichever of the two hypotheses were true.

Using base 10 logarithms, the divergence from the uniform to the skewed distribution (A to B) is .1039, and the divergence from the skewed to the uniform (B to A) is .0947. The symmetrized divergence J(A, B) is thus about .1986, and the minimum is .0947.
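
These numbers are easy to check (a quick recomputation, assuming base-10 logarithms as above):

```python
from math import log10

A = [0.25, 0.25, 0.25, 0.25]
B = [0.50, 0.30, 0.10, 0.10]

def D(p, q):
    """Kullback-Leibler divergence D(p || q), with base-10 logarithms."""
    return sum(pi * log10(pi / qi) for pi, qi in zip(p, q))

print(round(D(A, B), 4), round(D(B, A), 4), round(D(A, B) + D(B, A), 4))
# prints 0.1039 0.0947 0.1986
```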

Grouping the categories in various ways, we can also find the following divergences:

Grouping            D(A || B)   D(B || A)   J(A, B)   min
{}, {1, 2, 3, 4}     0           0           0         0
{1}, {2, 3, 4}       .5068       .0635       .1193     .0635
{2}, {1, 3, 4}       .0027       .0028       .0055     .0027
{3}, {1, 2, 4}       .0401       .0315       .0716     .0315
{4}, {1, 2, 3}       .0401       .0315       .0716     .0315
{1, 2}, {3, 4}       .0969       .0837       .1806     .0837
{1, 3}, {2, 4}       .0089       .0087       .0176     .0087
{1, 4}, {2, 3}       .0089       .0087       .0176     .0087

In the paper, Kullback and Leibler report that the best grouping is {{1}, {2, 3, 4}}, which indicates that they might have used D(A || B) as their objective function. This is a bit strange, since everything they say in the text indicates that they are thinking exclusively in terms of the symmetrized divergence J. I don't know what to conclude from that.

Wednesday, May 28, 2014

Drum: "Change, Meaning, and Information" (1957)

This 1957 article is one of those countless papers by people in linguistics trying helplessly to say something non-trivial about the relationship between information theory and linguistics. The central claim of the paper is that there are several kinds of predictability or redundancy at play in language, and that "meaning" should be seen as one of them.

More precisely, certain sentences are predictable on account of their content, so that
the "meaning"—the "what we know about it"–in a message has some direct effect upon the amount of information transmitted. (p. 166)
However, this notion is dangerously confused and conflated with the idea that more predictability equals more meaning at several points in the paper.

At any rate, Drum places semantics among several "levels" of redundancy, including a historically interesting division between "formal" and "contextual" syntax:
In the sentence, "The chair is __________," many words could be correctly used in terms of grammar; but in terms of the context of the sentence, only a very few apply. "The chair is clock" is correct grammatically, though it is nonsense in even the broadest contexts. (p. 167)
Further,
A person knowing English grammar but not word meanings, might very well write "The chair is clock," but would never do so if he knew the meanings involved. (The relationship of syntactic to contextual redundancy is much that of validity in logic to truth, incidentally). (p. 168)
A strange counterfactual, but quite revealing.

Herdan: The Advanced Theory of Language as Choice and Chance (1966)

This really odd book is based to some extent on the earlier Language as Choice and Chance (1956), but contains, according to the author, a lot of new material. It discusses a variety of topics in language statistics, often in a rather unsystematic and bizarrely directionless way.

The part about "linguistic duality" (Part IV) is particularly confusing and strange, and I'm not quite sure what to make of it. Herdan seems to want to make some great cosmic connection between quantum physics, natural language semantics, and propositional logic.

But leaving that aside, I really wanted to quote it because he so clearly expresses the deterministic philosophy of language — that speech isn't a random phenomenon, but rather has deep roots in free will.

"Human Willfulness"

He thus explains that language has one region which is outside the control of the speaker, but that once we are familiar with this set of restrictions, we can express ourselves within its bounds:
It leaves the individual free to exercise his choice in the remaining features of language, and insofar language is free. The determination of the extent to which the speaker is bound by the linguistic code he uses, and conversely, the extent to which he is free, and can be original, this is the essence of what I call quantitative linguistics. (pp. 5–6)
Similarly, he comments on a study comparing the letter distribution in two texts as follows:
There can be no doubt about the possibility of two distributions of this kind, in any language, being significantly different, if only for the simple reason that the laws of language are always subject, to some extent at least, to human willfulness or choice. By deliberately using words in one text, which happen to be of rather singular morphological structure, it may well be possible to achieve a significant difference of letter frequencies in the two texts. (p. 59)
And finally, in the chapter on the statistical analysis of style, he goes on record completely:
The deterministic view of language regards language as the deliberate choice of such linguistic units as are required for expressing the idea one has in mind. This may be said to be a definition in accordance with current views. Introspective analysis of linguistic expression would seem to show it is a deterministic process, no part of which is left to chance. A possible exception seems to be that we often use a word or expression because it 'happened' to come into our mind. But the fact that memory may have features of accidental happenings does not mean that our use of linguistic forms is of the same character. Supposing a word just happened to come into our mind while we are in the process of writing, we are still free to use it or not, and we shall do one or the other according to what is needed for expressing what we have in mind. It would seem that the cause and effect principle of physical nature has its parallel in the 'reason and consequence' or 'motive and action' principle of psychological nature, part of which is the linguistic process of giving expression to thought. Our motive for using a particular expression is that it is suited better than any other we could think of for expressing our thought. (p. 70)
He then goes on to say that "style" is a matter of self-imposed constraints, so that we are in fact a bit less free to choose our words than it appears, although we ourselves come up with the restrictions.

"Grammar Load"

In Chapter 3.3, he expands these ideas about determinism and free will in language by suggesting that we can quantify the weight of the grammatical restrictions of a language in terms of a "grammar load" statistic. He suggests that this grammar load can be assessed by counting the number of word forms per token in a sample of text from the language (p. 49). He does discuss entropy later in the book (Part III(C)), but doesn't make the connection to redundancy here.

Inspecting an English corpus of 78,633 tokens, he thus finds 53,102 different forms and concludes that English has a "grammar load" of
53,102 / 78,633 = 67.53%.
Implicit in this computation is the idea that the number of new word forms grows linearly with the number of tokens you inspect, i.e., that the spawn rate of new forms is constant rather than sublinear. In fact, vocabulary size does seem to grow sublinearly as a function of corpus size; the spawn rate for new word forms is thus not constant in English text.
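
The figure below plots this growth for the Brown Corpus; a minimal sketch of the computation behind such a plot (assuming NLTK with the Brown Corpus downloaded, and lowercasing tokens, which is my own choice) is:

```python
from nltk.corpus import brown   # requires nltk.download('brown')

tokens = [w.lower() for w in brown.words()[:100_000]]
seen = set()
for i, w in enumerate(tokens, start=1):
    seen.add(w)
    if i % 20_000 == 0:
        # Herdan-style "grammar load" for the first i tokens: forms per token
        print(i, len(seen), f"{len(seen) / i:.2%}")
```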

Word forms in the first N tokens of the Brown Corpus, for N = 0 … 100,000.

However, no simple sublinear growth curve seems to fit the data very well (as far as I can see).

Monday, May 19, 2014

Stockwell: "The Transformational Model of Generative or Predictive Grammar" (1963)

This essay by Robert Stockwell is pretty much a rehash of Chomsky's ideas from Syntactic Structures, and as such quite boringly predictable. (Stockwell also thanks Chomsky for critical comments in the acknowledgments.) However, it does touch on some interesting issues on the last couple of pages, so I'll give a couple of excerpts here.

The essay is a part of an anthology called Natural Language and the Computer, which (in addition to being quite wonderfully techno-camp) is one of the most boldly and beautifully typeset books I've seen in a while.

The title page, with gargantuan print and the editor's name in a spacy font.

Some aspects of its graphical design have an undeniable 1950s feel; others look like a kind of 1960s Space Odyssey futurism; and others again look like they were taken straight out of Blade Runner. It's all quite remarkable.

Title page of Part 1, with an unmistakably 1950s italic font and an "etched" figure 1.

And then of course there's the always charming prospect of reliving a lecture taking stock of the computer revolution anno 1960. One of the other books in the same series is even called – I kid you not – The Foundations of Future Electronics. How can you not love a 1961 book with that title?

At any rate, the essay that I'm interested in plods through the familiar details of Chomskyan grammar, pushing in particular the hard sell of transformational grammar over "immediate constituent" grammar (i.e., phrase structure grammars without any postprocessing).

The last section of the essay is called "Grammaticalness and Choice" and gets a bit into the issue of how sentences are selected, and what goes on in the head of the alleged "ideal speaker" of Chomskyan linguistics. This part contains a number of interesting quotes, and I'll provide some generous extracts below.

"Forced to Operate Empirically"

The first notion taken up in the section is that of grammaticality itself, which it seems that he sees some problems making empirically precise:
Presumably the notion of "grammatical sentence" is characterized by the grammar itself, since in principle we formulate our rules in such a way as to generate only and always such sentences. It is a question of some interest whether there is a possibility of characterizing this notion independently of the grammar. It seems extremely unlikely that there is, and we will be forced to operate empirically with the machinery of the grammar, treating each sentence that it generates as a hypothesis to be tested for grammaticalness against the reaction of native speakers. For each sentence rejected, we either revise the grammar to exclude the sentence (if we believe the rejection is on proper grounds–that is, not motivated by school grammar and the like), or we revise the grammar to generate the sentence in some special status (i.e., as only partially well formed). Each sentence accepted is, of course, a confirmation of the validity of the rules up to that point. (p. 43)
I suppose what Stockwell has in mind here is that there might in principle exist some kind of objective test of grammaticality which could relieve us of having to trust laypeople to know the difference between "ungrammatical" and "nonsensical." (If you'll allow me a bit of self-citation, I've written a short paper on the idea of having such a distinction.)

Today, linguists might fantasize about such a test taking the form of an fMRI scan; in the 1980s, they would have imagined it as an EEG; and in the 1950s, a polygraph reading. But in the absence of such a test, we are forced to use live and conscious people even though
Informant reaction is difficult to handle, because such reactions involve much more than merely the question of grammaticalness. (p. 43)
We thus only have indirect access to the alleged grammatical engine of the brain.

"The Rest Takes Care of Itself"

After briefly considering a couple of borderline grammatical cases, Stockwell continues:
One might consider the utilization of grammar by the speaker as follows. The essence of meaning is choice; every time an element is chosen in proceeding through the rules, that choice is either obligatory (in which case it was not really a choice at all, since there were no alternatives), or it is optional (in which case the choice represented simultaneously both the positive decision and the rejection of all alternatives–the meaning of the choice inheres in the [sic] constrastive value of the chosen element as compared with all the possible choices that were rejected). (p. 44)
Oddly enough, Stockwell's meditation on the actual role and implementation of Chomskyan grammar in a person's behavior brings him around to confirming not only de Saussure's picture of meaning, but also Shannon's. I wonder whether he is aware of the implications of this.

He then goes on to consider an example:
Thus these are the choices involved in a simple sentence such as
 Did the boy leave.
NP + VP                    Obligatory
D + N                      Obligatory
D == the                   Optional
N == boy                   Optional
aux + VP1                  Obligatory
aux == past                Optional
VP1 == Vi                  Optional
Vi == leave                Optional
Tintrg                     Optional
Inversion of Te            Obligatory
Empty carrier for past     Obligatory
Rising intonation          Obligatory
Of the twelve choices, half are obligatory–either initiating the derivation, or following out obligatory consequences of optional choices. The additional rules of the phonetic component are nearly all obligatory. To include these would increase the obligatory choices to about twice the number of optional choices. In fact it is quite probable that in real discourse even the element the is obligatory (that is, the choice of the versus a seems quite predictable in a larger context). This would leave us with only five meaning-carrying (optional) choices. Everything else that goes into making up the sentence is in a valid sense perfectly mechanical, perfectly automatic. It can be argued that a grammar must maximize the obligatory elements and cut the optional choice to the barest minimum in order to get any reasonable understanding of how the human brain is capable of following complex discourse at all. That is, the hearer's attention is focused on matching, with his own generating machinery, the sequence of optional choices; since he has identical obligatory machinery, the rest takes care of itself. In this way, the same grammatical model accounts for both encoding and decoding. We do not need separate and distinct analogs for both sending and receiving messages. (pp. 44–45)
Again, this is oddly similar to the kind of generative model one would employ in information theory, and the notion of having a sparse language to minimize cognitive effort here takes the place of error-correction. But presumably, the philosophical difference is whether we need a source model (of the "optional choices") or only a channel model (of the "perfectly mechanical, perfectly automatic" choices).

"From Which He Knowingly Deviates"

This reading of the differences is backed up by his elaboration:
Encoding and decoding does not imply that a speaker or hearer proceeds step by step in any literal sense through the choices characterized by the grammar in order to produce or understand sentences. The capacities characterized by the grammar are but one contributing factor of undetermined extent in the total performance of the user of language. The grammar enumerates only well-formed sentences and deviant sentences, which, recognized as ranging from slightly to extremely deviant by the language user, are interpreted somehow by comparison with well-formed ones. The grammar enumerates sentences at random; it does not select, as the user does, just those sentences appropriate to a context. The grammar clacks relentlessly through the possible choices; the user starts, restarts, jumps the grammatical traces, trails off. A generative grammar is not a speaker-writer analog. It is a logical analog of the regularities to which the language user conforms or from which he knowingly deviates. (p. 45)
I take it that this entails that grammars are essentially and necessarily logical in nature, since their purpose is to describe the set of available sentences of the language rather than to predict their occurrence. From such a perspective, a probabilistic context-free grammar would be something of an oxymoron.

A logical and a probabilistic conception of grammar.
 

"A Natural Extension of Scholarly Grammar"

Again in perfectly orthodox fashion, Stockwell finally tips his hat at the impossibility of formulating discovery procedures and makes a strange claim about the machine-unfriendly nature of transformational grammars:
Although the title of this book suggests the machine processing of natural-language data, it should not be assumed that the transformational model of the structure of language is in any way geared to machines of any kind, either historically or in current development. On the contrary, it is a natural extension of traditional scholarly grammar, an effort to make explicit the regularities to which speakers of a language conform, which has been the focus of grammatical studies for over 2,500 years. The effort to formulate discovery procedures for the systematic inference of grammatical structure is quite recent; few if any transformationalists believe such a goal has any possibility of success–at least, not until much more is known about the kinds of regularities which grammarians seek to discover and to formulate in explicit terms. (p. 45)
That seems a bit odd in the light of, e.g., Victor Yngve's recollection of how people were swayed by Chomsky's grammatical analyses because they "read like computer programs."

Saturday, March 1, 2014

Attneave: Applications of Information Theory to Psychology (1959)

Fred Attneave's book on information theory and psychology is a sober and careful overview of the various ways in which information theory had been applied to psychology (by people like George Miller) by 1959.

Attneave explicitly tries to stay clear of the information theory craze which followed the publication of Shannon's 1948 paper:
Thus presented with a shiny new tool kit and a somewhat esoteric new vocabulary to go with it, more than a few psychologists reacted with an excess of enthusiasm. During the early fifties some of the attempts to apply informational techniques to psychological problems were successful and illuminating, some were pointless, and some were downright bizarre. At present two generalizations may be stated with considerable confidence:
(1) Information theory is not going to provide a ready-made solution to all psychological problems; (2) Employed with intelligence, flexibility, and critical insight, information theory can have great value both in the formulation of certain psychological problems and in the analysis of certain psychological data. (pp. v–vi)
Or in other words: Information theory can provide the descriptive statistics, but there is no hiding from the fact that you and you alone are responsible for your model.

Language Only, Please

Chapter 2 of the book is about entropy rates, and about the entropy of English in particular. Attneave talks about various estimation methods, and he discusses Shannon's guessing game and a couple of related studies.

As he sums up the various mathematical estimation tricks, he notes that predictions from statistical tables tend to be more reliable than predictions from human subjects with respect to the first couple of letters of a text. This means that estimates from human predictions will tend to overestimate the unpredictability of the first few letters of a string.
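
For concreteness, here is a minimal sketch, not Attneave's or Shannon's actual procedure, of a table-based estimate of the conditional letter entropy from plain maximum-likelihood n-gram counts:

```python
from collections import Counter
from math import log2

def conditional_entropy(text, n):
    """Estimate H(X_n | X_1 ... X_{n-1}) from overlapping character n-grams."""
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    contexts = Counter()
    for gram, count in ngrams.items():
        contexts[gram[:-1]] += count
    total = sum(ngrams.values())
    return -sum((count / total) * log2(count / contexts[gram[:-1]])
                for gram, count in ngrams.items())

sample = "the quick brown fox jumps over the lazy dog " * 50
for n in (1, 2, 3):
    print(n, round(conditional_entropy(sample, n), 2))   # decreases with n
```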

He then comments:
What we are concerned with above is the obvious possibility that calculated values (or rather, brackets) of HN [= the entropy of letter N given letter 1 through N – 1] will be too high because of the subject's incomplete appreciation of statistical regularities which are objectively present. On the other hand, there is the less obvious possibility that a subject's guesses may, in a certain sense, be too good. Shannon's intent is presumably to study statistical restraints which pertain to language. But a subject given a long sequence of letters which he has probably never encountered before, in that exact pattern, may be expected to base his prediction of the next letter not only upon language statistics, but also upon his general knowledge [p. 40] of the world to which language refers. A possible reply to this criticism is that all but the lowest orders of sequential dependency in language are in any case attributable to natural connections among the referents of words, and that it is entirely legitimate for a human predictor to take advantage of such natural connections to estimate transitional probabilities of language, even when no empirical frequencies corresponding to the probabilities exist. It is nevertheless important to realize that a human predictor is conceivably superior to a hypothetical "ideal predictor" who knows none of the connections between words and their referents, but who (with unlimited computational facilities) has analyzed all the English ever written and discovered all the statistical regularities residing therein. (pp. 39–40; emphases in original)
I'm not sure that was "Shannon's intent." Attneave seems to rely crucially on an objective interpretation of probability as well as an a priori belief in language as an autonomous object.

Just like Laplace's philosophical commitments became obvious when he started talking in hypothetical terms, it is also the "ideal predictor" in this quote which reveals the philosophy of language that informs Attneave's perspective.

Wednesday, March 20, 2013

Bernardo: "Expected Information as Expected Utility" (1979)

Following a suggestion by Dennis Lindley (1956), this paper suggests that the appropriate measure of the expected value of an experiment X with respect to a target parameter Θ is the mutual information between the two, that is,
I(X;Θ)  =  H(Θ) – H(Θ | X).
Bernardo calls this the "expected useful information" contained in the experiment (§2).

Proper Scoring Rules

The paper also contains a uniqueness theorem about so-called proper scoring rules (§3–4).

A "scoring rule" is a scheme for rewarding an agent (the "scientist") who reports probability distributions to you. It may depend on the distribution and on the actual observed outcome. For instance, a feasible rule is to pay the scientist p(x) dollars for the density function p if in the event that x occurred.

That function, however, would under many common rationality assumptions give the scientist an incentive to misreport his or her actual probability estimates. We consequently define a "proper" scoring rule as one that is hack-proof in the sense that the best course of action under that rule is to report your actual probability estimates.
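
A small numerical check of this point (my own toy example, taking the log score as a reward):

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])                  # the scientist's actual beliefs

reports = {
    "truthful":      np.array([0.6, 0.3, 0.1]),
    "overconfident": np.array([0.9, 0.05, 0.05]),
    "flat":          np.array([1/3, 1/3, 1/3]),
}

for name, score in [("pay q(x)", lambda q: q), ("pay log q(x)", np.log)]:
    expected = {label: float(p @ score(q)) for label, q in reports.items()}
    print(name, "-> best report:", max(expected, key=expected.get))
# The linear rule is gamed by exaggerating the most probable outcome;
# the log rule is maximized by reporting the actual beliefs.
```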

An example of a proper scoring rule is –log p(x), but apparently, there are others. Bernardo refers to Robert Buehler and I. J. Good's papers in Foundations of Statistical Inference (1971) for further examples. Unfortunately, that book seems to be a bit difficult to get a hold of.

Nice and Proper Scoring Rules

The theorem that Bernardo proves is the following: The only proper scoring rules which are both smooth and local (as defined below) are functions of the logarithmic form
u(p,x)  =  a log p(x) + b(x)
where a is a constant, and b is a real-valued function on the sample space.

As a corollary, the scientist's optimal expected payoff is B – aH(X), where B is the average value of the function b under the scientist's subjective probabilities. It also follows that the optimal course of action for the scientist under this scheme will be to provide the maximum amount of information that is consistent with his or her beliefs.

So what does "smooth" and "local" mean?

Bernardo doesn't define "smooth," but usually in real analysis, a smooth function is one that can be differentiated indefinitely often. However, Bernardo refers to the physics textbook by Harold and Bertha Jeffreys (1972) for a definition. I don't know if they use the word the same way.

A scoring rule u is "local" if the reward that the scientist receives in the event of x only depends on x and on the probability that he or she assigned to x. In other words, a local scoring rule u can be rewritten in terms of a function v whose first argument is a probability rather than a probability distribution:
u(p,x)  =  v(w,x),
where w = q(x) is the reported probability of x (which does not necessarily equal the actual subjective probability p(x)).

How To Prove This Theorem

I haven't thought too hard about the proof, but here's the gist that I got out of it: First, you use the method of Lagrange multipliers to show that the derivative with respect to w(x) of the functional
∫ v(w(x), x) p(x) dx  –  λ ( ∫ w(x) dx – 1 )
is zero for every x when the reported function q is optimal. You then conclude that q = p fulfills this condition, since u was assumed to be a proper scoring rule. You then have a differential equation on your hands, and you go on to discover that its only solutions are of the postulated form.
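
Spelled out slightly (my reconstruction of the standard argument, not Bernardo's own notation), the stationarity condition says
$$
p(x)\,\frac{\partial v}{\partial w}\big(q(x), x\big) = \lambda \quad\text{for every } x,
$$
and properness forces $q = p$, so that $\partial v/\partial w\,(p, x) = \lambda / p$. Read as a differential equation in the first argument, this gives $v(p, x) = \lambda \log p + b(x)$, i.e., exactly the logarithmic form above with $a = \lambda$.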

Thursday, March 14, 2013

Sereno, O'Donnell, and Rayner: "Eye Movements and Lexical Ambiguity Resolution" (2006)

In the literature on word comprehension, some studies have found that people usually take quite a long time looking at an ambiguous word if it occurs in a context that strongly favors one of its less frequent meanings.

This paper raises the issue of whether this is mainly because of a clash between the high contextual fit and the low frequency, or mainly because of the low frequency itself.

The Needle-in-a-Haystack Effect

A context preceding a word can either be neutral or biased, and a meaning of an ambiguous word can either be dominant (more frequent) or subordinate (less frequent). When a biased context favors the subordinate meaning, it is called a subordinate-biasing context.

The subordinate-bias effect is the phenomenon that people spend more time looking at an ambiguous word in a subordinate-biasing context than they take looking at an unambiguous word in the same context — given that the two words have the same frequency.

For instance, the word port can mean either "harbor" or "sweet wine," but the former is much more frequent than the latter. In this case, the subordinate-biasing effect is that people take longer to read the sentence
  • I decided to drink a glass of port
than the sentence
  • I decided to drink a glass of beer
This is true even though the words port and beer have almost equal frequencies (in the BNC, there are 3691 vs. 3179 occurrences of port vs. beer, respectively).

Balanced Meaning Frequencies = Balanced Reading Time

The question is whether these absolute word frequencies are the right thing to count, and Sereno, O'Donnell, and Rayner argue that they aren't. Instead, they suggest that it would be more fair to compare the sentence
  • I decided to drink a glass of port
to the sentence
  • I decided to drink a glass of rum
This is because port occurs in the meaning "sweet wine" approximately as often as the word rum occurs in absolute terms — i.e., much more rarely than beer. (A casual inspection of the frequencies of the phrases drink port/rum and a glass of port/rum seems to confirm the close match.)

What the Measurements Say

This means that you get three relevant conditions:
  1. one in which the target word is ambiguous, and in which its intended meaning is not the most frequent one;
  2. one in which the target word has the same absolute frequency as the ambiguous word;
  3. and one in which the target word has the same absolute frequency as the intended meaning of the ambiguous word.
Each of these is then associated with an average reading time:


It's not like the effect is overwhelming, but here's what you see: The easiest thing to read is a high-frequency word with only a single meaning (middle row); the most difficult thing to read is a low-frequency word with only a single meaning (top row).

Between these two things in terms of reading time, you find the ambiguous word whose meaning was consistent with the context, but whose absolute frequency was higher.

Why are Ambiguous Words Easier?

In the conclusion of the paper, Sereno, O'Donnell, and Rayner speculate a bit about the possible causes of this "reverse subordinate-biasing effect," but they don't seem to find an explanation they are happy about (p. 345).

It seems to me that one would have to look closer at the sentences to find the correct answer. For instance, consider the following incomplete sentence:
  • She spent hours organizing the information on the computer into a _________
If you had to bet, how much money would you put on table, paper, and graph, respectively? If you would put more money on table than on graph, that probably also means that you were already anticipating seeing the word table in its "figure" meaning when your eyes reached the blank at the end of the sentence.

If people in general have such informed expectations, then that would explain why they are faster at retrieving the correct meaning of the anticipated word than they are at comprehending an unexpected word. But checking whether this is in fact the case would require a more careful information-theoretic study of the materials used in the experiment.

Monday, December 17, 2012

Wrighton: Elementary Principles of Probability and Information (1973)

This is a strange little book. It is essentially an introduction to information theory, but with a bunch of digressions into all sorts of philosophical and methodological issues. It's not always completely clear to me where Wrighton is going with his arguments, but some of them are quite thought-provoking.

The Reshuffled Hierarchy of Science

The book begins with a discussion of what probability is, using the philosophy of Giambattista Vico as a starting point.

A core claim of Vico's philosophy is, according to Wrighton, that the sciences should not be sorted on a scale with mathematics and theoretical physics in one end, and social and human science in the other. Rather, one should categorize them according to the artificiality of their objects:
Mathematics retains a special position, since in mathematics Man creates the object of his study, which therefore he wholly understands. Likewise, Man may hope to acquire an understanding of comparable depth within the humanities; for he has created his own history, his own language and his own social environment. The Natural Sciences suffer a demotion, relative to the ideal of certainty, and revert to the status which Bacon accorded them; experiment and observation provide, and must always provide the true foundation of physics, since, as Vico puts it, Man can never wholly understand what God has created; physical theory primarily reflects, or epitomises, man's increasing control over his physical environment, rather than an independent reality. (p. 3)
A crude way of putting the same point would be that math is a human science. Whatever is conceptual, mental, or cultural falls on one end of the scale, and the study of natural phenomena falls on the other.

Probability as Deliberately Created Uncertainty

Once we have reshuffled the sciences like this, we have to decide whether we categorize probability theory as a study of "Man's creation" or of "God's creation." Here, Wrighton squarely comes down on the side of the first team:
It is sometimes said to be a mere empirical fact that when a coin is tossed it comes down heads with probability one half; and that a suitably-devised machine could toss coins so that they came down heads every time. The suggestion is based on a total misconception. A coin comes down heads with probability one half because we arrange that it should do so: the empirically-verifiable consequences cannot be other than they are. If an operator, within the terms of our instructions to him, were to train himself to toss a coin so that it always came down heads, we should have to regard our instructions as misconceived, and would either have to raise the minimum angular momentum assigned or supply him with a smaller or lighter coin: it is a matter of making the task implicitly assigned to the operator sufficiently difficult. Thus we cannot think of a random event without somehow involving a human being in its generation. (p. 3; his emphasis)
So "real" probability is "artificial" probability; a random experiment is an experiment that allows us to say that something went wrong if it is predictable. Only metaphorically can we transfer this artificially created complexity to natural systems.

Points of Contact

I find this idea interesting for two reasons:

First, it turns information theory upside-down so that system complexity becomes more fundamental than probability. This is an interesting idea also championed by Edwin Jaynes, and which I have been exposed to through Flemming Topsøe.

And second, it relates the philosophical problems of probability theory to the thorny issues surrounding the notions of repetition, identity, rule-following, and induction. It is probably fair to say that one can't solve any of these problems without solving the others as well.

Thursday, May 31, 2012

Janssen: "Independent Choices and the Interpretation of IF Logic" (2002)

In this paper, Theo Janssen argues that Jaakko Hintikka and Gabriel Sandu's notion of independence-friendly logic does not adequately formalize the notion of quantifier independence, at least according to his intuitions about what independence should mean. He has essentially two arguments, of which the first one is the strongest.

Dependence Chains

The most serious problem that Janssen points out is that IF logic may require that a choice A is independent from a choice C without ruling out that there is some intermediate choice B such that A depends on B, and B depends on C.

Such cases create some quite strange examples, e.g.:
  • TRUE: ∀x: (x ≠ 2) v (∃u/x: x = u).
  • FALSE: ∀x: (x = 2) v (∃u/x: x = u).
  • TRUE: ∀x: (x ≠ 2) v (∃u/x: x ≠ u).
  • TRUE: ∀x: (x = 2) v (∃u/x: x ≠ u).
The true sentences are here true in spite of the independence between u and x, the reason being that the choice of disjunct is not independent of x. The Verifier can thus circumvent the quantifier independence. For instance, in the first sentence, he can set u := 2 and then pick the left disjunct if and only if x ≠ 2; when x = 2, the right disjunct x = u holds.

Similar examples exist where the middle term which smuggles in the dependence is not a disjunction, but another quantifier.

Naming Problems

Another problem occurs, according to Janssen, when "a variable is bound within the scope of a quantifier that binds the same variable" (p. 375). This occurs for instance in sentence like
  • ∀x∃x: R(x,x).
He claims that such sentences come about by "classically allowed" substitutions from, in this case,
  • ∀x∃y: R(x,y).
After such a substitution, indirect dependencies referring to the value of y might be lost, and an otherwise winning Verifier strategy might be broken. However, I don't know whether there would be any problem with just banning doubly-bound variables such as the x above; the double binding doesn't seem to have any necessary or positive effect.

Solutions

To avoid the problems of Hintikka's system, Janssen defines a new game with explicit extra conditions such as "The strategy does not have variables in W as arguments" and "If the values of variables in W are changed, and there is a winning choice, then the same choice is a step towards winning" (p. 382).

This solves the problem, but doesn't bring about much transparency, it seems to me. A better solution would probably be to describe the instantiation of the quantifiers and the selection of branches at the connectives as a probability distribution on a suitable power of the domain and of the branch options {L,R}. Then independence could be clearly described as statistical independence.

Such a system would require the domains to be finite, which is not good. However, within finite domains, results about logical strength of solution concepts would be easy to extract, because they would simply correspond to different constraints on the dependencies between the choices, i.e., marginal distributions. It would, in fact, allow us to quantify the amount of information that was transmitted from one choice to another by computing the mutual information between two marginal distributions.

Friday, May 4, 2012

Topsøe: "Game Theoretical optimization inspired by information theory" (2009)

I finally got around to reading Flemming's recent paper on information theory. It's a short introduction to his and Peter Harremoës' approach to the topic, which involves a game-theoretical interpretation of coding.

The Nature-Observer Game

The basic theoretical framework is the following: Nature and Observer play a zero-sum game. Nature picks a world from the set X, and Observer picks an action from the set Y. For each strategy profile (x,y), three quantities are then defined:
  • H(x), the entropy of x
  • D(x,y), the (Kullback-Leibler) divergence of x from y
  • Φ(x,y), the complexity of the pair (x,y)
The three quantities are related through the equality
Complexity = Entropy + Divergence
which also implies that the smallest complexity that Observer can achieve in a world x is the entropy H(x).
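
As a concrete instance of this identity (my gloss, using the standard coding interpretation in which the complexity Φ(x,y) is a cross-entropy; the paper's framework is more abstract):

```python
import numpy as np

def H(x):               # entropy of the world x
    return float(-(x * np.log2(x)).sum())

def D(x, y):            # Kullback-Leibler divergence of x from y
    return float((x * np.log2(x / y)).sum())

def Phi(x, y):          # complexity of the pair: here, the cross-entropy
    return float(-(x * np.log2(y)).sum())

x = np.array([0.5, 0.25, 0.25])   # world chosen by Nature
y = np.array([0.4, 0.4, 0.2])     # action chosen by Observer

print(round(Phi(x, y), 6), round(H(x) + D(x, y), 6))   # equal: Complexity = Entropy + Divergence
print(Phi(x, x), H(x))                                 # the best response attains the entropy
```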

The point of the game is for Nature to produce maximal complexity, and for Observer to produce minimal complexity. By von Neumann and Morgenstern's minimax theorem, this means that a strategy profile (x*,y*) is an equilibrium for the game if the following three quantities coincide (p. 559):
sup_x inf_y Φ(x,y)  =  Φ(x*,y*)  =  inf_y sup_x Φ(x,y)
The leftmost term here designates the optimal outcome for Nature (highest complexity given adversarial responses), while the rightmost term designates the optimal outcome for Observer (lowest complexity given adversarial responses).

Notice that inf_y Φ(x,y) = H(x), since D(x,y) is assumed to be non-negative (and zero when y is a best response to x). This means that Nature in effect can be seen as an entropy-maximizer when playing optimally. Further, Topsøe defines R(y) = sup_x Φ(x,y) to be the risk associated with a given action y, so Observer can then be described as a risk-minimizer.

Information Transmission

The paper also defines a notion of information transmission rate (p. 556), but I am not quite sure what its applications are. The idea behind the concept, in bullet-point form, is this:
  • Assume that α is a (prior) probability distribution on the set of worlds.
  • Construct an expected case scenario by summing all worlds with the probabilities of α as weights.
  • Let y be a best response to this expected case.
  • For each world x, let the surprisal of x be the divergence between x and y. This can be seen as a motivated quantitative measure of how different x is from the expected case.
  • Find the weighted average of the surprisals, using the probabilities of α as weights.
This average surprise level is the information transmission rate.

Notice that if there is a single action which is an optimal response to all worlds, then the information transmission rate is 0. This reflects the fact that the state of the world would not inform the actions of Observer at all in such a situation. He would simply use the same strategy whatever happened.

Conversely, if the information transmission rate is very high, this means that an insensitive, on-average response pattern will be costly for the Observer (in terms of complexity). If we reinterpret the action of choosing a best response to a certain world as an act of inferring the state of the world (describing it correctly), then the surprisal can further be seen as Observer's ability to distinguish states of the world. On such an interpretation, the best-response function can be seen as a signaling function in a Bayesian game.

Thursday, February 2, 2012

Cacciari and Tabossi: "The Comprehension of Idioms" (1988)

In this paper, Cristina Cacciari and Patrizia Tabossi argue that our idiomatic reading of a phrase like shoot the breeze is triggered by a "key." The process by which we understand such phrases can consequently differ depending on the position of the key.

Although they don't give any examples, they probably have strings like take the bull..., a silver spoon..., and make a fool... in mind when they talk about "keys."

Experimental Evidence

They support this theory with three priming experiments. These experiments show that:
  • When the key is highly predictive of the idiom, and it occurs early in the phrase, the idiom is quickly and automatically activated, but the literal word meanings are not.
  • When the key is more ambiguous or occurs late (or both), the opposite occurs: The idiom is not activated, but the word meanings are. However, this presupposes that the decision task that tests the activation levels is posed immediately after the stimulus.
  • If one inserts a 300 ms delay before the decision task, though, both the idiom and the word meanings become active. This seems to suggest that the initial ambiguity associated with the late/weak key is resolved after the delay.

Recognizing Idioms From "Keys"

Cacciari and Tabossi don't formalize their notion of a "key," but it seems that it might be done in statistical terms. To illustrate that, take for instance the following idiomatic sentence:
  • It is best to take the bull by the horns
This sentence contains a number of incomplete left-segments:
  1. it ...
  2. it is ...
  3. it is best ...
  4. it is best to ... (etc.)
As we consider bigger and bigger left-segments, it becomes more and more probable that these incomplete sentences are going to be completed as the full sentence it is best to take the bull by the horns. This can be estimated by looking at the number of occurrences of the segment vs. whole sentence.

By thus comparing the predictive strength of the left-segments, we can get a sense of where the uncertainty tips into certainty:


The numbers behind this graph are based on Google searches. For instance, the bar above the word bull shows the number of hits for it is best to take the bull divided by the number of hits for the whole sentence. The 'keyness' is just the difference between this statistic for the relevant word and its neighbor to the left.

As the picture clearly shows, the big jump occurs with the word bull. So we can expect an average English speaker to switch into "idiom mode" upon reaching that point in the sentence.
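
Here is roughly how such a computation can be set up; the hit counts below are made up for illustration and do not reproduce the actual search numbers:

```python
counts = {                                   # hypothetical hits per left-segment
    "it is best to take": 2_000_000,
    "it is best to take the": 1_500_000,
    "it is best to take the bull": 60_000,
    "it is best to take the bull by": 55_000,
    "it is best to take the bull by the": 54_000,
    "it is best to take the bull by the horns": 50_000,
}
full = counts["it is best to take the bull by the horns"]

prefixes = list(counts)
completion = [full / counts[p] for p in prefixes]   # P(full sentence | prefix)
keyness = [b - a for a, b in zip(completion, completion[1:])]

for prefix, k in zip(prefixes[1:], keyness):
    print(f"{prefix.split()[-1]:6s} {k:+.3f}")      # biggest jump at 'bull'
```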

Conditional Probability Is Not Redundancy

Note this is different from the redundancy of each completion.

That would be another relevant statistic to base the concept of "keyness" on, but it would also require an estimate of the number of completions of each of these left-segments and their probabilities.

That's not a completely crazy thing to try to estimate based on some kind of language model, but it does require some corpus data that I don't have at hand right now.

Opaque Idioms

I just want to give a list of some of the Italian idioms that Cacciari and Tabossi provide, because they are such great examples of how much uncertainty we face when interpreting unknown idioms. Here's a list of their most striking items:

  • He was born with the shirt (= born with a silver spoon in his mouth)
  • That would have done him the skin (= killed him)
  • He made a hole in the water (= did not succeed)
  • She did a job with the feet (= badly)
  • The project has gone to the mountain (= failed)
  • The assets have been given bottom (= been depleted)
  • She had the moon (= was in a bad mood)
  • He was at the green (= broke)
  • He made himself in four to succeed (= tried hard)
  • She was left of salt (= struck dumb)

Monday, November 7, 2011

Sardinha: "Metaphor probabilities in corpora" (2008)

In his contribution to Confronting Metaphor in Use, Tony Sardinha argues that metaphor researchers should care more about the probabilities that a given word will be used metaphorically in a given corpus genre.

To illustrate his ideas, he reports a large number of metaphor probabilities taken from a highly specialized corpus of Brazilian Portuguese.

The article cites three interesting sources that are quite alien to metaphor theory proper: