Friday, May 30, 2014

Fisher: "On the Mathematical Foundations of Theoretical Statistics" (1921)

I haven't had time to study this paper in detail yet, only to give it a quick skim; I'll read the whole thing later. But for now, a few quotes.

First, a centerpiece in Bayes' original paper was the postulate that the uncertainty about the bias of a coin should be represented by means of a uniform distribution. Fisher comments:
The postulate would, if true, be of great importance in bringing an immense variety of questions within the domain of probability. It is, however, evidently extremely arbitrary. Apart from evolving a vitally important piece of knowledge, that of the exact form of the distribution of values of p, out of an assumption of complete ignorance, it is not even a unique solution. (p. 325)
Fisher in 1931; image from the National Portrait Gallery.

The second Bayesian topic is ratio tests: that is, assigning probabilities to two exclusive and exhaustive hypotheses X and Y based on the ratio between how well they explain the data set A, that is,
Pr(A | X) / Pr(A | Y).
Fisher comments:
This amounts to assuming that before A was observed, it was known that our universe had been selected at random for [= from] an infinite population in which X was true in one half, and Y true in the other half. Clearly such an assumption is entirely arbitrary, nor has any method been put forward by which such assumptions can be made even with consistent uniqueness. (p. 326)
The introduction of the likelihood concept:
There would be no need to emphasise the baseless character of the assumptions made under the titles of inverse probability and BAYES' Theorem in view of the decisive criticism to which they have been exposed at the hands of BOOLE, VENN, and CHRYSTAL, were it not for the fact that the older writers, such as LAPLACE and POISSON, who accepted these assumptions, also laid the foundations of the modern theory of statistics, and have introduced into their discussions of this subject ideas of a similar character. I must indeed plead guilty in my original statement of the Method of the Maximum Likelihood (9) to having based my argument upon the principle of inverse probability; in the same paper, it is true, I emphasised the fact that such inverse probabilities were relative only. That is to say, that while we might speak of one value of p as having an inverse probability three times that of another value of p, we might on no account introduce the differential element dp, so as to be able to say that it was three times as probable that p should lie in one rather than the other of two equal elements. Upon consideration, therefore, I perceive that the word probability is wrongly used in such a connection: probability is a ratio of frequencies, and about the frequencies of such values we can know nothing whatever. We must return to the actual fact that one value of p, of the frequency of which we know nothing, would yield the observed result three times as frequently as would another value of p. If we need a word to characterise this relative property of different values of p, I suggest that we may speak without confusion of the likelihood of one value of p being thrice the likelihood of another, bearing always in mind that likelihood is not here used loosely as a synonym of probability, but simply to express the relative frequencies with which such values of the hypothetical quantity p would in fact yield the observed sample. (p. 326)
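Fisher's relative reading of likelihood is easy to illustrate numerically; here is a minimal sketch with a binomial model (the sample of 7 heads in 10 tosses is an invented example, not Fisher's):

```python
from math import comb

def likelihood(p, heads, tosses):
    """Binomial likelihood of the bias p given the observed sample."""
    return comb(tosses, heads) * p**heads * (1 - p)**(tosses - heads)

# An invented sample: 7 heads in 10 tosses.
heads, tosses = 7, 10

# Likelihoods are meaningful only as ratios between values of p; there is
# no differential element dp, hence no probability distribution over p.
ratio = likelihood(0.7, heads, tosses) / likelihood(0.5, heads, tosses)
print(round(ratio, 2))  # 2.28: p = 0.7 is roughly twice as "likely" as p = 0.5
```

Fisher's point is that this ratio licenses no statement about the probability that p lies near 0.7; it only compares the frequencies with which the two hypothetical values of p would yield the observed sample.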
In the conclusion, he says that likelihood and probability are "two radically distinct concepts, both of importance in influencing our judgment," but "confused under the single name of probability" (p. 367). Note that these concepts are "influencing our judgment" — that is, they are not just computational methods for making a decision, but rather a kind of model of a rational mind.

Wednesday, May 28, 2014

Chomsky: "Three Models for the Description of Language" (1956)

This is my favorite Chomsky text, perhaps after Syntactic Structures. It contains a comparison of finite-state, phrase-structure, and context-sensitive languages; it also suggests that a transformational theory is the most illuminating generative story for English sentences.

A structural ambiguity; p. 118.

Garden Varieties

Among other things, the paper contains the following examples of formal languages (p. 115):
  • "Mirror-image" sentences: aa, bb, abba, baab, aabbaa, …
  • Echo sentences: aa, bb, abab, baba, aabaab, …
  • Counting sentences: ab, aabb, aaabbb, …
The counting language is also used to show that the set of phrase-structure languages is a proper subset of the set of context-sensitive languages (p. 119).
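These three toy languages are easy to pin down as membership tests over the alphabet {a, b}; a quick sketch:

```python
def is_mirror(s):
    """Mirror-image language: strings of the form w followed by its reverse."""
    return len(s) > 0 and len(s) % 2 == 0 and s == s[::-1]

def is_echo(s):
    """Echo language: strings of the form w followed by w itself."""
    half = len(s) // 2
    return len(s) > 0 and len(s) % 2 == 0 and s[:half] == s[half:]

def is_counting(s):
    """Counting language: a^n b^n for n >= 1."""
    n = len(s) // 2
    return n > 0 and s == "a" * n + "b" * n

assert all(is_mirror(s) for s in ["aa", "bb", "abba", "baab", "aabbaa"])
assert all(is_echo(s) for s in ["aa", "bb", "abab", "baba", "aabaab"])
assert all(is_counting(s) for s in ["ab", "aabb", "aaabbb"])
```

None of the three can be recognized by a finite-state device, since each requires unbounded memory of the first half of the string.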

A Markov model with hidden states (and thus arbitrary dependence lengths); p. 116.

Irrelevant to Grammar

The paper also contains the familiar jump from a rejection of Markov models to a rejection of statistical models at large:
Whatever the other interest of statistical approximation in this sense may be, it is clear that it can shed no light on the problems of grammar. There is no general relation between the frequency of a string (or its component parts) and its grammaticalness. We can see this most clearly by considering such strings as
(14) colorless green ideas sleep furiously
which is a grammatical sentence, even though it is fair to assume that no pair of its words may ever have occurred together in the past. (p. 116)
Thus,
there is no significant correlation between order of approximation and grammaticalness. If we order the strings of a given length in terms of order of approximation to English, we shall find both grammatical and ungrammatical strings scattered throughout the list, from top to bottom. Hence the notion of statistical approximation appears to be irrelevant to grammar. (p. 116)
I suppose "order of approximation" here means "probability" rather than literally the "order of the Markov model" (otherwise this assertion doesn't make much sense).
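Chomsky's point can be mimicked with a toy bigram model; in the sketch below (the training corpus is invented), any sentence containing an unseen word pair scores zero, so frequency cannot separate the grammatical example from its ungrammatical reversal:

```python
from collections import Counter

# An invented toy corpus; none of the word pairs in Chomsky's example occur in it.
corpus = "the green car stopped . the ideas were furiously debated .".split()
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_score(sentence):
    """Product of raw bigram counts: zero whenever any adjacent pair is unseen."""
    words = sentence.split()
    score = 1
    for pair in zip(words, words[1:]):
        score *= bigrams[pair]
    return score

print(bigram_score("colorless green ideas sleep furiously"))  # 0
print(bigram_score("furiously sleep ideas green colorless"))  # 0
```

With smoothing, both sentences would receive small nonzero probabilities, but a raw count model places the grammatical and ungrammatical strings side by side at the bottom of the list, much as Chomsky describes.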

Drum: "Change, Meaning, and Information" (1957)

This 1957 article is one of those countless papers by people in linguistics trying helplessly to say something non-trivial about the relationship between information theory and linguistics. The central claim of the paper is that there are several kinds of predictability or redundancy at play in language, and that "meaning" should be seen as one of them.

More precisely, certain sentences are predictable on account of their content, so that
the "meaning"—the "what we know about it"–in a message has some direct effect upon the amount of information transmitted. (p. 166)
However, at several points in the paper, this notion is confused and conflated with the idea that more predictability equals more meaning.

At any rate, Drum places semantics among several "levels" of redundancy, including a historically interesting division between "formal" and "contextual" syntax:
In the sentence, "The chair is __________," many words could be correctly used in terms of grammar; but in terms of the context of the sentence, only a very few apply. "The chair is clock" is correct grammatically, though it is nonsense in even the broadest contexts. (p. 167)
Further,
A person knowing English grammar but not word meanings, might very well write "The chair is clock," but would never do so if he knew the meanings involved. (The relationship of syntactic to contextual redundancy is much that of validity in logic to truth, incidentally). (p. 168)
A strange counterfactual, but quite revealing.

Herdan: The Advanced Theory of Language as Choice and Chance (1966)

This really odd book is based to some extent on the earlier Language as Choice and Chance (1956), but contains, according to the author, a lot of new material. It discusses a variety of topics in language statistics, often in a rather unsystematic and bizarrely directionless way.

The part about "linguistic duality" (Part IV) is particularly confusing and strange, and I'm not sure quite what to make of it. Herdan seems to want to make some great cosmic connection between quantum physics, natural language semantics, and propositional logic.

But leaving that aside, I really wanted to quote it because he so clearly expresses the deterministic philosophy of language — that speech isn't a random phenomenon, but rather has deep roots in free will.

"Human Willfulness"

He thus explains that language has one region which is outside the control of the speaker, but that once we are familiar with this set of restrictions, we can express ourselves within its bounds:
It leaves the individual free to exercise his choice in the remaining features of language, and insofar language is free. The determination of the extent to which the speaker is bound by the linguistic code he uses, and conversely, the extent to which he is free, and can be original, this is the essence of what I call quantitative linguistics. (pp. 5–6)
Similarly, he comments on a study comparing the letter distribution in two texts as follows:
There can be no doubt about the possibility of two distributions of this kind, in any language, being significantly different, if only for the simple reason that the laws of language are always subject, to some extent at least, to human willfulness or choice. By deliberately using words in one text, which happen to be of rather singular morphological structure, it may well be possible to achieve a significant difference of letter frequencies in the two texts. (p. 59)
And finally, in the chapter on the statistical analysis of style, he goes on record completely:
The deterministic view of language regards language as the deliberate choice of such linguistic units as are required for expressing the idea one has in mind. This may be said to be a definition in accordance with current views. Introspective analysis of linguistic expression would seem to show it is a deterministic process, no part of which is left to chance. A possible exception seems to be that we often use a word or expression because it 'happened' to come into our mind. But the fact that memory may have features of accidental happenings does not mean that our use of linguistic forms is of the same character. Supposing a word just happened to come into our mind while we are in the process of writing, we are still free to use it or not, and we shall do one or the other according to what is needed for expressing what we have in mind. It would seem that the cause and effect principle of physical nature has its parallel in the 'reason and consequence' or 'motive and action' principle of psychological nature, part of which is the linguistic process of giving expression to thought. Our motive for using a particular expression is that it is suited better than any other we could think of for expressing our thought. (p. 70)
He then goes on to say that "style" is a matter of self-imposed constraints, so that we are in fact a bit less free to choose our words than it appears, although we ourselves come up with the restrictions.

"Grammar Load"

In Chapter 3.3, he expands these ideas about determinism and free will in language by suggesting that we can quantify the weight of the grammatical restrictions of a language in terms of a "grammar load" statistic. He suggests that this grammar load can be assessed by counting the number of word forms per token in a sample of text from the language (p. 49). He does discuss entropy later in the book (Part III(C)), but doesn't make the connection to redundancy here.

Inspecting an English corpus of 78,633 tokens, he thus finds 53,102 different forms and concludes that English has a "grammar load" of
53,102 / 78,633 = 67.53%.
Implicit in this computation is the idea that the number of new word forms grows linearly with the number of tokens you inspect, which rules out sublinear spawn rates. In fact, vocabulary size does seem to grow sublinearly as a function of corpus size; the spawn rate for new word forms is thus not constant in English text.
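Herdan's statistic and the spawn-rate question are easy to reproduce; here is a minimal sketch on an invented Zipf-like token stream (not the Brown corpus):

```python
import random

def vocabulary_growth(tokens, step=10_000):
    """Distinct word forms seen in the first N tokens, sampled every `step` tokens."""
    seen, sizes = set(), []
    for i, token in enumerate(tokens, start=1):
        seen.add(token)
        if i % step == 0:
            sizes.append((i, len(seen)))
    return sizes

# A synthetic heavy-tailed stream standing in for real text.
random.seed(0)
tokens = [f"w{int(random.paretovariate(1.0))}" for _ in range(50_000)]

# Herdan's "grammar load" is len(seen) / N at a single N; printing the ratio
# at several N shows it falling as the sample grows, i.e. sublinear growth.
for n, v in vocabulary_growth(tokens):
    print(n, v, f"{v / n:.2%}")
```

The same computation on real text gives different grammar loads at different corpus sizes, which is exactly the problem with treating a single ratio as a property of the language.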

Word forms in the first N tokens in the Brown corpus for N = 0 … 100,000.

However, no simple sublinear growth function seems to fit the data very well (as far as I can see).

Tuesday, May 27, 2014

Arnauld: Logic (1662), Part IV, chs. 13–16

19th-century portrait of Arnauld, the main author of the Logic.
The last four chapters of the Port-Royal Logic deal with reasoning under uncertainty. They touch briefly on issues of evidential support, balanced odds, and fair games.

In retrospect, these remarks are difficult not to read as precursors of later probability theory. Many authors have pointed this out, including Ian Hacking and Lorraine Daston.

I'm reading the 1996 translation by Jill Buroker, published by Cambridge UP.

The contents of the four chapters on probability, or plausibility, are:
  • Chapter 13, "Some Rules for directing reason well in beliefs about events that depend on human faith," argues that the standard of geometric proof and mathematical certainty only applies to matters of "immutable essence," and that judgment about human affairs should be made by meditating on the available evidence and testimonies.
  • Chapter 14, "Application of the preceding rule to the beliefs about miracles," claims that this means that there are several instances in which it is reasonable to believe miracles took place, even if this cannot be proven beyond all reasonable doubt.
  • Chapter 15, "Another remark on the same subject of beliefs about events," adds that this explains why certain deeds are attested by two notaries, and that the technique of weighing the evidence also has applications to authorship attribution for ancient (religious) manuscripts.
  • Chapter 16, "The Judgments we ought to make concerning future events," argues that the doctrine not only applies to reasoning about the past, but also about the prediction of the future; a number of fair and unfair games are described as examples, and a version of Pascal's wager is then put forward.
Quotes follow below.


Don't Hold Your Breath

In chapter 13, we are told that the methodology of geometric proof works for geometry,
But if we try to use the same rules for beliefs about human events, we will always judge them falsely, except by chance, and we will make a thousand fallacious inferences about them. (p. 263)
Instead, we have to look for circumstantial evidence:
In order to decide the truth about an event and to determine whether or not to believe in it, we must not consider it nakedly and in itself, as we would a proposition of geometry. But we must pay attention to all the accompanying circumstances, internal as well as external. I call those circumstances internal that belong to the fact itself, and those external that concern the persons whose testimony leads us to believe in it. Given this attention, if all the circumstances are such that it never or only rarely happens that similar circumstances are consistent with the falsity of the belief, the mind is naturally led to think that it is true. Moreover, it is right to do so, above all in the conduct of life, which does not require greater certainty than moral certainty, and which even ought to be satisfied in many cases with the greatest probability.
But if, on the contrary, these circumstances are such that they are often consistent with the falsity of the belief, reason would require either that we remain in suspense, or that we view as false whatever we are told when its truth does not look likely, even if it does not look completely impossible. (p. 264)

Cloister of the Hôpital Cochin, inhabiting the former site of the Port-Royal abbey.

 

A Perversion of Reason

This idea is echoed in chapter 15:
Since we should be satisfied with moral certainty in matters not susceptible of metaphysical certainty, so too when we cannot have complete moral certainty, the best we can do when we are committed to taking sides is to embrace the most probable, since it would be a perversion of reason to embrace the less probable. (p. 270)
We also hear that negative evidence can "weaken or destroy in the mind the grounds for belief" (p. 270).

A couple of things worth noting about these quotes:
  • The process described here is a mental therapy, not a decision calculus; you pay close attention, and your "mind is naturally led" to a certain belief.
  • The focus is on human affairs, that is, on practical matters.
  • As in frequentist statistics, the contrast is not between true and false, but between confirmed and unconfirmed; we thus "remain in suspense" if the evidence is insufficient.

Set the Record Straight

Chapter 16 pours scorn on "many people" for entertaining a certain "illusion":
This is that they consider only the greatness and importance of the benefit they desire or the disadvantage they fear, without considering in any way the likelihood or probability that this benefit or disadvantage will or will not come about. (p. 273)
That is, thinking only of utility while ignoring probability. Moreover,
This is what attracts so many people to lotteries: Is it not highly advantageous, they say, to win twenty thousand crowns for one crown? Each person thinks he will be the happy person who will win the jackpot. No one reflects that if it is, for example, twenty thousand crowns, it may be thirty thousand times more probable for each individual to lose rather than to win it.
The flaw in this reasoning is that in order to decide what we ought to do to obtain some good or avoid some harm, it is necessary to consider not only the good or harm itself, but also the probability that it will or will not occur, and to view geometrically the proportion all these things have when taken together. This can be clarified by the following example.
There are games in which, if ten persons each put in a crown, only one wins the whole pot and all the others lose. Thus each person risks losing only a crown and may win nine. If we consider only the gain and loss in themselves, it would appear that each person has the advantage. But we must consider in addition that if each could win nine crowns and risks losing only one, it is also nine times more probable for each person to lose one crown and not to win nine. Hence each has nine crowns to hope for himself, one crown to lose, nine degrees of probability of losing a crown, and only one of winning the nine crowns. This puts the matter at perfect equality. (pp. 273–74)
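Arnauld's arithmetic checks out; here is a minimal sketch (the 1/30,000 lottery odds below are a loose reading of the quote above, not Arnauld's exact figure):

```python
from fractions import Fraction

def expected_gain(stake, prize, p_win):
    """Expected value of a bet: weigh the prize and the stake by their probabilities."""
    return p_win * prize - (1 - p_win) * stake

# The ten-person game: stake 1 crown, win 9 crowns with probability 1/10.
print(expected_gain(1, 9, Fraction(1, 10)))  # 0 -- "perfect equality"

# The lottery: stake 1 crown for 20,000, at odds far worse than 1/20,000.
print(expected_gain(1, 20_000, Fraction(1, 30_000)))  # -3333/10000, a losing bet
```

The "geometric proportion" Arnauld asks us to view is exactly this weighing of gain and loss by their probabilities, i.e. an expected value.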
This argument is then pushed a little further to deal with some more extreme bets; to the fear of lightning, which is allegedly irrational and has to be "set straight" (p. 275); and finally to a version of Pascal's wager.

Monday, May 26, 2014

Diogenes: Lives of the Eminent Philosophers, Vol. I

Socrates; statue at the Louvre; from Wikimedia.
How did Socrates pay his bills?

Most of our sources tell us that he didn't accept fees from his students. Hanging around the market place can't have generated too much income either. So where did the money come from?

There are a couple of potential answers in Diogenes Laërtius' book. He provides a number of details about the mundane existence of many of the ancient philosophers, including their social standing and moneyed patrons.

No Need

Somewhat surprisingly, the most direct statement we get from Diogenes is that Socrates supported himself through usury:
Aristoxenus, the son of Spintharus, says of him that he made money; he would at all events invest sums, collect the interest accruing, and then, when this was expended, put out the principal again. (II. 20)
This implies that he must have possessed some measure of capital to lend; but Diogenes also emphasizes how simply and modestly he lived. Several pages are dedicated to discussing his ascetic lifestyle, and we hear that
He prided himself on his plain living, and never asked a fee from anyone. He used to say that he most enjoyed the food which was least in need of condiment, and the drink which made him feel the least hankering for some other drink; and that he was nearest to the gods in that he had the fewest wants. (II. 27)
But apparently, this notion that he did not charge for his teaching is contradicted by the portrayal in the Clouds of Aristophanes. (I haven't read the play, though.)

Also, Diogenes reports the following:
Aeschines said to him, "I am a poor man and have nothing else to give, but I offer you myself," and Socrates answered, "Nay, do you not see that you are offering me the greatest gift of all?" (II. 34)
Does this mean that one would have to apologize to Socrates for not paying tuition? Or apologize because one would in normal circumstances be expected to?

No Thanks

A recurring theme in Diogenes' narratives is Socrates being offered but rejecting gifts from rich and prominent people, often with a snappy one-liner:
Pamphila in the seventh book of her Commentaries tells how Alcibiades once offered him a large site on which to build a house; but he replied, "Suppose, then, I wanted shoes and you offered me a whole hide to make a pair with, would it not be ridiculous for me to take it?" (II. 24)
He showed his contempt for Archelaus of Macedon and Scopas of Cranon and Eurylochus of Larissa by refusing to accept their presents or to go to their court. (II. 25)
Again, when Charmides offered him some slaves in order that he derive an income from them, he declined the offer; (II. 31)
Apollodorus [the playwright?] offered him a beautiful garment to die in: "What," said he, "is my own good enough to live in but not to die in?" (II. 35)
Taken at face value, this means Socrates didn't actually enjoy any of these gifts; but at the same time, it indicates that there was a culture of patronage in Athens, with aristocrats and moneyed classes seemingly supporting their own favourite intellectuals.

Archelaus, king of Macedonia, 413–399; from Wikimedia.

But Yes

But a counterpoint to these anecdotes comes up in Diogenes' biography of Aristippus, the much more extravagant student of Socrates who was known for using perfume, eating luxurious foods, keeping multiple courtesans, etc.:
To the accusation that, although he was a pupil of Socrates, he took fees, his rejoinder was, "Most certainly I do, for Socrates, too, when certain people sent him corn and wine, used to take a little and return all the rest; and he had the most foremost men in Athens for his stewards, whereas mine is my slave Eutychides." (II. 74)
I presume the last sentence is a kind of cynical put-down, intended to excuse his own comfortable life as well as de-glamorize the ascetic life of Socrates by comparing his following to a staff of slaves; but I don't know for sure.

We also get the following little gem of an exchange illustrating just how much Aristippus differed from Socrates:
When he had made some money by teaching, Socrates asked him, "Where did you get so much?" to which he replied, "Where you got so little." (II. 80)
Pretty snappy too, it would thus appear.

All the Sweeter

Let's just stay with Aristippus for a bit. He offers such an illuminating contrast through which to understand the life and thought of Socrates. For instance,
Aristippus; apparently an 18th-century print.
… he was the first of the followers of Socrates to charge fees and to send money to his master. And on one occasion, the sum of twenty minae which he had sent was returned to him, Socrates declaring that the supernatural sign would not let him take it; the very offer, in fact, annoyed him. (II. 65)
Not surprisingly, he had little rapport with Diogenes the Cynic:
Diogenes, washing the dirt from his vegetables, saw him passing and jeered at him in these terms, "If you had learnt to make these your diet, you would not have paid court to kings," to which his rejoinder was, "And if you knew how to associate with men, you would not be washing vegetables." (II. 68)
Unfortunately, this anecdote is also attributed to a meeting between the Cynic philosopher Metrocles and the later philosopher Theodorus.

Money Not Books

Three other equally illuminating examples mentioned by Diogenes are:
When some one brought his son as a pupil, he asked a fee of 500 drachmae. The father objected, "For that sum I can buy a slave." "Then do so," was the reply, "and you will have two." (II. 72)
To one who reproached him with extravagance in catering, he replied, "Wouldn't you have bought this if you could have got it for three obols [= a tiny sum]?" The answer being in the affirmative, "Very well, then," said Aristippus, "I am no longer a lover of pleasure, it is you who are a lover of money." (II. 75)
16th-century French woodcut of
Dionysius; from Wikimedia.
He received a sum of money from Dionysius [the tyrant of Syracuse, his patron] at the same time that Plato carried off a book and, when he was twitted with this, his reply was, "Well, I want money, Plato wants books." (II. 81)
Finally, we get the following quite repulsive story about his incessant association with prostitutes:
A courtesan having told him that she was with child by him, he replied, "You are no more sure of this than if, after running through coarse rushes, you were to say you had been pricked by one in particular." Someone accused him of exposing his son as if it was not his offspring. Whereupon he replied, "Phlegm, too, and vermin we know to be of our own begetting, but for all that, because they are useless, we cast them as far from us as possible." (II. 81)

Some Indirect Evidence

While Diogenes gives us relatively little to go by with respect to Socrates, he does provide a few more hints with respect to some of the other Academic philosophers. For instance:
  • Plato had two different estates, a quite impressive amount of silver, and apparently four household slaves; (III. 41–42)
  • Arcesilaus, a later head of the Academy, "had a property in Pitane from which his brother Pylades sent him supplies." (IV. 38)
  • Aristotle was hired as a teacher for the young prince Alexander during the reign of Philip of Macedon (V. 4). Alexander was fifteen at the time, if I understand the chronology correctly (V. 10).
  • After he became king, Alexander apparently sent Xenocrates, then head of the Academy, "a large sum of money," and "he took three thousand Attic drachmas and sent back the rest to Alexander, whose needs, he said, were greater than his own, because he had a greater number of people to keep" (IV. 8).
  • About forty or fifty years prior to that, the Macedonian general Ptolemy Soter had made a similar proposition to the philosopher Stilpo of Megara: "Ptolemy Soter, they say, made much of him, and when he had got possession of Megara, offered him a sum of money and invited him to return with him to Egypt. But Stilpo would only accept a very moderate sum, and he declined the proposed journey, and removed to Aegina until Ptolemy set sail" (II. 115).
As with Socrates, there thus seems to be a pattern of rich, brutal rulers trying to ingratiate themselves with the intelligentsia, and of that professional class then trying to hold on to some of that sweet, sweet money without building up too much of a dependency.

The best map of ancient Greece I could find; from Encyclopedia Britannica Kids.

Monday, May 19, 2014

Stockwell: "The Transformational Model of Generative or Predictive Grammar" (1963)

This essay by Robert Stockwell is pretty much a rehash of Chomsky's ideas from Syntactic Structures, and as such quite boringly predictable. (Stockwell also thanks Chomsky for critical comments in the acknowledgments.) However, it does touch on some interesting issues in the last couple of pages, so I'll give a couple of excerpts here.

The essay is a part of an anthology called Natural Language and the Computer, which (in addition to being quite wonderfully techno-camp) is one of the most boldly and beautifully typeset books I've seen in a while.

The title page, with gargantuan print and the editor's name in a spacy font.

Some aspects of its graphical design have an undeniable 1950s feel; others look like a kind of 1960s Space Odyssey futurism; and others again look like they were taken straight out of Blade Runner. It's all quite remarkable.

Title page of Part 1, with an unmistakably 1950s italic font and an "etched" figure 1.

And then of course there's the always charming prospect of reliving a lecture taking stock of the computer revolution anno 1960. One of the other books in the same series is even called – I kid you not – The Foundations of Future Electronics. How can you not love a 1961 book with that title?

At any rate, the essay that I'm interested in plods through the familiar details of Chomskyan grammar, pushing in particular the hard sell of transformational grammar over "immediate constituent" grammar (i.e., phrase structure grammars without any postprocessing).

The last section of the essay is called "Grammaticalness and Choice" and gets a bit into the issue of how sentences are selected, and what goes on in the head of the alleged "ideal speaker" of Chomskyan linguistics. This part contains a number of interesting quotes, and I'll provide some generous extracts below.

"Forced to Operate Empirically"

The first notion taken up in the section is that of grammaticality itself, which it seems that he sees some problems making empirically precise:
Presumably the notion of "grammatical sentence" is characterized by the grammar itself, since in principle we formulate our rules in such a way as to generate only and always such sentences. It is a question of some interest whether there is a possibility of characterizing this notion independently of the grammar. It seems extremely unlikely that there is, and we will be forced to operate empirically with the machinery of the grammar, treating each sentence that it generates as a hypothesis to be tested for grammaticalness against the reaction of native speakers. For each sentence rejected, we either revise the grammar to exclude the sentence (if we believe the rejection is on proper grounds–that is, not motivated by school grammar and the like), or we revise the grammar to generate the sentence in some special status (i.e., as only partially well formed). Each sentence accepted is, of course, a confirmation of the validity of the rules up to that point. (p. 43)
I suppose what Stockwell has in mind here is that there might in principle exist some kind of objective test of grammaticality which could relieve us of having to trust laypeople to know the difference between "ungrammatical" and "nonsensical." (If you'll allow me a bit of self-citation, I've written a short paper on the idea of having such a distinction.)

Today, linguists might fantasize about such a test taking the form of an fMRI scan; in the 1980s, they would have imagined it as an EEG; and in the 1950s, a polygraph reading. But in the absence of such a test, we are forced to use live and conscious people even though
Informant reaction is difficult to handle, because such reactions involve much more than merely the question of grammaticalness. (p. 43)
We thus only have indirect access to the alleged grammatical engine of the brain.

"The Rest Takes Care of Itself"

After briefly considering a couple of borderline grammatical cases, Stockwell continues:
One might consider the utilization of grammar by the speaker as follows. The essence of meaning is choice; every time an element is chosen in proceeding through the rules, that choice is either obligatory (in which case it was not really a choice at all, since there were no alternatives), or it is optional (in which case the choice represented simultaneously both the positive decision and the rejection of all alternatives–the meaning of the choice inheres in the [sic] constrastive value of the chosen element as compared with all the possible choices that were rejected). (p. 44)
Oddly enough, Stockwell's meditation on the actual role and implementation of Chomskyan grammar in a person's behavior brings him around to confirming not only de Saussure's picture of meaning, but also Shannon's. I wonder whether he is aware of the implications of this.

He then goes on to consider an example:
Thus these are the choices involved in a simple sentence such as
 Did the boy leave.
NP + VP – Obligatory
D + N – Obligatory
D == the – Optional
N == boy – Optional
aux + VP1 – Obligatory
aux == past – Optional
VP1 == Vi – Optional
Vi == leave – Optional
Tintrg – Optional
Inversion of Te – Obligatory
Empty carrier for past – Obligatory
Rising intonation – Obligatory
Of the twelve choices, half are obligatory–either initiating the derivation, or following out obligatory consequences of optional choices. The additional rules of the phonetic component are nearly all obligatory. To include these would increase the obligatory choices to about twice the number of optional choices. In fact it is quite probable that in real discourse even the element the is obligatory (that is, the choice of the versus a seems quite predictable in a larger context). This would leave us with only five meaning-carrying (optional) choices. Everything else that goes into making up the sentence is in a valid sense perfectly mechanical, perfectly automatic. It can be argued that a grammar must maximize the obligatory elements and cut the optional choice to the barest minimum in order to get any reasonable understanding of how the human brain is capable of following complex discourse at all. That is, the hearer's attention is focused on matching, with his own generating machinery, the sequence of optional choices; since he has identical obligatory machinery, the rest takes care of itself. In this way, the same grammatical model accounts for both encoding and decoding. We do not need separate and distinct analogs for both sending and receiving messages. (pp. 44–45)
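Stockwell's tally can be checked mechanically; here is a trivial sketch encoding his twelve derivation steps with their labels:

```python
# Stockwell's twelve derivation steps, as (rule, kind) pairs.
derivation = [
    ("NP + VP", "obligatory"),
    ("D + N", "obligatory"),
    ("D == the", "optional"),
    ("N == boy", "optional"),
    ("aux + VP1", "obligatory"),
    ("aux == past", "optional"),
    ("VP1 == Vi", "optional"),
    ("Vi == leave", "optional"),
    ("Tintrg", "optional"),
    ("Inversion of Te", "obligatory"),
    ("Empty carrier for past", "obligatory"),
    ("Rising intonation", "obligatory"),
]

counts = {"obligatory": 0, "optional": 0}
for _, kind in derivation:
    counts[kind] += 1
# Exactly half of the twelve choices are obligatory, as Stockwell says.
```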
Again, this is oddly similar to the kind of generative model one would employ in information theory, and the notion of having a sparse language to minimize cognitive effort here takes the place of error-correction. But presumably, the philosophical difference is whether we need a source model (of the "optional choices") or only a channel model (of the "perfectly mechanical, perfectly automatic" choices).

"From Which He Knowingly Deviates"

This reading of the differences is backed up by his elaboration:
Encoding and decoding does not imply that a speaker or hearer proceeds step by step in any literal sense through the choices characterized by the grammar in order to produce or understand sentences. The capacities characterized by the grammar are but one contributing factor of undetermined extent in the total performance of the user of language. The grammar enumerates only well-formed sentences and deviant sentences, which, recognized as ranging from slightly to extremely deviant by the language user, are interpreted somehow by comparison with well-formed ones. The grammar enumerates sentences at random; it does not select, as the user does, just those sentences appropriate to a context. The grammar clacks relentlessly through the possible choices; the user starts, restarts, jumps the grammatical traces, trails off. A generative grammar is not a speaker-writer analog. It is a logical analog of the regularities to which the language user conforms or from which he knowingly deviates. (p. 45)
I take it that this entails that grammars are essentially and necessarily logical in nature, since their purpose is to describe the set of available sentences of the language rather than to predict their occurrence. From such a perspective, a probabilistic context-free grammar would be something of an oxymoron.
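To make the contrast concrete, here is a toy probabilistic context-free grammar of my own invention (the rules and probabilities are made up): a plain CFG only answers the yes/no membership question, whereas a PCFG additionally assigns each derivation a probability, i.e. it acts as a source model in Shannon's sense.

```python
import random

# Toy PCFG: nonterminal -> list of (expansion, probability) rules.
# Grammar and probabilities are invented for illustration only.
pcfg = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("the", "boy"), 0.6), (("the", "girl"), 0.4)],
    "VP": [(("left",), 0.7), (("slept",), 0.3)],
}

def sample(symbol, rng):
    """Sample a sentence top-down, returning (words, probability)."""
    if symbol not in pcfg:            # terminal symbol
        return [symbol], 1.0
    rules = pcfg[symbol]
    expansion, prob = rng.choices(rules, weights=[p for _, p in rules])[0]
    words, total = [], prob
    for child in expansion:
        child_words, child_prob = sample(child, rng)
        words.extend(child_words)
        total *= child_prob
    return words, total

rng = random.Random(0)
words, prob = sample("S", rng)
# The four possible sentences have probabilities 0.42, 0.28, 0.18, 0.12.
```

The underlying CFG is recovered by ignoring the probabilities, which is exactly the "logical" reading Stockwell seems to endorse.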

A logical and a probabilistic conception of grammar.
 

"A Natural Extension of Scholarly Grammar"

Again in perfectly orthodox fashion, Stockwell finally concedes the impossibility of formulating discovery procedures and makes a strange claim about the machine-unfriendly nature of transformational grammars:
Although the title of this book suggests the machine processing of natural-language data, it should not be assumed that the transformational model of the structure of language is in any way geared to machines of any kind, either historically or in current development. On the contrary, it is a natural extension of traditional scholarly grammar, an effort to make explicit the regularities to which speakers of a language conform, which has been the focus of grammatical studies for over 2,500 years. The effort to formulate discovery procedures for the systematic inference of grammatical structure is quite recent; few if any transformationalists believe such a goal has any possibility of success–at least, not until much more is known about the kinds of regularities which grammarians seek to discover and to formulate in explicit terms. (p. 45)
That seems a bit odd in the light of, e.g., Victor Yngve's recollection of how people were swayed by Chomsky's grammatical analyses because they "read like computer programs."

Sunday, May 18, 2014

Descartes: Meditations on First Philosophy (1641)

Descartes; image from Wikimedia.

The work of René Descartes was unlike its contemporary counterparts in combining ancient Christian meditative practices with more modern academic philosophy. This has a number of peculiar consequences, such as his insistence on monkish seclusion in combination with his preference for the newly emerged geometric conception of natural philosophy. The Meditations shows both of these underlying philosophies at work.

The Smell of Money

One funny aspect of both the Meditations and the Discourse on Method is how explicitly he ties the practice of philosophy to the practical reality of being of the moneyed classes:
To-day, then, since very opportunely for the plan I have in view I have delivered my mind from every care and since I have procured for myself an assured leisure in a peaceable retirement, I shall at last seriously and freely address myself to the general upheaval of all my former opinions. (I, pp. 45–46)
It comes up as well as a backdrop for other passages, as when he discusses the possibility that his senses might deceive him about seemingly obvious facts:
For example, there is the fact that I am here, seated by the fire, attired in a dressing gown, having this paper in my hands and other similar matters. And how could I deny that these hands and this body are mine, were it not perhaps that I compare myself to certain persons, devoid of sense, whose cerebella are so troubled and clouded by the violent vapours of black bile, that they constantly assure us that they think they are kings when they are really quite poor, or that they are clothed in purple when they are really without covering, or who imagine that they have an earthenware head or are nothing but pumpkins or are made of glass. But they are mad, and I should not be any less insane were I to follow examples so extravagant. (I, p. 46)
The very first kind of madness he alludes to is thus economic megalomania. By contrast, our hero is seated comfortably at the fire in the center of the bourgeois universe, enjoying his dignified leisure.
 

Spiritual Exercise

The purpose of these meditations is to retract or "withhold assent" about uncertain things, as he declared in the first meditation (p. 46). This involves a quite radically world-renouncing and quite literally meditative move:
I shall now close my eyes, I shall stop my ears, I shall call away my senses, I shall efface even from my thoughts all the images of corporeal things, or at least (for that is hardly possible) I shall esteem them as vain and false; and thus holding converse only with myself and considering my own nature, I shall try little by little to reach a better knowledge of myself. (III, p. 58)
In the "Preface to the Reader," he makes it clear that he expects us, his future readers, to use the book as a guide on a similar spiritual journey:
… I should never advise anyone to read it [= this book] excepting those who desire to meditate seriously with me, and who can detach their minds from affairs of sense, and deliver themselves entirely from every sort of prejudice. (Preface, p. 40)
As a corollary, we get of course a highly mentalistic concept of human beings and of the soul:
But what then am I? A thing which thinks. What is a thing which thinks? It is a thing which doubts, understands, affirms, denies, wills, refuses, which also imagines and feels. (II, p. 54)
Gone is thus the "talking animal," not to mention the walking one.

Title page; also from Wikimedia.

Mnemotechnics

Another aspect of the meditative practice is that it literally is intended to change his attitudes by means of repeated appreciation of certain facts:
… although I notice a certain weakness in my nature in that I cannot continually concentrate my mind on one single thought, I can yet, by attentive and frequently repeated meditation, impress it so forcibly on my memory that I shall never fail to recollect it whenever I have need of it, and thus acquire a habit of never going astray. (IV, p. 79)
This notion of meditation as a kind of doxastic therapy has also come up previously, when he discusses the idea that his mind is known more immediately than the wax in his hand:
But because it is difficult to rid oneself so promptly of an opinion which one was accustomed to for so long, it will be well that I should halt a little at this point, so that by the length of my meditation I may more deeply imprint on my memory this new knowledge. (II, p. 58)
Again, this should remind us of earlier textbooks in logic recommending frequent memory training and the like as a means for improving the soul.

Friday, May 9, 2014

Zabell: "R. A. Fisher and the Fiducial Argument" (1992)

Chapter 3.3 of Fisher's 1956 book is dedicated to his so-called "Fiducial Argument."

I was extremely confused by his presentation and looked around for some secondary literature. This brought me to this wonderful paper by Sandy Zabell, which explains how Fisher's ideas about fiducial inference were indeed quite confused and changed a lot over time. It also explains the core of his argument better than he did himself (to my mind, at least).

As I now understand Fisher's argument, this is the idea: When you have a statistical model with a flat prior, the posterior probability of a specific parameter setting is proportional to the likelihood of the data under that parameter setting,

Pr(p | x) ∝ Pr(x | p)

However, for many unbounded parameter spaces, the likelihood Pr(X = x | p) does not have a finite integral when considered as a function of p. In such cases, a straightforward use of Bayesian inference with a flat prior is not an option.
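A concrete instance, borrowing the uniform example used further down: if X is uniform on [0, p], the likelihood of a single observation x is 1/p for p > x, and its integral over p grows like the logarithm of the upper limit, so a flat prior yields no normalizable posterior. A quick numerical sketch:

```python
import math

def likelihood(p, x):
    """Likelihood of observing x under Uniform(0, p): 1/p if p > x, else 0."""
    return 1.0 / p if p > x else 0.0

def partial_integral(x, upper, n=100_000):
    """Midpoint-rule integral of the likelihood over p in (x, upper)."""
    h = (upper - x) / n
    return sum(likelihood(x + (i + 0.5) * h, x) * h for i in range(n))

x = 2.5
partials = [partial_integral(x, upper) for upper in (10.0, 100.0, 1000.0)]
# Each partial integral is close to log(upper / x), which diverges with upper.
```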

But, Fisher says, consider instead the cumulative likelihood Pr(X < x | p). This is a function of x which always lies between 0 and 1, and it is 0 at negative infinity and 1 at positive infinity.

Pr(X < x | p) for uniform distributions with right end-points p = 3, p = 5, and p = 7.

The trick now is to view this cumulative likelihood as a function of p instead of a function of x. In many but not all cases, this function will be 1 when the parameter is at negative infinity and 0 when it is at positive infinity.

Cumulative likelihood at x = 1.5, x = 2.5, and x = 3.5 as a function of the parameter.

For instance, if p is the mean of a normal distribution, the cumulative likelihood Pr(X < x | p) decreases in this way. The reason is that the upper bound x stays where it is, while the expected value of the variable increases.
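A quick numerical check of this, assuming unit variance: fixing the observation x and sliding the mean p upward pushes probability mass past x, so Pr(X < x | p) falls monotonically from 1 toward 0.

```python
import math

def normal_cdf(x, mean, sd=1.0):
    """Pr(X < x) for X ~ Normal(mean, sd)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

x = 1.5  # the observation stays fixed
cumulative = [normal_cdf(x, p) for p in (-2.0, 0.0, 2.0, 4.0)]
# As the mean p increases, the cumulative likelihood strictly decreases.
```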

Uniform likelihoods Pr(X < x | p) with varying right end-points p.

In such cases, we can thus interpret the cumulative likelihood as the complement of a CDF for the parameter,
G(p) = 1 – Pr(X < x | p).
If this function G is differentiable, we can further interpret G' as a PDF for the parameter p given observation x.

As an example, suppose (as on the pictures) that a number X is drawn from a uniform distribution on the interval [0, p]. The cumulative likelihood is then
Pr(X < x | p) = x/p    (0 < x < p),
and 0 and 1 below and above the interval, respectively.

Fiducial PDFs given the observations x = 1.5, x = 2.5, and x = 3.5.

Considering the complement of this function, 1 – x/p, as a CDF for the parameter p, we can differentiate it in order to get the density
G'(p) = x/p²
when p > x, and 0 otherwise. We have thus obtained a posterior probability distribution for the parameter without assuming anything about the prior.
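As a sanity check on this example (taking x = 2.5 from the figures), one can sample from the fiducial distribution by inverting its CDF F(p) = 1 - x/p, which gives p = x/(1 - u) for uniform u, and compare empirical frequencies against the CDF:

```python
import random

def fiducial_sample(x, rng):
    """Inverse-transform sample from the fiducial CDF F(p) = 1 - x/p (p > x)."""
    u = rng.random()     # uniform on [0, 1)
    return x / (1.0 - u)

rng = random.Random(0)
x = 2.5
samples = [fiducial_sample(x, rng) for _ in range(100_000)]

# The fiducial CDF says Pr(p <= 5) = 1 - 2.5/5 = 0.5.
empirical = sum(s <= 5.0 for s in samples) / len(samples)
```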

It should be noted that
  • This method does not always work; consider for instance the case in which the cumulative likelihood oscillates between a unimodal normal and a bimodal normal distribution as the parameter runs along the real number line.
  • The method can also give inconsistent results; for instance, the fiducial distribution of X² is, as far as I understand, not necessarily the distribution you would get by finding the fiducial distribution of X and then deriving a distribution for X².
  • The method has no single, natural extension to the multi-parameter case, and there are some serious obstacles to constructing such an extension.
It is also interesting that in the example above, the fiducial distribution corresponds to the posterior you get if you assume the improper prior 1/p. It can thus not be rationalized as a posterior inference using only ordinary probability distributions, but it can if we allow ourselves crazy, unnormalizable ones.

Tuesday, May 6, 2014

de Finetti: Probability, Induction, and Statistics (1972)

In Chapters 8 and 9 of this anthology, Bruno de Finetti reiterates his reasons for espousing Bayesian probability theory as the unique optimal calculus of reasoning. This brings him into a discussion of several controversies surrounding the two paradigms of statistics.

Bruno de Finetti and a computer; image from www.moebiusonline.eu.

No Unknown Unknowns

According to de Finetti, the ordinary meaning of the word "probability" is "a degree of belief" (p. 148), and he rejects any attempt to define it in terms of frequency:
… we reject the idea that the ostensible notion of identical events or trials gives a suitable basis for an empirical formulation of a frequentist theory of probability or for some objectivistic form of the "law of large numbers". (p. 154)
Consequently:
The probability of an event conditional on, or in the light of, a specified result is a different probability, not a better evaluation of the original probability. (p. 149)
There is thus no such thing as an "unknown probability." You always know your own uncertainty:
Any assertion concerning probabilities of events is merely the expression of somebody's opinion and not itself an event. There is no meaning, therefore, in asking whether such an assertion is true or false or more or less probable. (p. 189)
Thus, "speaking of unknown probabilities must be forbidden as meaningless" (p. 190) and in fact rejected as a "superstition" (pp. 154–55).

But of course we do have problems assigning numbers to things, so de Finetti has some explaining to do. He thus invokes the analogy of choosing a price for a commodity:
A personal probability is, in effect, a quantitative decision closely akin to deciding on a price. In seeking to fix such a number with precision the person will sooner or later encounter difficulties that evoke the expressions "vagueness", "insecurity", or "vacillation". Analysis of this omnipresent phenomenon has given rise to misunderstandings. Thus, attempts to say that the exact probabilities are "meaningless" or "non-existent" pose more severe problems than they are intended to resolve, similarly for replacements of individual probabilities by intervals or by second-order probabilities. […] Sight should not be lost of the fact that a person may find himself in an economic situation that entails acting in accordance with a sharply defined probability, whether the person chooses his act with security or not. (p. 145)
In spite of this seeming pluralism about personal opinion, he still maintains that the mathematical concept of probability is an idealization:
The (subjectivistic) theory of probability is a normative theory (p. 151).
But of course, the latter refers only to the mechanics of the calculus, not the choice of priors.

Rants Against Frequentism

De Finetti hates frequentist statistics. In his brief historical sketch, he says that the frequentist theory is a set of "substitutes" for Bayesian reasoning which were supposed to fill the "void" left after the analysis by Bayes was rejected (p. 161).

He adds:
The method pursued in the construction of such substitutes consists in general of adopting or imitating some case where the correct method reduces to a simple form based on summarizing parameters, however substituting for the true formulation and justification some incomplete and fragmentary justification or even no justification at all, as comes to seem legitimate when each notion is interpreted as something autonomous and arbitrary. For each isolated problem it appeared thus legitimate to devise as many ad hoc expedients as desired, and in fact it often happens that several are devised, proposed, and applied, to a single problem. (p. 161)
Shortly after, another rant follows:
In this manner, any notion of a systematic and meaningful interpretation of the problem of statistical inference is abandoned for the position of devising, case by case, "tests" of hypotheses or methods of "estimating" parameters. This means formulating, as an autonomous and largely arbitrary question, the problem of extracting from experience something that is apparently to be employed as though it were a conclusion or conviction, while asserting that it is neither one nor the other. (p. 162)
He is specifically angry about the "grossly inconsistent" notion of tests and hypothesis rejections, which he finds to be perverse distortions of the proper use of Bayes' rule (p. 163):
The severest of these mutilations is that of the oversimplified criteria according to which a probability P(E | H) is taken as a basis for rejecting the isolated hypothesis H if this probability, for the observation E, is small. (p. 163)
Such hypothesis rejections are, namely, ambiguous about what event E the observed data actually testifies to, as in the problem of choosing between one-sided and two-sided tests:
If, for example, as is often the case, E consists in having observed the exact value x of a random number X, such a deviation, the probability of that exact value is ordinarily zero. In order to eliminate the evident meaninglessness of this criterion that rejects the hypothesis no matter what value x may have, some other is substituted for it, such as observation of a value equal to or greater than x in absolute value, or equal or greater in absolute value and of the same sign. But all these variables are arbitrary, at least in the framework of so crudely mutilated a formulation. (p. 163)
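The arbitrariness de Finetti complains about is easy to exhibit numerically (a toy example of my own, assuming a standard normal test statistic): the one-sided and two-sided tail probabilities for the same observation differ by a factor of two, and for some observations the 5% decision flips with the choice.

```python
import math

def normal_sf(z):
    """Upper-tail probability Pr(Z >= z) for a standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

z = 1.9                                # the observed standardized deviation
one_sided = normal_sf(z)               # Pr(Z >= z)
two_sided = 2.0 * normal_sf(abs(z))    # Pr(|Z| >= z)
# At the 5% level the one-sided test rejects H, but the two-sided test does not.
```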
On the following page, he also gives the example of having to decide whether a point on a target was hit by a particular marksman. He gives various examples of sets that such a point can belong to: The singleton set containing only the point itself, a circle having the point as a center, a slice of the target containing the point, a circle having the center of the target as its center, etc.

Various ways of construing the acceptance region for a test.

He continues to say that "One might say that all the deficiencies of objectivistic statistics stem from insistence on using only what appears to be soundly based" (p. 165). This, he says, is like setting a price according to the things that are easiest to measure rather than the things that are most relevant.

The issue of building a statistical enterprise on likelihoods alone is, he contends, like a systematic attempt to find P(E | H) when you are looking for P(H | E). In an example he attributes to Halphen:
We need a cement that will not be harmed by water. The merchant advises us to buy a certain kind that, he assures us, will not harm water. He does not try to cheat us by saying that the two things are equivalent but he wants to convince us not to insist on asking for what we need (p. 173).
This is apparently a commentary on a related example used by Neyman.

De Finetti on Wald

In a series of papers from the 1940s and 50s, Abraham Wald developed a theory of "admissible decision functions" for decision problems with uncertainty (see, e.g., here). His idea was to consider a decision admissible if it minimized the maximal damage that could obtain in the given situation. This corresponds to the solution of a two-person zero-sum game against a malevolent nature.

In his discussion of Wald's theory, de Finetti helpfully "completes" the specification of a decision problem by putting a prior probability on the various hypotheses. Having provided these marginal probabilities, he comments:
Of course, these marginal elements do not appear in Wald's formulation; their absence there is just what prevents the problem of decision from having the solution that is obvious when the table is thus completed. Namely, choose the decision corresponding to the minimal (mean) loss, or equivalently to the maximal (mean) gain or the maximal (mean) utility. Here we have always put "mean" between parentheses but from now on shall suppress the word altogether; for value and utility in an uncertain situation is, by definition, the mathematical expectation of the values of utilities. (p. 179)
In his own work, Wald concluded that the admissible strategies are the mixed strategies whose support consists of pure strategies that are optimal for some parameter setting. But these are also the ones that can be rationalized by some prior probability distribution, so de Finetti happily concludes that
… the admissible decisions are the Bayesian ones; that is, those that minimize the loss with respect to some evaluation of the [prior probabilities]. (p. 181).
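A toy illustration of this equivalence (the losses are invented by me, not de Finetti's numbers): with two hypotheses and three actions, the hedging action a3 is dominated by a 50/50 mix of a1 and a2, and accordingly there is no prior under which it minimizes expected loss.

```python
# Loss table: action -> (loss if H1 true, loss if H2 true). Invented numbers.
losses = {
    "a1": (0.0, 4.0),   # bet on H1
    "a2": (4.0, 0.0),   # bet on H2
    "a3": (3.0, 3.0),   # hedge: dominated by a 50/50 mix of a1 and a2
}

def bayes_action(prior_h1):
    """The action minimizing expected loss under prior Pr(H1) = prior_h1."""
    def expected_loss(action):
        l1, l2 = losses[action]
        return prior_h1 * l1 + (1.0 - prior_h1) * l2
    return min(losses, key=expected_loss)

# Sweeping the prior over [0, 1]: the dominated action a3 is never Bayes.
chosen = {bayes_action(k / 20) for k in range(21)}
```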
Abraham Wald; image from Wikimedia.
Having thus turned Wald into a closet Bayesian, de Finetti only needs to object a bit to the distribution-free worst-case reasoning that Wald applied in order to reach his conclusion:
Wald did not explicitly recognize the role of the probability evaluation in induction and, even more, he seemed inclined to emphasize everywhere the application of the minimax principle, which is reasonable only in strategic situations (like the zero-sum-two-person case in the theory of games) or under such a superstition as that of a "malevolent nature". In spite of its shortcomings, Wald's formulation avoids the narrow interpretation of decisions as acceptance of hypotheses, and offers freedom to choose the proper decision according to a not yet openly recognized prior opinion. (p. 183)
He also later criticizes the minimax solutions on the grounds that "their initial assumptions seem rather arbitrary and artificial" (p. 198). He thus notes:
If the subjectivistic formulation were to lead to conclusions diverging from the objectivistic ones, opposition would be understandable; but the conclusions are the same. Among the admissible rules, the objectivistic theory requires that one be chosen arbitrarily, and it cannot give any criterion of preference; the subjectivistic theory does the same but explains each possible choice as corresponding to a suitable initial opinion. Why then reject this compelling unification? (p. 185)
This is, I think, quite crude, and also misses the essential concern about statistical consistency which plays such a large role in frequentist reasoning, and which has no place in Bayesian reasoning, where all priors are considered equal. Another way of saying this is that Wald would have worried as much about the admissible priors as he worried about the admissible decisions if he had turned Bayesian. A foundation for statistical reasoning cannot itself be statistical.