Tuesday, December 16, 2014

The Dialogues of Saint Gregory the Great, book II, ch. 4

How do you cure an attention deficit in the sixth century? By magic, it seems.

In his biography of Saint Benedict, Pope Gregory I relates a story of how he "cured a monk that had an idle and wandering mind." Here it is:
In one of the monasteries which he had built in those parts, a monk there was, which could not continue at prayers; for when the other monks knelt down to serve God, his manner was to go forth, and there with wandering mind to busy himself about some earthly and transitory things. And when he had been often by his Abbot admonished of this fault without any amendment, at length he was sent to the man of God, who did likewise very much rebuke him for his folly; yet notwithstanding, returning back again, he did scarce two days follow the holy man's admonition; for, upon the third day, he fell again to his old custom, and would not abide within at the time of prayer: word whereof being once more sent to the man of God, by the father of the Abbey whom he had there appointed, he returned him answer that he would come himself, and reform what was amiss, which he did accordingly: and it fell so out that when the singing of psalms was ended, and the hour come in which the monks betook themselves to prayer, the holy man perceived that the monk, which used at that time to go forth, was by a little black boy drawn out by the skirt of his garment; upon which sight, he spake secretly to Pompeianus, father of the Abbey, and also Maurus, saying : "Do you not see who it is, that draweth this monk from his prayers?" and they answered him, that they did not. "Then let us pray," quoth he, "unto God, that you also may behold whom this monk doth follow": and after two days Maurus did see him, but Pompeianus could not. Upon another day, when the man of God had ended his devotions, he went out of the oratory, where he found the foresaid monk standing idle, whom for the blindness of his heart he strake with a little wand, and from that day forward he was so freed from all allurement of the little black boy, that he remained quietly at his prayers, as other of the monks did : for the old enemy was so terrified, that he durst not any more suggest such cogitations : as though that blow, not the monk, but himself had been strooken. (pp. 61-62)
This is remarkably un-psychological: It seems the only way these people could explain a lack of concentration was by externalizing the problem in a devil or spirit, or, as here, in the "allurement of the little black boy."

And of course, a problem so concrete should be treated with a good dose of Harry Potter-style spell-casting. Unless, that is, the wand is a club, and the spell is a beating.

Monday, December 8, 2014

Edwards: Likelihood (1972)

Edwards (from his Cambridge site)
The geneticist A. W. F. Edwards is a (now retired) professor of biometry whose scientific writing was massively influenced by Ronald Fisher. His book Likelihood argues that the likelihood concept is the only sound basis for scientific inference, but it reads at times almost like one long rant against Bayesian statistics (particularly ch. 4) and Neyman-Pearson theory (particularly ch. 9).

Don't Do Probs

As an alternative to these approaches to statistics, Edwards proposes that we limit ourselves to making assertions only in terms of likelihood, "support" (log-likelihood, p. 12), and likelihood ratios. In the brief epilogue of the book, he states that this
… allows us to do most of the things which we want to do, whilst restraining us from doing some things which, perhaps, we should not do. (p. 212)
In particular, this approach emphatically prohibits the comparison of hypotheses in probabilistic terms. The kind of uncertainty we have about scientific theories is simply not, Edwards states, of a nature that can be quantified in terms of probabilities: "The beliefs are of a different kind," and they are "not commensurate" (p. 53).
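To fix ideas with a small worked example of my own (not one of Edwards'): for binomial data consisting of $k$ successes in $n$ trials, the likelihood of a parameter value $p$ is $L(p) \propto p^k(1-p)^{n-k}$, so the support for $p_1$ over $p_2$ is the log-likelihood ratio

$$S = \log\frac{L(p_1)}{L(p_2)} = k\log\frac{p_1}{p_2} + (n-k)\log\frac{1-p_1}{1-p_2}.$$

With $n=10$ and $k=7$, for instance, the support for $p=0.7$ over $p=0.5$ comes out at roughly $0.82$ natural-log units: a statement about the relative support for the two values, never a probability that either of them is true.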

The Difference Between Bad and Worse

He briefly mentions Ramsey and his Dutch book-style argument for the calculus of probability, and then goes on to speculate that, had Ramsey not died so young,
… perhaps he would have argued that his demonstration that absolute degrees of belief in propositions must, for consistency's sake, obey the law of probability, did not compel anyone to apply such a theory to scientific hypotheses. Should they decline to do so (as I do), then they might consider a theory of relative degrees of belief, such as likelihood supplies. (p. 28)
In other words, it might be true that you cannot assign numbers to propositions in any other way than according to the calculus of probabilities, but you can always refuse to hold a quantitative opinion in the first place (or decline to bet).
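The standard illustration (my numbers, not Ramsey's or Edwards'): if you announce a degree of belief of $0.6$ in a proposition $A$ and also $0.6$ in its negation, you are committed to paying $0.6$ for a ticket worth 1 if $A$ and $0.6$ for a ticket worth 1 if not-$A$. You then pay

$$0.6 + 0.6 = 1.2$$

for tickets that are jointly worth exactly 1 however things turn out, a guaranteed loss of $0.2$. Consistency in Ramsey's sense just means being immune to this kind of book; it says nothing about whether scientific hypotheses should be assigned betting rates at all.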

Nulls Only

Consistent with Fisher's approach to statistics, Edwards finds it important to distinguish between null and non-null hypotheses: That is, in opposition to Neyman-Pearson theory, he refuses to explicitly formulate the alternative hypothesis against which a chance hypothesis is tested.

Here as elsewhere, this is a serious limitation with quite profound consequences:
It should be noted that the class of hypotheses we call 'statistical' is not necessarily closed with respect to the logical operations of alternation ('or') and negation ('not'). For a hypothesis resulting from either of these operations is likely to be composite, and composite hypotheses do not have well-defined statistical consequences, because the probabilities of occurrence of the component simple hypotheses are undefined. For example, if $p$ is the parameter of a binomial model, about which inferences are to be made from some particular binomial results, '$p=\frac{1}{2}$' is a statistical hypothesis because its consequences are well-defined in probability terms, but its negation, '$p\neq\frac{1}{2}$', is not a statistical hypothesis, its consequences being ill-defined. Similarly, '$p=\frac{1}{4}$ or $p=\frac{1}{2}$' is not a statistical hypothesis, except in the trivial case of each simple hypothesis having identical consequences. (p. 5)
This should also be contrasted with Jeffreys' approach, in which the alternative hypothesis has a free parameter and thus is allowed to 'learn', while the null has the parameter fixed at a certain value.
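Spelled out (my gloss, using Edwards' own binomial example): the simple hypothesis $p=\frac{1}{2}$ assigns the observation of $k$ successes in $n$ trials the definite probability

$$P\left(k \mid p=\tfrac{1}{2}\right) = \binom{n}{k}\left(\tfrac{1}{2}\right)^{n},$$

but the composite hypothesis $p\neq\frac{1}{2}$ assigns it no number at all until we supply a weighting $\pi(p)$ over its component values,

$$P\left(k \mid p\neq\tfrac{1}{2}\right) = \int_0^1 \binom{n}{k}\, p^k (1-p)^{n-k}\, \pi(p)\, dp,$$

and supplying that prior is exactly the step Jeffreys takes and Edwards refuses.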

Scientists With Attitude

At several points in the book, Edwards uses the concerns of the working scientist as an argument in favor of a likelihood-based reasoning calculus. He thus faults Bayesian statistics for "fail[ing] to answer questions of the type many scientists ask" (p. 54).

This question, I presume, is "What do the data tell me about my hypotheses?" This is distinct from "What should I do?" or "Which of these hypotheses is correct?" in that it only supplies an objective, quantitative measure of support, not a conclusion:
The scientist must be the judge of his own hypotheses, not the statistician. The perpetual sniping which statisticians suffer at the hands of practising scientists is largely due to their collective arrogance in presuming to direct the scientists in his consideration of hypotheses; the best contribution they can make is to provide some measure of 'support', and the failure of all but a few to admit the weaknesses of the conventional approaches has not improved the scientists' opinion. (p. 34)
In brief form, this leads to the following tirade against Bayesian statistics:
Inverse probability, in its various forms, is considered and rejected on the grounds of logic (concerning the representation of ignorance), utility (it does not allow answers in the form desired), oversimplicity (in problems involving the treatment of frequency probabilities) and inconsistency (in the allocation of prior probability distributions). (pp. 67–68)

Fisher: The Design of Experiments (4th ed., 1947), Chapter I

Fisher (from Judea Pearl's website)
Now, here's a revealing turn of phrase:
In the foregoing paragraphs the subject-matter of this book has been regarded from the point of view of an experimenter, who wishes to carry out his work competently, and having done so wishes to safeguard his results, so far as they are validly established, from ignorant criticism by different sorts of superior persons. (p. 3)
You could hardly spell out more explicitly the philosophy that lies behind Fisher's concept of statistics: It's a strategic ritual, designed not to ensure a result, but to protect against criticism.

Perfectly Rigorous and Unequivocal

Such protection only goes as far as the mathematical consensus on the validity of the logic. But Fisher goes on to state that "rigorous deductive argument" is possible even in the context of random events, citing gambling as a proof of concept:
The mere fact that inductive inferences are uncertain cannot, therefore, be accepted as precluding perfectly rigorous and unequivocal inference. (p. 4)
This seems to confuse the issues of probability and statistics, unless his argument here really only amounts to saying that distributions are non-stochastic entities.

Useless for Scientific Purposes

This leads him to a discussion of "inverse probability," which he gives three reasons for rejecting: First,
… advocates of inverse probability seem forced to regard mathematical probability, not as an objective quantity measured by observed frequencies, but as measuring merely psychological tendencies, theorems respecting which are useless for scientific purposes. (pp. 6–7)
Second, Bayes' axiom (about the flat prior for a coin flip) is not self-evident; that is, the choice of prior is not unequivocal (p. 7).

Ever Since the Dawn of Man…

And third,
… inverse probability has been only very rarely used in the justification of conclusions from experimental facts, although the theory has been widely taught, and is widespread in the literature of probability. Whatever the reasons are which could give experimenters confidence that they can draw valid conclusions from their results, they seem to act just as powerfully whether the experimenter has heard of the theory of inverse probability or not. (p. 7)
That's a funny sociological proof, given that he has just rejected Bayesian statistics for its psychologism. But he himself sometimes seems to think that his statistics is a kind of theory of learning, whatever that means:
Men have always been capable of some mental processes of the kind we call "learning by experience." … Experimental observations are only experience carefully planned in advance, and designed to form a secure basis of new knowledge; (p. 8)

Saturday, December 6, 2014

Chernoff: "A career in statistics" (2014)

Chernoff makes some interesting remarks about the philosophy of statistics in his recent autobiographical essay.

First, an anecdote about the three classical decision criteria considered in decision theory:
I had always been interested in the philosophical issues in statistics, and Jimmie Savage claimed to have resolved one. Wald had proposed the minimax criterion for deciding how to select one among the many “admissible” strategies. Some students at Columbia had wondered why Wald was so tentative in proposing this criterion. The criterion made a good deal of sense in dealing with two-person zero-sum games, but the rationalization seemed weak for games against nature. In fact, a naive use of this criterion would suggest suicide if there was a possibility of a horrible death otherwise. Savage pointed out that in all the examples Wald used, his loss was not an absolute loss, but a regret for not doing the best possible under the actual state of nature. He proposed that minimax regret would resolve the problem. At first I bought his claim, but later discovered a simple example where minimax regret had a similar problem to that of minimax expected loss. For another example the criterion led to selecting the strategy A, but if B was forbidden, it led to C and not A. This was one of the characteristics forbidden in Arrow’s thesis.
Savage tried to defend his method, but soon gave in with the remark that perhaps we should examine the work of de Finetti on the Bayesian approach to inference. He later became a sort of high priest in the ensuing controversy between the Bayesians and the misnamed frequentists. (pp. 32–33)
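Chernoff doesn't give the numbers, but it is easy to cook up a payoff table with the structure he describes. The figures below are entirely my own invention, chosen only to reproduce the effect: minimax regret picks A out of {A, B, C}, yet once B is forbidden it picks C rather than A, which is exactly the independence-of-irrelevant-alternatives violation he alludes to.

import numpy as np

# Hypothetical payoff table (rows: strategies A, B, C; columns: states of nature).
# Higher is better. The numbers are made up to reproduce the structure Chernoff
# describes; they are not taken from his paper.
payoffs = {"A": [5, 1], "B": [10, 0], "C": [2, 8]}

def minimax_regret_choice(strategies):
    table = np.array([payoffs[s] for s in strategies])
    col_max = table.max(axis=0)           # best attainable payoff in each state
    regret = col_max - table              # regret of each strategy in each state
    worst = regret.max(axis=1)            # each strategy's worst-case regret
    return strategies[int(worst.argmin())]

print(minimax_regret_choice(["A", "B", "C"]))  # prints A
print(minimax_regret_choice(["A", "C"]))       # prints C, not A

The flip happens because removing B lowers the best attainable payoff in the first state, which lowers every remaining strategy's regret there: C's worst-case regret collapses from 8 to 3, while A's, determined by the second state, stays at 7.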
He immediately moves on to one of his own more dismal conclusions about the issue:
I posed a list of properties that an objective scientist should require of a criterion for decision theory problems. There was no criterion satisfying that list in a problem with a finite number of states of nature, unless we canceled one of the requirements. In that case the only criterion was one of all states being equally likely. To me that meant that there could be no objective way of doing science. I held back publishing those results for a few years hoping that time would resolve the issue (Chernoff, 1954). (p. 33)
I haven't read the paper he is referring to here, but it seems like the text has jumbled up the conclusions: I think what he meant to say is that there is no single good decision function when we have infinitely many states, since the criteria essentially require us to use a uniform distribution. But I would have to check the details.

Finally, he moves on to a more on-record explication of his position:
In the controversy, I remained a frequentist. My main objection to Bayesian philosophy and practice was based on the choice of the prior probability. In principle, it should come from the initial belief. Does that come from birth? If we use instead a non-informative prior, the choice of one may carry hidden assumptions in complicated problems. Besides, the necessary calculation was very forbidding at that time. The fact that randomized strategies are not needed for Bayes procedures is disconcerting, considering the important role of random sampling. On the other hand, frequentist criteria lead to the contradiction of the reasonable criteria of rationality demanded by the derivation of Bayesian theory, and thus statisticians have to be very careful about the use of frequentist methods. 
In recent years, my reasoning has been that one does not understand a problem unless it can be stated in terms of a Bayesian decision problem. If one does not understand the problem, the attempts to solve it are like shooting in the dark. If one understands the problem, it is not necessary to attack it using Bayesian analysis. My thoughts on inference have not grown much since then in spite of my initial attraction to statistics that came from the philosophical impact of Neyman–Pearson and decision theory. (p. 33)

Duda and Hart: Pattern Classification and Scene Analysis (1973)

A lot of references seem to indicate that this book played an important role in the popularization of Bayesian methods in machine learning. It also provides an interesting missing link between the statistics and decision theory of the 1950s and the field of machine learning as it exists today.

Interestingly, their rejection of minimax approaches to decision theory is rather casual, given how toxic the debate actually was:
In fact, the Bayesian approach is avoided by many statisticians, partly because there are problems for which a decision is made only once (so that average loss is not meaningful), and partly because there may be no reasonable way to determine the a priori probabilities. Neither of these difficulties seems to present a serious problem in typical pattern recognition applications, and for simplicity we have taken a strictly Bayesian approach. (p. 36)
Compare this to David Blackwell's compact statement of intent in Basic Statistics (1969):
This book indicates the content of a lower-division basic statistics course I have taught several times at Berkeley. […] The approach is intuitive, informal, concrete, decision-theoretic, and Bayesian. (p. v)
Duda and Hart also provide a number of quite interesting references:
The text by Nilsson (1965) provides an exceptionally clear treatment of classification procedures. (p. 8)
There are many interesting subject areas that are related to this book but beyond its scope. […] Those interested in philosophical issues will find the books by Watanabe (1969) and Bongard (1970) thought provoking. (p. 8)
We are also fond of the text by Ferguson (1967), who presents many topics in statistics from a decision-theoretic viewpoint. (p. 36)
Chow (1957) was one of the first to apply Bayesian decision theory to pattern recognition. His analysis included a provision for rejection, and he later established a fundamental relation between error and reject rates (Chow 1970). (p. 36)