Notebooks on Language: Fisher: "Statistical Methods and Scientific Induction" (1955)

Tuesday, April 15, 2014

Ronald Fisher; image from Wikimedia Commons.

In this brief paper, Sir Ronald Fisher argues against what he sees as wrong and absurd interpretations of the notion of a statistical test.

The Ideology of Statistics

The core of his argument is that a test gives positive information only when it yields a significant difference and thus warrants the rejection of a hypothesis; the absence of a significant difference does not mean "accept." He contends that

… this difference in point of view originated when Neyman, thinking that he was correcting and improving my own early work on tests of significance, … in fact reinterpreted them in terms of that technological and commercial apparatus which is known as an acceptance procedure. (p. 69)

And although acceptance procedures might be good enough for commerce, they have no place in science:

I am casting no contempt on acceptance procedures, and I am thankful, whenever I travel by air, that the high level of precision and reliability required can really be achieved by such means. But the logical differences between such an operation and the work of scientific discovery by physical or biological experimentation seem to me so wide that the analogy between them is not helpful, and the identification of the two sorts of operation is decidedly misleading. (pp. 69–70)

Then comes the juicy part:

I shall hope to bring out some of the logical differences more distinctly, but there is also, I fancy, in the background an ideological difference. Russians are made familiar with the ideal that research in pure science can and should be geared to technological performance, in the comprehensive organized effort of a five-year plan for the nation. How far, within such a system, personal and individual inferences from observed facts are permissible we do not know, but it may be safer, and even, in such a political atmosphere, more agreeable, to regard one's scientific work simply as a contributary element in a great machine, and to conceal rather than to advertise the selfish and perhaps heretical aim of understanding for oneself the scientific situation. In the U.S. also the great importance of organized technology has I think made it easy to confuse the process appropriate for drawing correct conclusions, with those aimed rather at, let us say, speeding production, or saving money. There is therefore something to be gained by at least being able to think of our scientific problems in a language distinct from that of technological efficiency. (p. 70)

So there you have it: In the technological regime of either of the two Cold War superpowers, "learning," "inference," and private, inner thought are taboo, according to Fisher. Presumably we are to contrast this with the aims of British science going back to Newton.

The Three Issues

Fisher singles out three phrases that he finds particularly offensive in scientific statistics:

"Repeated sampling from the same distribution"
Errors of the "second kind"
"Inductive behaviour"

I'll discuss these one by one.

1. "Repeated sampling from the same distribution"

The issue with the first one is not completely clear to me, but here is what I make of his discussion (pp. 71–72): Suppose you are performing a test to see whether the mean of some population has a specific value; suppose further that the standard deviation of that population is unknown, but that you have estimated it based on the available sample.

The problem then is, if I understand Fisher correctly, that the test depends on the standard deviation being constant and known, but in reality, it is an unknown quantity that you have estimated by a maximum likelihood method. This is, strictly speaking, illegitimate, since any estimate should be based on a numerous and representative sample; but since the standard deviation is a property of samples of size N, you should really have M samples of a sample of size N in order to have some data to estimate from. But clearly, this sets a far too high standard for the amount of data required.

It's a convoluted argument, but I think it makes sense from a rigorously frequentist standpoint: If parameters are consistently interpreted as frequencies, then the only legitimate statistical procedure for learning about an unknown quantity t is to obtain a large number of samples dependent on t and then wait for the law of large numbers to kick in.

Strictly speaking, this means that the number of data points you need in order to estimate all the parameters in a model will grow exponentially in the number of parameters. That sounds sort of crazy, but if you do not allow yourself to have any model in the absence of data, you really have to wait for the data to overwhelm your initial ignorance before you can say that you have a model of the situation. That takes time.
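To make the setup in this section concrete, here is a minimal Python sketch of "repeated sampling from the same distribution": many samples of size N are drawn from one normal population, and each is tested against the true mean using a standard deviation estimated from that sample alone. The population values, the sample size of 10, the seed, and all function names are assumptions of mine for illustration, not anything taken from Fisher's paper. The sketch uses the usual n - 1 sample standard deviation of the t-test rather than the maximum likelihood estimate.

```python
import random
import statistics

T_CRIT_DF9 = 2.262  # two-sided 5% critical value of Student's t, 9 degrees of freedom
Z_CRIT = 1.960      # two-sided 5% critical value of the standard normal


def t_statistic(sample, m0):
    """Distance of the sample mean from m0, in units of the estimated standard error."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)  # estimated from this sample only (n - 1 denominator)
    return (mean - m0) / (sd / n ** 0.5)


def rejection_rate(critical, m=0.0, sigma=3.0, n=10, trials=10_000, seed=1955):
    """Fraction of repeated samples in which the true hypothesis 'mean = m' is rejected."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(trials):
        sample = [rng.gauss(m, sigma) for _ in range(n)]
        if abs(t_statistic(sample, m)) > critical:
            rejected += 1
    return rejected / trials


# Same seed, hence the same simulated samples, judged against two cutoffs.
rate_t = rejection_rate(T_CRIT_DF9)  # cutoff that accounts for the estimated SD
rate_z = rejection_rate(Z_CRIT)      # cutoff that pretends the SD is known

print(f"rejection rate with t cutoff: {rate_t:.3f}")
print(f"rejection rate with z cutoff: {rate_z:.3f}")
```

Under these assumptions the t cutoff should reject close to 5% of the time, while the normal cutoff rejects noticeably more often, which shows how much the test depends on whether the standard deviation is treated as known or estimated.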

2. Errors of the "Second Kind"

Errors of the first kind are false rejections: Cases in which, for instance, a population in fact has mean m, but nevertheless exhibits a sample average so far away from m that the hypothesis is rejected. (In the now-standard terminology these are called "false positives," since the test wrongly signals a significant difference.) Such errors have a frequentist interpretation, because the likelihoods given m are well-defined even in the absence of a prior distribution over m.

Errors of the second kind are false acceptances ("false negatives" in the standard terminology): Some other mean m' different from m produces a sample average so close to m that the false hypothesis of a mean of m is confirmed. This kind of error has no frequentist interpretation, because it requires the alternative hypotheses m' to have prior probabilities, and because it requires that there be a loss function associated with accepting the hypothesis of m when the true mean m' is close to m.

Jerzy Neyman in the classroom, 1973; image from Wikimedia Commons.

Fisher is not willing to assume either of those two instruments. He writes:

It was only when the relation between a test of significance and its corresponding null hypothesis was confused with an acceptance procedure that it seemed suitable to distinguish errors in which the hypothesis is rejected wrongly, from errors in which it is "accepted wrongly" as the phrase does. (p. 73)

Such language is not just scientifically irresponsible, he thinks — it also misunderstands the private states of mind present in the head of a scientist:

The fashion of speaking of a null hypothesis as "accepted when false", whenever a test of significance gives us no strong reason for rejecting it, and when in fact it is in some way imperfect, shows real ignorance of the research worker's attitude, by suggesting that in such a case he has come to an irreversible decision. (p. 73; Fisher's emphasis)

Of course, neither positive nor negative decisions are immune to revision as more data comes in (cf. p. 76), so Fisher prefers to depict the scientist's attitude as one of cautious learning in the face of data. This contrasts with the forced-choice nature of acceptance procedures:

In an acceptance procedure, on the other hand, acceptance is irreversible, whether the evidence for it was strong or weak. It is the result of applying mechanically rules laid down in advance; no thought is given to the particular case, and the tester's state of mind, or his capacity for learning, is inoperative.

By contrast, conclusions drawn by a scientific worker from a test of significance are provisional, and involve an intelligent attempt to understand the experimental situation. (pp. 73–74; Fisher's emphasis).

Note again the insistence on private states of mind as the hallmark of scientific rationality.
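The asymmetry between the two kinds of error discussed in this section can be sketched numerically. The following Python snippet assumes a two-sided test on a normal population with known standard deviation, which is a simplification of mine; the numbers (m = 10, sigma = 2, n = 25) and the function names are hypothetical. It shows that the rate of the first kind of error is fixed by the test alone, whereas the rate of the second kind can only be computed once a specific alternative m' is supplied.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def error_first_kind(alpha=0.05):
    """Probability of rejecting 'mean = m' when it is true; depends only on the cutoff."""
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 2 * (1 - STD_NORMAL.cdf(z))


def error_second_kind(m, m_alt, sigma, n, alpha=0.05):
    """Probability of not rejecting 'mean = m' when the true mean is m_alt."""
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Standardized distance between the alternative and the tested mean.
    shift = (m_alt - m) * sqrt(n) / sigma
    return STD_NORMAL.cdf(z - shift) - STD_NORMAL.cdf(-z - shift)


m, sigma, n = 10.0, 2.0, 25           # hypothetical test setup
alternatives = [10.2, 11.0, 12.0]     # hypothetical values of m'
rates = [error_second_kind(m, m_alt, sigma, n) for m_alt in alternatives]

print(f"first kind: {error_first_kind():.3f}")
for m_alt, rate in zip(alternatives, rates):
    print(f"second kind if m' = {m_alt}: {rate:.3f}")
```

The second-kind rate is high for alternatives close to m and shrinks as m' moves away, so a single number for it exists only relative to a chosen m' (or to a weighting over several of them).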

3. "Inductive Behaviour"

The last issue Fisher has with Neyman's brand of statistics is shelved under the heading above, but it is really about an issue of linguistics: Neyman contends (according to Fisher's summary; there is no direct reference) that statements like

There is a 5% probability that the sample average deviates strongly from the mean

have a meaningful and well-defined interpretation (in terms of likelihood). On the other hand,

There is a 5% probability that the mean deviates strongly from the sample average

is meaningless, because the mean is not a random variable.

Fisher disagrees, not because he is a fan of prior probability distributions on the parameters, but because he thinks that such statements could only ever refer to likelihoods. To make this point vivid, he considers (I am changing the example a bit here) a statement of the form

Pr(m < x) = 5%,

where m is a parameter and x is an observation, and he contrasts this with

Pr(m < 17) = 5%.

If one of these statements has a meaning, he says, clearly the other one must have a meaning too, unless we want to "deny the syllogistic process of making a substitution" (p. 75). But Neyman contends that the probability of a statement of the second kind should be "necessarily either 0 or 1" (p. 75), so that only the former probability (the likelihood given the mean) is well-defined.

Fisher comments:

The paradox is rather childish, for it requires that we should wilfully misinterpret the probability statement so as to pretend that the population to which it refers is not defined by our observations and their precision, but is absolutely independent of them. (p. 75)

By this he means that the reference class (the "population") is defined arbitrarily by our experimental set-up. And as he says about populations earlier in the paper, "no one of them has objective reality, all being products of the statistician's imagination" (p. 71).
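The contrast between the two probability statements in this section can be illustrated with a small simulation. This is a sketch of Neyman's reading as summarized above, not of Fisher's own view, and it rests on assumptions of mine: the observation x is normal around m with unit standard deviation, the bound is x - 1.645 so that the statement really does hold 5% of the time, and the values 15 and 18.645 are hypothetical.

```python
import random

Z_95 = 1.645  # upper 5% point of the standard normal distribution


def lower_bound(x):
    """Bound computed from the observation; 'm < lower_bound(x)' holds 5% of the time."""
    return x - Z_95


def frequency_below_bound(m, trials=20_000, seed=75):
    """How often 'm < lower_bound(x)' is true when x is drawn afresh each time."""
    rng = random.Random(seed)
    hits = sum(m < lower_bound(rng.gauss(m, 1.0)) for _ in range(trials))
    return hits / trials


m = 15.0                      # hypothetical fixed parameter
freq = frequency_below_bound(m)

observed_x = 18.645           # one hypothetical observation; the bound is then about 17
statement = m < lower_bound(observed_x)  # the substituted claim "m < 17"

print(f"frequency of 'm < x - 1.645' over repeated x: {freq:.3f}")
print(f"'m < 17' for this fixed m: {statement}")
```

Over repeated draws of x the statement holds about 5% of the time, but once a particular number is substituted and m is held fixed, the claim is simply true or false. That is the sense in which Neyman assigns it probability 0 or 1, and the step Fisher objects to.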

An Englishman's Duty

In the conclusion, Fisher comes back to the ethical standards of statistics:

As an act of construction the hypothesis is not altogether impersonal, for the scientist's personal capacity for theorizing comes into it; moreover, the criteria by which it is approved require a certain honesty, or integrity, in their application. (p. 75)

Again, he explains that decision-theoretic methods (such as Bayesian statistics) have no business in scientific inference, since the goal is not optimal decisions, but the attainment of truth:

Finally, in inductive inference we introduce no cost functions for faulty judgments … In fact, scientific research is not geared to maximize the profits of any particular organization, but is rather an attempt to improve public knowledge undertaken as an act of faith to the effect that, as more becomes known, or more surely known, the intelligent pursuit of a great variety of aims, by a great variety of men, and groups of men, will be facilitated. We make no attempt to evaluate these consequences, and do not assume that they are capable of evaluation in any sort of currency.

… We aim, in fact, at methods of inference which should be equally convincing to all rational minds, irrespective of any intentions they may have in utilizing the knowledge inferred.

We have the duty of formulating, of summarizing, and of communicating our conclusions, in intelligible form, in recognition of the right of other free minds to utilize them in making their own decisions. (p. 77)

We could hardly have it more explicit: The difference in statistical paradigm is one of ethics.
