
Friday, May 8, 2015

Edwards on Fisher (2005)

In his discussion of Fisher's 1925 book Statistical Methods for Research Workers, Edwards writes:
It was not until 1950 that the word ‘Bayesian’ was coined, by Fisher himself, to refer to inverse probability (p. 862)
Can this really be true? To be sure, "inverse probability" was the more common term before 1950, but was "Bayesian" really never used, at all?

I thought this claim would not stand up to 30 seconds of googling, but I was wrong. The only reference I've found so far is from the bibliography in Jimmy Savage's Lecture Notes on Mathematical Statistics, which includes the following item:
Jeffreys, Harold. Theory of Probability. New York: Oxford University Press, 1939. 380 pp. A highly controversial book on the philosophical foundations of statistics by the most foremost modern exponent of the Bayesian heresy. Examples largely from geophysics.
I don't have access to Savage's book here, so I can't vouch for the year of publication, but Amazon, WorldCat, Google Books, and a few other resources all give the date as 1947.

If that's true, then the word "Bayesian" was used at least once before 1950. But one is certainly not a crowd.

Monday, December 8, 2014

Edwards: Likelihood (1972)

Edwards (from his Cambridge site)
The geneticist A. W. F. Edwards is a (now retired) professor of biometry whose scientific writing was massively influenced by Ronald Fisher. His book Likelihood argues that the likelihood concept is the only sound basis for scientific inference, but it reads at times almost like one long rant against Bayesian statistics (particularly ch. 4) and Neyman-Pearson theory (particularly ch. 9).

Don't Do Probs

As an alternative to these approaches to statistics, Edwards proposes that we limit ourselves to making assertions only in terms of likelihood, "support" (log-likelihood, p. 12), and likelihood ratios. In the brief epilogue of the book, he states that this
… allows us to do most of the things which we want to do, whilst restraining us from doing some things which, perhaps, we should not do. (p. 212)
In particular, this approach emphatically prohibits the comparison of hypotheses in probabilistic terms. The kind of uncertainty we have about scientific theories is simply not, Edwards states, of a nature that can be quantified in terms of probabilities: "The beliefs are of a different kind," and they are "not commensurate" (p. 53).

The Difference Between Bad and Worse

He briefly mentions Ramsey and his Dutch book-style argument for the calculus of probability, and then goes on to speculate that, had Ramsey not died so young,
… perhaps he would have argued that his demonstration that absolute degrees of belief in propositions must, for consistency's sake, obey the law of probability, did not compel anyone to apply such a theory to scientific hypotheses. Should they decline to do so (as I do), then they might consider a theory of relative degrees of belief, such as likelihood supplies. (p. 28)
In other words, it might be true that you cannot assign numbers to propositions in any way other than according to the calculus of probabilities, but you can always decline to have a quantitative opinion in the first place (or not make a bet).

Nulls Only

Consistently with Fisher's approach to statistics, Edwards finds it important to distinguish between null and not-null hypotheses: That is, in opposition to Neyman-Pearson theory, he refuses to explicitly formulate the alternative hypothesis against which a chance hypothesis is tested.

Here as elsewhere, this is a serious limitation with quite profound consequences:
It should be noted that the class of hypotheses we call 'statistical' is not necessarily closed with respect to the logical operations of alternation ('or') and negation ('not'). For a hypothesis resulting from either of these operations is likely to be composite, and composite hypotheses do not have well-defined statistical consequences, because the probabilities of occurrence of the component simple hypotheses are undefined. For example, if $p$ is the parameter of a binomial model, about which inferences are to be made from some particular binomial results, '$p=\frac{1}{2}$' is a statistical hypothesis because its consequences are well-defined in probability terms, but its negation, '$p\neq\frac{1}{2}$', is not a statistical hypothesis, its consequences being ill-defined. Similarly, '$p=\frac{1}{4}$ or $p=\frac{1}{2}$' is not a statistical hypothesis, except in the trivial case of each simple hypothesis having identical consequences. (p. 5)
This should also be contrasted with Jeffreys' approach, in which the alternative hypothesis has a free parameter and thus is allowed to 'learn', while the null has the parameter fixed at a certain value.
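For contrast, here is a minimal sketch of the Jeffreys-style comparison, with a uniform prior standing in for the 'learning' alternative; the prior choice and the data values are my illustration, not Edwards' or Jeffreys' numbers:

from math import comb

def bayes_factor(s, m):
    """Marginal likelihood of a free p (uniform prior) over that of p = 1/2."""
    marginal = 1 / (m + 1)          # integral of C(m,s) p^s (1-p)^(m-s) dp over [0,1]
    null = comb(m, s) * 0.5 ** m    # likelihood under the fixed null p = 1/2
    return marginal / null

print(bayes_factor(60, 100))   # about 0.91: 60 heads in 100 still slightly favors the null

The point of the contrast: the alternative here is a genuine statistical hypothesis, because the prior turns the composite "p is free" into something with well-defined probabilistic consequences.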

Scientists With Attitude

At several points in the book, Edwards uses the concerns of the working scientist as an argument in favor of a likelihood-based reasoning calculus. He thus faults Bayesian statistics for "fail[ing] to answer questions of the type many scientists ask" (p. 54).

This question, I presume, is "What does the data tell me about my hypotheses?" This is distinct from "What should I do?" or "Which of these hypotheses is correct?" in that it only supplies the objective, quantitative measure of support, not the conclusion:
The scientist must be the judge of his own hypotheses, not the statistician. The perpetual sniping which statisticians suffer at the hands of practising scientists is largely due to their collective arrogance in presuming to direct the scientist in his consideration of hypotheses; the best contribution they can make is to provide some measure of 'support', and the failure of all but a few to admit the weaknesses of the conventional approaches has not improved the scientists' opinion. (p. 34)
In brief form, this leads to the following tirade against Bayesian statistics:
Inverse probability, in its various forms, is considered and rejected on the grounds of logic (concerning the representation of ignorance), utility (it does not allow answers in the form desired), oversimplicity (in problems involving the treatment of frequency probabilities) and inconsistency (in the allocation of prior probability distributions). (p. 67–68)

Saturday, July 26, 2014

Wald: Sequential Analysis (1947)

I've always thought that Shannon's insights in the 1940s papers seemed like pure magic, but this book suggests that parts of his way of thinking were already in the air at the time: From his own frequentist perspective, Wald comes oddly close to defining a version of information theory.

The central question of the book is:
How can we devise decision procedures that map observations into {Accept, Reject, Continue} in such a way that (1) the probability of wrongly choosing Accept or Reject is low, and (2) the expected number of Continue decisions is low?
Throughout the book, the answer that he proposes is to use likelihood ratio tests. This puts him strangely close to the Bayesian tradition, including for instance Chapter 5 of Jeffreys' Theory of Probability.

A sequential binomial test ending in rejection (p. 94)
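Here is a minimal sketch of the kind of decision procedure Wald has in mind, specialized to coin flips. The function names, the error rates, and the use of Wald's approximate thresholds A ≈ (1 − β)/α and B ≈ β/(1 − α) are my rendering, not the book's notation:

import math
import random

def sprt_coin(p0, p1, alpha, beta, flip, max_steps=10_000):
    """Sequential test of Pr(heads) = p0 against Pr(heads) = p1.

    Accumulates the log-likelihood ratio observation by observation and
    stops when it crosses Wald's approximate thresholds.
    """
    upper = math.log((1 - beta) / alpha)   # crossing above: accept p1
    lower = math.log(beta / (1 - alpha))   # crossing below: accept p0
    llr = 0.0
    for n in range(1, max_steps + 1):
        heads = flip()
        llr += math.log(p1 / p0) if heads else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept p1", n
        if llr <= lower:
            return "accept p0", n
    return "continue", max_steps

random.seed(1)
print(sprt_coin(0.5, 0.7, alpha=0.05, beta=0.05,
                flip=lambda: random.random() < 0.7))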

In particular, the coin flipping example that Jeffreys considers in his Chapter 5.1 is very close in spirit to the sequential binomial test that Wald considers in Chapter 5 of Sequential Analysis. However, some differences are:
  • Wald compares two given parameter values p0 and p1, while Jeffreys compares a model with a free parameter to one with a fixed value for that parameter.
  • Jeffreys assigns prior probabilities to everything; but Wald only uses the likelihoods given the two parameters. From his perspective, the statistical test will thus have different characteristics depending on what the underlying situation is, and he performs no averaging over these values.
This last point also means that Wald is barred from actually inventing the notion of mutual information, although he comes very close. Since he cannot take a single average over the log-likelihood ratio, he cannot compute any single statistic, but always has to bet on two horses simultaneously, as the sketch below illustrates.
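To make the two-horses point concrete: the expected per-flip drift of the log-likelihood ratio is +KL(p1 ∥ p0) if p1 is true and −KL(p0 ∥ p1) if p0 is true, and these are two different numbers that cannot be collapsed into one without a prior over the hypotheses. A small sketch with example values of my own choosing:

import math

def drift(p_true, p0, p1):
    """Expected per-flip increment of log(p1/p0) when Pr(heads) = p_true."""
    return (p_true * math.log(p1 / p0)
            + (1 - p_true) * math.log((1 - p1) / (1 - p0)))

p0, p1 = 0.5, 0.7
print(drift(p1, p0, p1))   # +KL(p1 || p0): upward drift when p1 is true
print(drift(p0, p0, p1))   # -KL(p0 || p1): downward drift when p0 is true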

Wednesday, April 2, 2014

Jeffreys: Scientific Inference, third edition (1973)

This book, first published in 1931, covers much of the same ground as Jeffreys' Theory of Probability (1939), but it's shorter and easier to read.

It touches on a number of extremely interesting topics, including
  • the asymptotic equivalence of Bayesian inference and maximum likelihood inference (§4.3, pp. 73–74)
  • a Bayesian notion of "statistical test" (§3.5, pp. 54–56)
  • priors based on theory complexity (§2.5, especially pp. 34–39)
  • the convergence of predictive distributions to the truth under (very) nice conditions (§2.5)
  • an inductive justification of the principle of induction through hierarchical models (§3.7, pp. 58–60)
  • a critique of the frequency theory of probability (§9.21, pp. 193–197)
  • a number of other philosophical issues surrounding induction and probability (§9)
I might write about some of these issues later, but now I want to focus on a specific little detail that I liked. It's a combinatorial argument for Laplace's rule, which I have otherwise only seen justified through the use of Euler integrals.


Laplace's rule: Add a "virtual count" to each bucket before the parameter estimation.

The Generative Set-Up

Suppose that you have a population of n swans, r of which are white. We'll assume that r is uniformly distributed on 0, 1, 2, …, n. You now inspect a sample of m < n swans and find s white swans among them.

It then turns out that the probability that the next swan is white is completely independent of n: Whatever the size of the population is, the probability of seeing one more white swan turns out to be (s + 1)/(m + 2) when we integrate out the effect of r.

A population of n swans contains r white ones;
in a sample of m swans, s are white.

Let me go into a little more detail. Given n, m, and r, the probability of finding s white swans in the sample follows a hypergeometric distribution; that is,
Pr( s | n, m, r )  ==  C(r, s) C(n − r, m − s) / C(n, m),
where C(a, b) is my one-dimensional notation for the binomial coefficient "a choose b." The argument for this formula is that
  • C(r, s) is the number of ways of choosing s white swans out of a total of r white swans.
  • C(n − r, m − s) is the number of ways of choosing the remaining m − s swans in the sample from the remaining n − r swans in the population.
  • C(n, m) is the total number of ways to sample m swans from a population of n.
The numerator thus counts the number of ways to select the sample so that it respects the constraint set by the number s, while the denominator counts the number of samples with or without this constraint.
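Here is the formula transcribed into Python, with math.comb playing the role of C(a, b); the spot-check values are mine:

from math import comb

def hypergeom(s, n, m, r):
    """Pr(s white in a sample of m | population of n with r white)."""
    return comb(r, s) * comb(n - r, m - s) / comb(n, m)

n, m, r = 20, 5, 8
probs = [hypergeom(s, n, m, r) for s in range(m + 1)]
print([round(p, 4) for p in probs])
assert abs(sum(probs) - 1) < 1e-12   # the pmf sums to one over s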

Inverse Hypergeometric Probabilities

In general, binomial coefficients have the following two properties:
  • C(a, b) (a − b)  ==  C(a, b + 1) (b + 1)
  • C(a + 1, b + 1)  ==  C(a, b) (a + 1)/(b + 1)
We'll need both of these facts below. They can be shown directly by cancelling out factors in the factorial expression for the binomial coefficients.
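Both identities are exact in integer arithmetic and easy to spot-check numerically (the range of values is arbitrary and mine):

from math import comb

for a in range(1, 10):
    for b in range(a):
        assert comb(a, b) * (a - b) == comb(a, b + 1) * (b + 1)
        assert comb(a + 1, b + 1) * (b + 1) == comb(a, b) * (a + 1)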

One consequence is that Bayes' rule takes on a particularly simple form in the hypergeometric case:
  • Pr( r | n, m, s )  ==  Pr( s | n, m, r ) (m + 1)/(n + 1)
  • Pr( s | n, m, r )  ==  Pr( r | n, m, s ) (n + 1)/(m + 1)
  • Pr( s + 1 | n, m + 1, r )  ==  Pr( r | n, m + 1, s + 1 ) (n + 1)/(m + 2)
The first two equalities are, of course, saying the same thing, and the third is the same identity with m and s replaced by m + 1 and s + 1 (which turns the factor (m + 1) into (m + 2)). I state all three forms because they will all come up.
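A quick numerical check of the first and third of these relations, computing the posterior over r directly from the uniform prior; the population size and sample counts are arbitrary choices of mine:

from math import comb

def hypergeom(s, n, m, r):
    """Pr(s | n, m, r); math.comb returns 0 for impossible draws."""
    return comb(r, s) * comb(n - r, m - s) / comb(n, m)

n, m, s = 12, 4, 2
# Posterior over r, obtained by normalizing the likelihood against the uniform prior:
lik = [hypergeom(s, n, m, r) for r in range(n + 1)]
post = [x / sum(lik) for x in lik]
for r in range(n + 1):
    assert abs(post[r] - lik[r] * (m + 1) / (n + 1)) < 1e-12
# The third relation, with its (m + 2) denominator:
lik2 = [hypergeom(s + 1, n, m + 1, r) for r in range(n + 1)]
post2 = [x / sum(lik2) for x in lik2]
for r in range(n + 1):
    assert abs(lik2[r] - post2[r] * (n + 1) / (m + 2)) < 1e-12
print("both relations hold")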

By using the first of the two rules for binomial coefficients, we can also show that
Pr( s | n, m, r ) (r − s)/(n − m)  ==  Pr( s + 1 | n, m + 1, r ) (s + 1)/(m + 1)
According to the last fact about the inverse hypergeometric probabilities, this also means that
Pr( s | n, m, r ) (r − s)/(n − m)  ==  Pr( r | n, m + 1, s + 1 ) (n + 1)(s + 1) / ((m + 1)(m + 2))
Nothing cancels just yet, but the factors (n + 1) and (m + 1) will drop out in the final step. I will use this fact below.

Expanding the Predictive Probability

By assumption, we have inspected s out of the r white swans, so there are r − s white swans left. We have further inspected m out of the n swans, so there is a total of n − m swans left. The probability that the next swan will be white is thus (r − s)/(n − m).

If we call this event q, then we have, by the sum rule of probability,
Pr( q | n, m, s )  ==   Σr Pr( q, r | n, m, s )
By the chain rule of probabilities, we further have
Pr( q | n, m, s )  ==   Σr Pr( q | n, m, s, r ) Pr( r | n, m, s )
As argued above, we have
  • Pr( q | n, m, s, r ) = (r − s)/(n − m)
  • Pr( r | n, m, s ) = Pr( s | n, m, r ) (m + 1)/(n + 1)
  • Pr( s | n, m, r ) (r − s)/(n − m)  ==  Pr( r | n, m + 1, s + 1 ) (n + 1)(s + 1) / ((m + 1)(m + 2))
Putting these facts together and cancelling, we get
Pr( q | n, m, s )  ==   (s + 1)/(m + 2) Σr Pr( r | n, m + 1, s + 1 )
I have pulled the constant factors out of the summation here; the factors (n + 1) and (m + 1) cancel, leaving (s + 1)/(m + 2). Notice further that the summation is a sum of posterior probabilities over the possible values of r. It must consequently sum to 1. We thus have
Pr( q | n, m, s )  ==   (s + 1) / (m + 2)
as we wanted to prove.
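The whole derivation can also be checked by brute force: enumerate the posterior over r and compare the predictive probability with (s + 1)/(m + 2) for several population sizes (the example values are mine):

from math import comb

def predictive(n, m, s):
    """Pr(next swan is white | s white among m sampled from n), r uniform."""
    # the common 1/C(n, m) factor is dropped; it cancels in the normalization
    lik = [comb(r, s) * comb(n - r, m - s) for r in range(n + 1)]
    post = [x / sum(lik) for x in lik]
    return sum(p * (r - s) / (n - m) for r, p in enumerate(post))

m, s = 6, 4
for n in (10, 50, 200):
    print(n, predictive(n, m, s), (s + 1) / (m + 2))   # 0.625 every time

Whatever n is, the answer agrees with (s + 1)/(m + 2), confirming that the population size drops out.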

Carnap and Jeffrey: Studies in Inductive Logic and Probability, Vol. 1 (1971)

Carnap at his desk; from carnap.org.
This is an anthology edited by Rudolf Carnap and the philosopher Richard C. Jeffrey (not to be confused with the geophysicist Harold Jeffreys).

The majority of the book is dedicated to two essays on probability which Carnap intended to be a substitute for the (never realized) second volume of the Logical Foundations of Probability (1950). Carnap's idea is that rational belief should be understood as the result of probabilistic conditioning on a special kind of "nice" prior.

An Inconsistent Axiomatization of Rationality

In order to demarcate the realm of rational belief, Carnap has to specify the set of permitted starting states of the system and its update rules. He does so by means of the following four "rationality assumptions":
  1. Coherence — You must conform to the axioms of probability; or in terms of gambling, you may not assign positive utility to any gamble that guarantees a strict loss.
  2. Strict Coherence — You may not assign an a priori probability of 0 to any event; or equivalently, you may not assign positive utility to a gamble that renders a strict loss possible and a weak loss necessary.
  3. Belief revision depends only on the evidence — Your beliefs at any time must be determined completely by your prior beliefs and your evidence (nothing else). Assuming axiom 1 is met, this comes down to producing new beliefs by conditioning.
  4. Symmetry — You must assign the same probability to propositions of the same logical form, i.e., F(x) and F(y).
These axioms are inconsistent in a number of cases, and Carnap does not seem to realize it. The problems are that
  • Many infinite sets cannot be equipped with a proper, regular, and symmetric distribution. For instance, there is no "uniform distribution on the integers";
  • There may be interdependent propositional functions in the language, and a prior that renders one symmetric might render another asymmetric. Consider for instance F(x) = "the box has a side-length between x and x + 1" and G(x) = "the box has a volume between x and x + 1".
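To see the second problem concretely (the sampling range and bins are my choice): a prior that is symmetric over side-length intervals is far from symmetric over the corresponding volume intervals:

import numpy as np

rng = np.random.default_rng(0)
side = rng.uniform(0.0, 2.0, size=100_000)  # symmetric over side-length intervals
volume = side ** 3
print(np.histogram(side, bins=[0, 1, 2])[0])       # two roughly equal counts
print(np.histogram(volume, bins=[0, 1, 2, 8])[0])  # wildly unequal counts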
Maybe Carnap had a vague idea about the first problem — at least he seems to assume that the sample space is finite throughout the first essay ("Inductive Logic and Rational Decisions," cf. pp. 7 and 14).

In the second essay, however, he explicitly says that there are countably many individuals in the language, so it would seem that he owes us a proper, coherent, and regular distribution on the integers ("A Basic System of Inductive Logic, Part I," ch. 9, p. 117).

Both Jaynes and Jeffreys made attempts at tackling the second problem by choosing priors that would decrease the tension between two descriptions. Jeffreys, for instance, showed that a probability density function of the form f(t) = 1/t (restricted to some positive interval) makes it irrelevant whether a normal distribution is described in terms of its variance or its precision parameter. Jaynes, by an essentially identical argument, "solved" Bertrand's paradox by choosing a prior that minimizes the discrepancy between a side-length description and a volume-description.
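Jeffreys' resolution is easy to check numerically as well: draws from a density proportional to 1/t are uniform in log t, and taking reciprocals (variance to precision) just flips the sign of log t, leaving the distribution flat in the log, i.e., of the same 1/t form. The range and bin count below are mine:

import numpy as np

rng = np.random.default_rng(0)
a, b = 0.1, 10.0
variance = np.exp(rng.uniform(np.log(a), np.log(b), size=100_000))  # f(t) ∝ 1/t on [a, b]
precision = 1.0 / variance
for name, x in [("variance", variance), ("precision", precision)]:
    print(name, np.histogram(np.log(x), bins=10)[0])  # both flat in log-space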

What is a Rationality Assumption?

Carnap knows that probability theory has to be founded on something other than probability theory to make sense and explains that "the reasons for our choice of the axioms are not purely logical." (p. 26; his emphasis).

Rather, they are game-theoretic: In order to argue against the use of some a priori probability measure (or "M-function"), Carnap must show why somebody starting from this prior
…, in a certain possible knowledge situation, would be led to an unreasonable decision. Thus, in order to give my reasons for the axiom, I move from pure logic to the context of decision theory and speak about beliefs, actions, possible losses, and the like. (p. 26)
That sounds circular, but the rest of his discussion seems to indicate that he is thinking about worst-case (or minimax) decision theory, which makes sense.

"Reduced to one"

What does not make sense, however, is his unfounded faith that there are always reasons to prefer one M-function over another:
Even on the basis of all axioms that I would accept at the present time, the number of admissible M-functions, i.e., those that satisfy all accepted axioms, is still infinite; but their class is immensely smaller than that of all coherent M-functions [i.e., all probability measures]. There will presumably be further axioms, justified in the same way by considerations of rationality. We do not know today whether in this future development the number of admissible M-functions will always remain infinite or will become finite and possibly even be reduced to one. Therefore, at the present time I do not assert that there is only one rational Cr0-function [= initial credence = credence at time 0]. (p. 27)
But clearly, he hopes so.

Carnap the Moralist

Interestingly, Carnap makes a very direct connection between moral character and epistemic habits. This comes out most clearly in a passage in which he explains that rationality is a matter of belief revision rather than belief:
When we wish to judge the morality of a person, we do not simply look at some of his acts; we study rather his character, the system of his moral values, which is part of his utility function. Observations of single acts without knowledge of motives give little basis for judgment. Similarly if we wish to judge the rationality of a person's beliefs, we should not look simply at his present beliefs. Information on his beliefs without knowledge of the evidence out of which they arose tells us little. We must rather study the way in which the person forms his beliefs on the basis of evidence. In other words, we should study his credibility function, not simply his present credence function. (p. 22)
The "Reasonable Man" (to use the 18th century terminology) is thus the man who updates his beliefs in a responsible, careful, and modest fashion. Lack of reason is the stubborn rejection of norms of evidence, a refusal to surrender to the "truth cure."

As an illustration of what he has in mind, Carnap considers an urn example in which a person X observes a majority of black balls being drawn (E), and Y observes a majority of white balls (E'). He continues:
Let H be the prediction that the next ball drawn will be white. Suppose that for both X and Y the credence of H is 2/3. Then we would judge this same credence value 2/3 of the proposition H as unreasonable for X, but reasonable for Y. We would condemn a credibility function Cred as nonrational if Cred(H | E) = 2/3; while the result Cred(H | E') = 2/3 would be no ground for condemnation. (p. 22)
So although he elsewhere argues that rationality is a matter of risk minimization, he nevertheless falls right into the moralistic language of "grounds for condemnation."

Do the Robot

A similar formulation appears earlier, as he discusses the axiom that belief revision is based on evidence only. For a person satisfying this criterion, Carnap explains,
… changes in his credence function are influenced only by his observational results, but not by any other factors, e.g., feelings like his hopes or fears concerning a possible future event H, feelings that in fact often influence the beliefs of all actual human beings. (pp. 15–16)
 Like Jaynes, he defends this idealization by reference to a hypothetical design problem:
Thinking about the design of a robot might help us in finding rules of rationality. Once found, these rules can be applied not only in the construction of a robot but also in advising human beings in their effort to make their decisions as rational as their limited abilities permit. (p. 17)
Another way of saying the same thing is that we should first describe the machine that we would want to do the job, and then tell people how to become more like that machine.

Wednesday, March 20, 2013

Bernardo: "Expected Information as Expected Utility" (1979)

Following a suggestion by Dennis Lindley (1956), this paper proposes that the appropriate measure of the expected value of an experiment X with respect to a target parameter Θ is the mutual information between the two, that is,
I(X;Θ)  =  H(Θ) – H(Θ | X).
Bernardo calls this the "expected useful information" contained in the experiment (§2).
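As a toy version of this measure (the channel and its 10% noise rate are my example, not the paper's): a binary parameter observed through a noisy measurement, with the expected information computed directly from the joint distribution:

import numpy as np

def mutual_information(joint):
    """I(X; Theta) = H(Theta) - H(Theta | X) for a discrete joint over (x, theta)."""
    h = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
    px = joint.sum(axis=1)                      # marginal of the observation
    h_cond = sum(px[i] * h(joint[i] / px[i]) for i in range(len(px)))
    return h(joint.sum(axis=0)) - h_cond        # H(Theta) - H(Theta | X)

joint = np.array([[0.45, 0.05],    # rows: observation x, columns: parameter theta
                  [0.05, 0.45]])   # theta uniform, measurement wrong 10% of the time
print(mutual_information(joint))   # about 0.53 bits per observation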

Proper Scoring Rules

The paper also contains a uniqueness theorem about so-called proper scoring rules (§3–4).

A "scoring rule" is a scheme for rewarding an agent (the "scientist") who reports probability distributions to you. It may depend on the distribution and on the actual observed outcome. For instance, a feasible rule is to pay the scientist p(x) dollars for the density function p if in the event that x occurred.

That function, however, would under many common rationality assumptions give the scientist an incentive to misreport his or her actual probability estimates. We consequently define a "proper" scoring rule as one that is hack-proof in the sense that the best course of action under that rule is to report your actual probability estimates.

An example of a proper scoring rule is –log p(x), but apparently, there are others. Bernardo refers to Robert Buehler and I. J. Good's papers in Foundations of Statistical Inference (1971) for further examples. Unfortunately, that book seems to be a bit difficult to get a hold of.
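A small numerical illustration of the difference (the beliefs and the dishonest report are my example): under the "p(x) dollars" rule from above, exaggerating the most likely outcome pays, whereas under the logarithmic rule honesty wins:

import numpy as np

p = np.array([0.2, 0.3, 0.5])            # the scientist's actual beliefs

def expected_payoff(report, score):
    """Expected reward under beliefs p when announcing `report`."""
    return sum(p[x] * score(report, x) for x in range(len(p)))

log_score = lambda q, x: np.log(q[x])    # proper (the logarithmic form with a=1, b=0)
lin_score = lambda q, x: q[x]            # improper: q(x) dollars if x occurs

dishonest = np.array([0.01, 0.01, 0.98]) # exaggerates the most likely outcome
for name, score in [("log", log_score), ("linear", lin_score)]:
    print(name, expected_payoff(p, score), expected_payoff(dishonest, score))

The log column is maximized by the honest report; the linear column by the exaggerated one.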

Nice and Proper Scoring Rules

The theorem that Bernardo proves is the following: The only proper scoring rules that are both smooth and local (as defined below) are functions of the logarithmic form
u(p,x)  =  a log p(x) + b(x)
where a is a constant, and b is a real-valued function on the sample space.

As a corollary, the scientist's optimal expected payoff is B − aH(X), where H is the entropy and B is the average value of the function b under the scientist's subjective probabilities. It also follows that the optimal course of action for the scientist under this scheme will be to provide the maximum amount of information that is consistent with his or her beliefs.

So what does "smooth" and "local" mean?

Bernardo doesn't define "smooth," but usually in real analysis, a smooth function is one that can be differentiated indefinitely often. However, Bernardo refers to the physics textbook by Harold and Bertha Jeffreys (1972) for a definition. I don't know if they use the word the same way.

A scoring rule u is "local" if the reward that the scientist receives in the event of x depends only on x and on the probability that he or she assigned to x. In other words, a local scoring rule u can be rewritten in terms of a function v whose first argument is a probability rather than a probability distribution:
u(q,x)  ==  v(w,x),
where w = q(x) is the reported probability of x (which does not necessarily equal the actual subjective probability p(x)).

How To Prove This Theorem

I haven't thought too hard about the proof, but here's the gist that I got out of it: First, you use the method of Lagrange multipliers to show that the derivative with respect to w = q(x) of the functional
∫ v(q(x), x) p(x) dx  −  λ ( ∫ q(x) dx − 1 )
is zero for every x when the reported function q is optimal. You then conclude that q = p fulfills this condition, since u was assumed to be a proper scoring rule. You then have a differential equation on your hands, and you go on to discover that its only solutions are of the postulated form.