Showing posts with label random processes. Show all posts

Wednesday, September 16, 2015

Billingsley: Ergodic Theory and Information (1965), pp. 12–14

In the process of proving the ergodic theorem, Billingsley also provides, in passing, a proof of a weaker theorem about mixing transformations: for a class of processes that satisfy a certain limiting independence condition, visiting frequencies converge to probabilities.

Definitions

More precisely, a time shift $T$ is said to be mixing (p. 12) if$$
P(A \cap T^{-n}B)  \;  \rightarrow  \;  P(A)P(B)
$$for all measurable sets $A$ and $B$. The process is thus mixing if information about the present is almost entirely independent of the far future. (A sample path $\omega$ belongs to the set $T^{-n}B$ if $\omega$ will visit $B$ after $n$ time shifts forward.) Billingsley doesn't comment on the choice of the term "mixing," but I assume it should be read as an ill-chosen synonym for "shuffling" in this context.

We will now be interested in the behavior of the visiting frequencies$$
P_n(A) \; := \;  \frac{1}{n} \sum_{k=0}^{n-1} \mathbb{I}_A(T^k\omega),
$$where $\mathbb{I}_A$ is the indicator function of some bundle $A$ of sample paths. This quantity is a random variable with a distribution defined by the distribution on the sample paths $\omega$.

Note that the traditional notion of visiting frequency (i.e., how often is the ace of spades on top of the deck?) is a special case of this concept of visiting frequency, in which the bundle $A$ is defined entirely in terms of the 0th coordinate of the sample path $\omega = (\ldots, \omega_{-2}, \omega_{-1}, \omega_{0}, \omega_{1}, \omega_{2}, \ldots)$. In the general case, the question of whether $\omega\in A$ could involve information stored at arbitrary coordinates of $\omega$ (today, tomorrow, yesterday, or any other day).
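As a toy illustration of this more general notion (my own sketch, not from the book), the following snippet treats an i.i.d. Bernoulli sequence as a sample path, lets $T$ be the left shift, and estimates the visiting frequency of a bundle $A$ that depends on two coordinates at once:

```python
import random

random.seed(0)

# A sample path: an i.i.d. Bernoulli(1/2) sequence (a mixing process).
omega = [random.randint(0, 1) for _ in range(100_000)]

# The bundle A: paths whose 0th and 1st coordinates agree.  Membership
# in A thus depends on more than one coordinate of the path.
def in_A(path, k):
    # Is T^k(omega) in A?  The shift T^k moves coordinate k to position 0.
    return path[k] == path[k + 1]

n = 50_000
P_n = sum(in_A(omega, k) for k in range(n)) / n
print(P_n)
```

For this process the frequency settles near $P(A) = \tfrac{1}{2}$, the probability that two independent fair coordinates agree.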

Theorem and Proof Strategy

The mixing theorem now says the following: Suppose that $P$ is a probability measure on the set of sample paths, and suppose that the time shift $T$ is mixing and measure-preserving with respect to this measure. Then$$
P_n(A)  \;  \rightarrow  \;  P(A)
$$in probability.

The proof of this claim involves two things:
  1. Showing that $E[P_n(A)] = P(A)$;
  2. Showing that $Var[P_n(A)] \rightarrow 0$ for $n \rightarrow \infty$.
These two facts together imply convergence in probability, by Chebyshev's inequality (that is, Markov's inequality applied to the squared deviation).
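The role of the two steps can be seen numerically in a small sketch (mine, not Billingsley's): for an i.i.d. process, the variance of $P_n(A)$ controls the exceedance probability exactly as Chebyshev's inequality says.

```python
import random

random.seed(1)

# Visiting frequency of A = {omega_0 = 1} along the first n shifts of an
# i.i.d. Bernoulli(1/2) path, repeated over many independent paths.
def P_n(path, n):
    return sum(path[:n]) / n

eps, n, trials = 0.05, 400, 2000
devs = []
for _ in range(trials):
    path = [random.randint(0, 1) for _ in range(n)]
    devs.append(P_n(path, n) - 0.5)          # P_n(A) - P(A)

var = sum(d * d for d in devs) / trials      # estimate of Var[P_n(A)]
exceed = sum(abs(d) > eps for d in devs) / trials

# Chebyshev: P(|P_n(A) - P(A)| > eps) <= Var[P_n(A)] / eps^2.
print(exceed, "<=", var / eps ** 2)
```

The bound even holds sample-by-sample for the empirical distribution, which is why the vanishing variance in step 2 forces convergence in probability.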

Identity in Mean

The time shift $T$ is assumed to preserve the measure $P$. We thus have $$
E[f(\omega)] \; = \; E[f(T\omega)] \; = \; E[f(T^2\omega)] \; = \; \cdots
$$for any measurable function $f$. It follows that$$
E[\mathbb{I}_A(\omega)] \; = \;
E[\mathbb{I}_A(T\omega)] \; = \;
E[\mathbb{I}_A(T^2\omega)] \; = \; \cdots
$$and that these are all equal to $P(A)$. By the linearity of expectations, we therefore get that $E[P_n(A)] = P(A)$.

This proves that $P_n(A)$ at least has the right mean. Convergence to any other value than $P(A)$ is therefore out of the question, and it now only remains to be seen that the variance of $P_n(A)$ also goes to 0 as $n$ goes to infinity.

Vanishing Variance: The Idea

Consider therefore the variance$$
Var[P_n(A)]  \; = \;  E\left[ \left(P_n(A) - P(A)\right)^2 \right].
$$In order to expand the square under this expectation, it will be helpful to break the term $P(A)$ up into $n$ equal chunks and then write the whole thing as a sum:$$
P_n(A) - P(A)  \; = \;  \frac{1}{n} \sum_{k=0}^{n-1} (\mathbb{I}_A(T^k\omega) - P(A)).
$$If we square this sum of $n$ terms and take the expectation of the expansion, we get a sum of $n^2$ cross-terms, each of which is an expected value of the form$$
Cov(i,j)  \; := \;  E\left[ \left(\mathbb{I}_A(T^i\omega) - P(A)\right) \times \left (\mathbb{I}_A(T^j\omega) - P(A)\right) \right].
$$By expanding this product and using that $E[\mathbb{I}_A(T^k\omega)]=P(A)$, we can find that$$
Cov(i,j)  \; = \;  E\left[\mathbb{I}_A(T^i\omega) \times \mathbb{I}_A(T^j\omega)\right] - P(A)^2.
$$The random variable $\mathbb{I}_A(T^i\omega) \times \mathbb{I}_A(T^j\omega)$ takes the value 1 on $\omega$ if and only if $\omega$ will visit $A$ after both $i$ and $j$ time steps. Hence,$$
Cov(i,j)  \; = \;  P(T^{-i}A \cap T^{-j}A)  -  P(A)^2,
$$and we can write$$
Var[P_n(A)]  \; = \;  \frac{1}{n^2} \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} Cov(i,j).
$$The goal will now be to show that this average of the $n \times n$ terms $Cov(i,j)$ converges to $0$ as we increase $n$.

Vanishing Variance: The Steps

Since $T$ is assumed to be measure-preserving, the value of $Cov(i,j)$ is completely determined by the distance between $i$ and $j$, not their absolute values. We thus have$$
Cov(i,j)  \; = \;  Cov(n+i, n+j)
$$for all $i$, $j$, and $n$.

By the mixing assumption, $Cov(0,j) \rightarrow 0$ for $j \rightarrow \infty$. Together with the observation about $|i-j|$, this implies that $$Cov(i,j) \;\rightarrow\; 0$$ whenever $|i-j| \rightarrow \infty$. If we imagine the values of  $Cov(i,j)$ tabulated in a huge, square grid, the numbers far away from the diagonal are thus tiny. We will now sum up the value of these $n^2$ numbers $Cov(i,j)$ by breaking them into $n$ classes according to the value of $d = |i-j|$.

To do so, we upper-bound the number of members in each class by $2n$: There are for instance $n$ terms with $|i-j|=0$, and $2(n-1)$ terms with $|i-j|=1$. (This upper-bound corresponds graphically to covering the $n\times n$ square with a tilted $n\sqrt{2} \times n\sqrt{2}$ square, except for some complications surrounding $d=0$.)

We thus have$$
Var[P_n(A)]
\; = \;
\frac{1}{n^2} \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} Cov(i,j)
\;  \leq  \;
\frac{2}{n} \sum_{d=0}^{n-1} Cov(0,d).
$$This last quantity is an average of a list of numbers that converge to $0$; this average itself must therefore also converge to $0$ as $n\rightarrow \infty$. This proves that$$
Var[P_n(A)]  \; \rightarrow \;  0
$$for $n \rightarrow \infty$, and we can conclude that $P_n(A) \rightarrow P(A)$ in probability.
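To see the covariance decay concretely, here is a small simulation (my own, not from the book) of a two-state Markov chain that flips state with probability 0.3, which is a mixing process; the empirical $Cov(0,d)$ shrinks geometrically with the gap $d$:

```python
import random

random.seed(2)

# A two-state Markov chain that flips state with probability 0.3.
# Take A = {omega_0 = 1}, so that P(A) = 1/2 in the stationary regime.
flip = 0.3
path = [0]
for _ in range(200_000):
    path.append(path[-1] ^ (random.random() < flip))

N = len(path)
p_A = sum(path) / N

def cov(d):
    # Empirical Cov(0, d) = P(omega in A and T^d omega in A) - P(A)^2.
    m = N - d
    joint = sum(path[k] and path[k + d] for k in range(m)) / m
    return joint - p_A ** 2

# |Cov(0, d)| should shrink roughly like |1 - 2*flip|^d = 0.4^d.
print([round(cov(d), 3) for d in (0, 1, 2, 5, 10)])
```

The numbers far from the diagonal of the covariance grid are indeed tiny, which is what makes the $\frac{1}{n^2}$-weighted double sum collapse.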

Applicability

Note that this mixing assumption is very strong. A periodic Markov chain, for instance, will usually not be mixing: Consider a process that alternates deterministically between odd and even values; the joint probability $P(evens \cap T^{-n} odds)$ will then alternate between $0$ and $P(evens)$ as $n$ runs through the natural numbers, instead of converging to the product $P(evens)P(odds)$. In many cases of practical interest, we therefore do in fact need the increased generality of the classic ergodic theorem.
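A sketch of the failure (my own example): give the alternating process a uniformly random start, so that $P(evens) = P(odds) = \tfrac{1}{2}$, and estimate the joint probability for successive $n$:

```python
import random

random.seed(3)

# A deterministic alternating process: X_{k+1} = 1 - X_k, started
# uniformly at random.  evens = {omega_0 = 0}, odds = {omega_0 = 1}.
def joint_prob(n, trials=10_000):
    # Estimate P(evens and T^{-n} odds) by simulation.
    hits = 0
    for _ in range(trials):
        x0 = random.randint(0, 1)
        xn = x0 if n % 2 == 0 else 1 - x0
        hits += (x0 == 0) and (xn == 1)
    return hits / trials

# Alternates between 0 (n even) and about 1/2 (n odd); it never
# approaches P(evens) * P(odds) = 1/4, so the shift is not mixing.
print([round(joint_prob(n), 2) for n in range(6)])
```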

Thursday, September 3, 2015

Billingsley: Ergodic Theory and Information (1965)

I'm reading this book (now for the second time) for its proof of Birkhoff's ergodic theorem. I was held up by lots of technical details the first time around, but I've gotten a bit further now.

Definitions

The set-up is the following: A set $X$ is given along with a time shift $T: X\rightarrow X$. We further have a probability distribution on the set of sample paths $\Omega=X^{\infty}$. For each integrable function $f: \Omega\rightarrow \mathbb{R}$, we consider the time-average$$
 A_n f(\omega) \;=\; \frac{1}{n} \sum_{i=0}^{n-1} f(T^i \omega).
$$We are interested in investigating whether such averages converge as $n\rightarrow\infty$. If so, we would also like to know whether the limit is the same on all sample paths.

Notice that $f$ is a function of the whole sample path, not just of a single coordinate. The function $f$ could for instance measure the difference between two specific coordinates, or it could return a limiting frequency associated with the sample path.

If $f$ has the same value on the entire orbit $$\omega,\, T\omega,\, T^2\omega,\, T^3\omega,\, \ldots$$we say that the function $f$ is time-invariant. Global properties like quantifications or tail events are time-invariant.

A set $A$ is similarly time-invariant if $TA=A$. (I like to call such invariant sets "trapping sets.") This is a purely set-theoretic notion, and doesn't involve the probability measure on the sample paths. However, we say that an invariant set $A$ is non-trivial if $0 < P(A) < 1$.

Statement

One of the things that are confusing about the so-called ergodic theorem is that it actually doesn't involve ergodicity very centrally. Instead it makes the following statement:
  1. Suppose that the time shift $T$ is measure-preserving in the sense that $P(A)=P(TA)$ for all measurable sets $A$. Then the time-averages of any integrable function converge almost everywhere:$$P\left\{\omega : \lim_{n\rightarrow \infty} A_n f(\omega) \textrm{ exists} \right\} \;=\; 1.$$
  2. Suppose in addition that the time shift is ergodic in the sense that all its trapping sets are trivial with respect to $P$. Then the time averages are all the same on a set with $P$-probability 1:$$P \left\{(\omega_1, \omega_2) : \lim_{n\rightarrow\infty} A_n f(\omega_1) \,=\, \lim_{n\rightarrow\infty} A_n f(\omega_2) \right\} \;=\; 1.$$
Almost all the mathematical energy is spent proving the first part of this theorem. The second part is just a small addendum: Suppose a function $f$ has time averages which are not almost everywhere constant. Then the threshold set$$
\{\lim_{n\rightarrow\infty} A_n f \leq \tau\}
$$must have a probability strictly between 0 and 1 for some threshold $\tau$. (This is a general feature of random variables that are not a.e. constant.)

But since the limiting time-average is a time-invariant property (stating something about the right-hand tail of the sample path), this thresholding set is an invariant set. Thus, if the limiting time-averages are not a.e. unique, the time shift has a non-trivial trapping set.

So again, the heavy lifting lies in proving that the time-averages actually converge on almost all sample paths when the time shift is measure-preserving. Billingsley gives two proofs of this, one following an idea by von Neumann, and one following Birkhoff.

The Von Neumann Proof

Let's say that a function is "flat" if it has converging time-averages. We can then summarize von Neumann's proof along roughly these lines:
  1. Time-invariant functions are always flat.
  2. Increment functions of the form $$s(\omega) \;=\; g(\omega) - g(T\omega),$$ where $g$ is any function, are also flat.
  3. Flatness is preserved by taking linear combinations.
  4. Any square-integrable function can be obtained as the $L^2$ limit of (linear combinations of) invariant functions and increment functions.
  5. Moreover, the $L^2$ limit of a sequence of flat $L^2$ functions is flat; in other words, flatness is preserved by limits.
  6. Hence, all $L^2$ functions are flat.
This argument seems somewhat magical at first, but I think more familiarity with the concept of orthogonality and total sets in the $L^p$ Hilbert spaces could make it seem more natural.

Just as an example of how this might work, consider the coordinate function $f(\omega) = X_0(\omega)$. In order to obtain this function from our basis functions, fix a decay rate $r < 1$ and consider the function$$
g(\omega) \;=\; X_0(\omega) + r X_1(\omega) + r^2 X_2(\omega) + r^3 X_3(\omega) + \cdots
$$A short computation then gives the increment function$$
g(\omega) - g(T\omega) \;=\; X_0(\omega) \,-\, (1-r)\sum_{k\geq 1} r^{k-1} X_k(\omega),
$$where the subtracted term is an Abel average of the later coordinates. For a Bernoulli process with mean $\mu$, this Abel average converges to $\mu$ in $L^2$ as $r \rightarrow 1$, so the increment functions converge to the centered coordinate $X_0(\omega) - \mu$; adding back the invariant constant $\mu$ then recovers $X_0$ as the limit of an increment function plus an invariant function. Billingsley uses an abstract and non-constructive orthogonality argument to show that such decompositions are always possible.
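Numerically (my own sketch), the increments $g(\omega) - g(T\omega)$ do approach the centered coordinate $X_0(\omega) - \tfrac{1}{2}$ for a fair Bernoulli path as $r \rightarrow 1$, with the Abel average of the later coordinates supplying the mean:

```python
import random

random.seed(4)

# A Bernoulli(1/2) sample path, truncated far out in the tail.
omega = [random.randint(0, 1) for _ in range(200_000)]

def g(path, r):
    # The geometric potential g(omega) = sum_k r^k X_k(omega).
    return sum((r ** k) * x for k, x in enumerate(path))

errors = {}
for r in (0.9, 0.99, 0.999):
    increment = g(omega, r) - g(omega[1:], r)   # g(omega) - g(T omega)
    errors[r] = increment - (omega[0] - 0.5)    # distance from X_0 - 1/2

# Should shrink toward 0 as r -> 1.
print({r: round(e, 3) for r, e in errors.items()})
```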

In order to show that the limit of flat functions is also flat, suppose that $f_1, f_2, f_3, \ldots$ are all flat, and that $f_k \rightarrow f^*$. We can then prove that the averages $A_1f^*, A_2f^*, A_3f^*, \ldots$ form a Cauchy sequence by using the approximation $f^* \approx f_k$ to show that $$| A_n f^* - A_m f^* | \;\approx\; | A_n f_k - A_m f_k | \;\rightarrow\; 0.$$ Of course, the exact argument needs a bit more detail, but this is the idea.

After going through this $L^2$ proof, Billingsley explains how to extend the proof to all of $L^1$. Again, this is a limit argument, showing that the flatness is preserved as we obtain the $L^1$ functions as limits of $L^2$ functions.

The Birkhoff Proof

The second proof follows Birkhoff's 1931 paper. The idea here is to derive the ergodic theorem from the so-called maximal ergodic theorem (which Birkhoff just calls "the lemma"). This theorem states that if $T$ preserves the measure $P$, then$$
P\left\{ \exists n: A_n f > \varepsilon \right\}
\;\leq\;
\frac{1}{\varepsilon} \int_{\{f>\varepsilon\}} f \;dP.
$$I am not going to go through the proof, but here are a few observations that might give a hint about how to interpret this theorem:
  1. For any arbitrary function, we have the inequality $$P(f>\varepsilon) \;\leq\; \frac{1}{\varepsilon} \int_{\{f>\varepsilon\}}f\;dP.$$This is the generalization of Markov's inequality to random variables that are not necessarily positive.
  2. If $f$ is time-invariant, then $A_n f = f$, because the average $A_n f$ then consists of $n$ copies of the number $$f(\omega) \;=\; f(T\omega) \;=\; f(T^2\omega) \;=\; \cdots \;=\; f(T^{n-1}\omega).$$ In this case, the theorem thus follows from the generalized Markov bound above.
  3. Even though we do not assume that $f(\omega)=f(T\omega)$ for a fixed sample path $\omega$, all distributional statements we can make about $f(\omega)$ also apply to $f(T\omega)$ because $T$ is assumed to be measure-preserving; for instance, the two functions have the same mean when $\omega \sim P$. This fact of course plays a crucial part in the proof.
  4. By putting $(f,\varepsilon) := (-f,-\varepsilon)$, we see that the maximal ergodic theorem is equivalent to$$P\left\{ \exists n: A_n f < \varepsilon \right\}
    \;\leq\;
    \frac{1}{\varepsilon} \int_{\{f<\varepsilon\}} f \;dP.
    $$
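Point 1 above is easy to check numerically, since the generalized Markov bound holds sample-by-sample even when $f$ takes negative values; here is a quick sketch (my own) with Gaussian samples:

```python
import random

random.seed(5)

# f takes both signs; the generalized Markov bound still holds:
#   P(f > eps)  <=  (1/eps) * E[ f ; f > eps ].
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]
eps = 0.5

lhs = sum(x > eps for x in samples) / len(samples)
rhs = sum(x for x in samples if x > eps) / (eps * len(samples))
print(lhs <= rhs)   # prints: True
```

The inequality is immediate once you notice that every sample with $x > \varepsilon$ contributes $1$ to the left-hand side but $x/\varepsilon \geq 1$ to the right-hand side.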
The maximal ergodic theorem can be used to prove the general (or "pointwise") ergodic theorem. The trick is to find both an upper and a lower bound on the non-convergence probability
$$
q \;=\; P\{ \omega :\, \liminf A_n f(\omega) < a < b < \limsup A_n f(\omega) \},
$$with the goal of showing that these bounds can only be satisfied for all $a<b$ when $q=0$.

Comments

One of the things I found confusing about the ergodic theorem is that it is often cited as a kind of uniqueness result: It states, in a certain sense, that all sample paths have the same time-averages.

However, the problem is that when the time shift has several trapping sets, there are multiple probability distributions that make the time shift ergodic. Suppose that $T$ is non-ergodic with respect to some measure $P$; and suppose further that $A$ is a minimal non-trivial trapping set in the sense that it does not itself contain any trapping sets $B$ with $0 < P(B) < P(A) < 1$. Then we can construct a measure with respect to which $T$ will be ergodic, namely, the conditional measure $P(\,\cdot\,|\,A)$.

This observation reflects the fact that if we sample $\omega \sim P$, the sample path we draw may lie in any of the minimal non-trivial trapping sets for $T$. Although it might not be clear based on a finite segment of $\omega$ (because such a segment might not reveal which trapping set we're in), such a sample path will stay inside this minimal trap set forever, and thus converge to the time-average characteristic of that set.

A small complication here is that even a minimal trapping set might in fact contain other trapping sets that are smaller: For instance, the set of coin flipping sequences whose averages converge to 0 is a trapping set, but so is the singleton containing the all-0 sequence. However, this singleton set will either have positive probability (in which case the limit set is not minimal) or probability 0 (in which case it can never come up as a sample from $P$).

Lastly, suppose the sample path $\omega$ is not sampled from $P$, but from some other measure $Q$ which is not preserved by $T$. Can we say that $A_nQ\rightarrow P$?

Not always, but a sufficient condition for this to be the case is that $Q$ is absolutely continuous with respect to $P$ ($Q \ll P$). If that is the case, then the $Q$-probability of observing a non-converging average will be 0, since the $P$-probability of this event is 0, and $P$ dominates $Q$.

I am not sure whether this condition is also necessary. A degenerate measure $Q$ which places all mass on a single sample path, or even on countably many, can often be detected by some integrable function $f$, such as the indicator function of the complement of those countably many sample paths.

Wednesday, May 28, 2014

Herdan: The Advanced Theory of Language as Choice and Chance (1966)

This really odd book is based to some extent on the earlier Language as Choice and Chance (1956), but contains, according to the author, a lot of new material. It discusses a variety of topics in language statistics, often in a rather unsystematic and bizarrely directionless way.

The part about "linguistic duality" (Part IV) is particularly confusing and strange, and I'm not sure quite what to make of it. Herdan seems to want to make some great cosmic connection between quantum physics, natural language semantics, and propositional logic.

But leaving that aside, I really wanted to quote it because he so clearly expresses the deterministic philosophy of language — that speech isn't a random phenomenon, but rather has deep roots in free will.

"Human Willfulness"

He thus explains that language has one region which is outside the control of the speaker, but that once we are familiar with this set of restrictions, we can express ourselves within its bounds:
It leaves the individual free to exercise his choice in the remaining features of language, and insofar language is free. The determination of the extent to which the speaker is bound by the linguistic code he uses, and conversely, the extent to which he is free, and can be original, this is the essence of what I call quantitative linguistics. (pp. 5–6)
Similarly, he comments on a study comparing the letter distribution in two texts as follows:
There can be no doubt about the possibility of two distributions of this kind, in any language, being significantly different, if only for the simple reason that the laws of language are always subject, to some extent at least, to human willfulness or choice. By deliberately using words in one text, which happen to be of rather singular morphological structure, it may well be possible to achieve a significant difference of letter frequencies in the two texts. (p. 59)
And finally, in the chapter on the statistical analysis of style, he goes on record completely:
The deterministic view of language regards language as the deliberate choice of such linguistic units as are required for expressing the idea one has in mind. This may be said to be a definition in accordance with current views. Introspective analysis of linguistic expression would seem to show it is a deterministic process, no part of which is left to chance. A possible exception seems to be that we often use a word or expression because it 'happened' to come into our mind. But the fact that memory may have features of accidental happenings does not mean that our use of linguistic forms is of the same character. Supposing a word just happened to come into our mind while we are in the process of writing, we are still free to use it or not, and we shall do one or the other according to what is needed for expressing what we have in mind. It would seem that the cause and effect principle of physical nature has its parallel in the 'reason and consequence' or 'motive and action' principle of psychological nature, part of which is the linguistic process of giving expression to thought. Our motive for using a particular expression is that it is suited better than any other we could think of for expressing our thought. (p. 70)
He then goes on to say that "style" is a matter of self-imposed constraints, so that we are in fact a bit less free to choose our words than it appears, although we ourselves come up with the restrictions.

"Grammar Load"

In Chapter 3.3, he expands these ideas about determinism and free will in language by suggesting that we can quantify the weight of the grammatical restrictions of a language in terms of a "grammar load" statistic. He suggests that this grammar load can be assessed by counting the number of distinct word forms per token in a sample of text from the language (p. 49). He does discuss entropy later in the book (Part III(C)), but doesn't make the connection to redundancy here.

Inspecting an English corpus of 78,633 tokens, he thus finds 53,102 different forms and concludes that English has a "grammar load" of
53,102 / 78,633 = 67.53%.
Implicit in this computation is the idea that the number of new word forms grows linearly with the number of tokens you inspect. This excludes sublinear spawn rates. In fact, vocabulary size does seem to grow sublinearly as a function of corpus size, so the spawn rate for new word forms is not constant in English text.

Word forms in the first N tokens in the Brown corpus for N = 0 … 100,000.

However, no simple growth function of this kind fits the data very well (as far as I can see).
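The sublinear growth is easy to reproduce with a toy model (my own sketch, not Herdan's data): sampling "words" from a Zipf-like distribution makes the type/token ratio, i.e., the "grammar load" statistic, fall steadily with corpus size.

```python
import random

random.seed(6)

# Sample "words" from a Zipf-like distribution (weight 1/rank) over a
# 50,000-word vocabulary and track the type/token ratio as N grows.
vocab = 50_000
weights = [1.0 / rank for rank in range(1, vocab + 1)]
tokens = random.choices(range(vocab), weights=weights, k=100_000)

seen, ratios = set(), {}
for i, tok in enumerate(tokens, start=1):
    seen.add(tok)
    if i in (1_000, 10_000, 100_000):
        ratios[i] = len(seen) / i

# The ratio falls with corpus size: the spawn rate of new forms is
# not constant, so the statistic depends on the sample size used.
print(ratios)
```

This is one reason a ratio like 53,102 / 78,633 is hard to interpret on its own: the same language would yield a different "grammar load" at a different corpus size.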

Thursday, March 27, 2014

Walters: An Introduction to Ergodic Theory (1982), p. 26

Section 1.4 of Peter Walters' textbook on ergodic theory contains a proof of Poincaré's recurrence theorem. I found it a little difficult to read, so I'll try to paraphrase the proof here using a vocabulary that might be a bit more intuitive.

The Possible is the Necessarily Possible

The theorem states the following: If
  1. X is a probability space
  2. $T: X \rightarrow X$ is a measure-preserving transformation of $X$,
  3. E is an event with positive probability,
  4. and x a point in E,
then the series
$x,\; Tx,\; T^2x,\; T^3x,\; T^4x,\; \ldots$
will almost certainly pass through E infinitely often. Or: if it happens once, it will happen again.

The idea behind the proof is to describe the set R of points that visit E infinitely often as the superior limit of a series of sets. This description can then be used to show that $E \cap R$ has the same measure as E. This will imply that almost all points in E revisit E infinitely often.

Statement and proof of the theorem; scan from page 26 in Walters' book.

I'll try to spell this proof out in more detail below. My proof is much, much longer than Walters', but hopefully this means that it's much, much more readable.

Late Visitors

Let $R_i$ be the set of points in $X$ that visit $E$ after $i$ or more applications of $T$. We can then make two observations about the series $R_0, R_1, R_2, R_3, \ldots$:

First, if $j > i$, and you visit $E$ at a time later than $j$, you also visit $E$ at a time later than $i$. The $R_i$'s are consequently nested inside each other:
$$R_0 \;\supset\; R_1 \;\supset\; R_2 \;\supset\; R_3 \;\supset\; \cdots$$
Let's use the name R for the limit of this series (that is, the intersection of the sets). R then consists of all the points in X that visit E infinitely often.

The series of sets is downward converging.

Second, $R_i$ contains the points that visit $E$ at time $i$ or later, and the transformation $T^{-1}$ takes us one step back in time. The set $T^{-1}R_i$ consequently contains the points in $X$ that visit $E$ at time $i + 1$ or later. Thus
$$T^{-1}R_i \;=\; R_{i+1}.$$
But since we have assumed that $T$ is measure-preserving, this implies that
$$m(R_i) \;=\; m(T^{-1}R_i) \;=\; m(R_{i+1}).$$
By induction, every set in the series thus has the same measure:
$$m(R_0) \;=\; m(R_1) \;=\; m(R_2) \;=\; m(R_3) \;=\; \cdots$$
Or to put it differently, the discarded parts $R_0 \setminus R_1$, $R_1 \setminus R_2$, $R_2 \setminus R_3$, etc., are all null sets.

Intersection by Intersection

So we have that
  1. in set-theoretic terms, the $R_i$'s converge to a limit $R$ from above;
  2. but all the $R_i$'s have the same measure.
Let's use these facts to show that $m(E \cap R) = m(E)$, that is, that we only throw away a null set by intersecting $E$ with $R$.

The event E and the set R of points that visit E infinitely often.

To prove this, notice first that every point in $E$ visits $E$ after zero applications of $T$. Thus, $E \subset R_0$, or in other words, $E \cap R_0 = E$. Consequently,
$$m(E \cap R_0) \;=\; m(E).$$
We now need to extend this base case by a limit argument to show that
$$m(E \cap R) \;=\; m(E).$$
But, as we have seen above, the difference between $R_0$ and $R_1$ is a null set. Hence, the difference between $E \cap R_0$ and $E \cap R_1$ is also a null set, so
$$m(E \cap R_0) \;=\; m(E \cap R_1).$$
This argument holds for any $i$ and $i + 1$. By induction, we thus get
$$m(E \cap R_0) \;=\; m(E \cap R_1) \;=\; m(E \cap R_2) \;=\; m(E \cap R_3) \;=\; \cdots$$

A visit to E before but never after time i has probability 0.

Since measures respect limits, this implies that
$$m(E \cap R_0) \;=\; m(E \cap R).$$
But since we have already seen that $m(E \cap R_0) = m(E)$, it follows that
$$m(E \cap R) \;=\; m(E).$$
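A quick numerical illustration (my own, not Walters'): the irrational rotation $T(x) = x + \alpha \bmod 1$ preserves Lebesgue measure, and points starting in $E = [0, 0.1)$ do indeed return to $E$ over and over:

```python
import math

# The irrational rotation T(x) = x + alpha (mod 1) preserves Lebesgue
# measure on [0, 1).  Take E = [0, 0.1) and count visits along an orbit.
alpha = math.sqrt(2) - 1

def visits(x, steps=10_000):
    count = 0
    for _ in range(steps):
        if x < 0.1:          # x is in E
            count += 1
        x = (x + alpha) % 1.0
    return count

# Each starting point in E returns to E again and again: roughly
# m(E) * steps = 1,000 visits over 10,000 steps.
print([visits(x0) for x0 in (0.0, 0.05, 0.09)])
```

For this particular map the orbit is in fact equidistributed, so the visit count even tracks $m(E)$; the recurrence theorem itself only promises infinitely many returns.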

The Wherefore

An informal explanation of what's going on in this proof might be the following:

We are interested in the conditional probability of visiting E infinitely often given that we have visited it once, that is, Pr(R | E). In order to compute this probability, we divide the sample paths into an infinite number of cases and discover that, once we condition on E, all but one of them have probability 0.

If you imagine yourself walking along a sample path, your route will fall into one of the following categories:
  • you never visit E;
  • you visit E for the last time at time i = 0;
  • you visit E for the last time at time i = 1;
  • you visit E for the last time at time i = 2;
  • you visit E for the last time at time i = 3;
  • there is no last time — i.e., you visit E infinitely often.
When we condition on E, the first of these cases has probability 0.

In general, the fact that T is measure-preserving guarantees that an event cannot, with positive probability, occur exactly i times without occurring i + 1 times; consequently, each of the "last visit at time i" cases also has probability 0.

We thus have to conclude that the last option — infinitely many visits to E — has the same probability as visiting E once, and thus a conditional probability of 1.

Saturday, March 1, 2014

Attneave: Applications of Information Theory to Psychology (1959)

Fred Attneave's book on information theory and psychology is a sober and careful overview of the various ways in which information theory had been applied to psychology (by people like George Miller) by 1959.

Attneave explicitly tries to stay clear of the information theory craze which followed the publication of Shannon's 1948 paper:
Thus presented with a shiny new tool kit and a somewhat esoteric new vocabulary to go with it, more than a few psychologists reacted with an excess of enthusiasm. During the early fifties some of the attempts to apply informational techniques to psychological problems were successful and illuminating, some were pointless, and some were downright bizarre. At present two generalizations may be stated with considerable confidence:
(1) Information theory is not going to provide a ready-made solution to all psychological problems; (2) Employed with intelligence, flexibility, and critical insight, information theory can have great value both in the formulation of certain psychological problems and in the analysis of certain psychological data (pp. v–vi)
Or in other words: Information theory can provide the descriptive statistics, but there is no hiding from the fact that you and you alone are responsible for your model.

Language Only, Please

Chapter 2 of the book is about entropy rates, and about the entropy of English in particular. Attneave talks about various estimation methods, and he discusses Shannon's guessing game and a couple of related studies.

As he sums up the various mathematical estimation tricks, he notes that predictions from statistical tables tend to be more reliable than predictions from human subjects with respect to the first couple of letters of a text. This means that estimates from human predictions will tend to overestimate the unpredictability of the first few letters of a string.
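The flavor of such table-based estimates can be sketched in a few lines (a toy of my own, not one of the studies Attneave discusses): estimate the conditional entropy of a letter given its predecessor from bigram counts, and compare it with the raw letter entropy.

```python
import math
from collections import Counter

# A toy corpus; real estimates of English need vastly more data.
text = "the theory of information applies to the statistics of letters " * 50

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

h1 = entropy(Counter(text))                      # H(X): single letters
h12 = entropy(Counter(zip(text, text[1:])))      # H(X_{N-1}, X_N): bigrams
h2 = h12 - h1                                    # approx. H(X_N | X_{N-1})

# Conditioning on the previous letter lowers the entropy estimate.
print(round(h1, 2), round(h2, 2))
```

Extending the context (trigrams, tetragrams, and so on) lowers the estimate further, which is the statistical-table route to the entropy of English that Attneave compares with Shannon's human-guessing method.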

He then comments:
What we are concerned with above is the obvious possibility that calculated values (or rather, brackets) of HN [= the entropy of letter N given letter 1 through N – 1] will be too high because of the subject's incomplete appreciation of statistical regularities which are objectively present. On the other hand, there is the less obvious possibility that a subject's guesses may, in a certain sense, be too good. Shannon's intent is presumably to study statistical restraints which pertain to language. But a subject given a long sequence of letters which he has probably never encountered before, in that exact pattern, may be expected to base his prediction of the next letter not only upon language statistics, but also upon his general knowledge of the world to which language refers. A possible reply to this criticism is that all but the lowest orders of sequential dependency in language are in any case attributable to natural connections among the referents of words, and that it is entirely legitimate for a human predictor to take advantage of such natural connections to estimate transitional probabilities of language, even when no empirical frequencies corresponding to the probabilities exist. It is nevertheless important to realize that a human predictor is conceivably superior to a hypothetical "ideal predictor" who knows none of the connections between words and their referents, but who (with unlimited computational facilities) has analyzed all the English ever written and discovered all the statistical regularities residing therein. (pp. 39–40; emphases in original)
I'm not sure that was "Shannon's intent." Attneave seems to rely crucially on an objective interpretation of probability as well as an a priori belief in language as an autonomous object.

Just like Laplace's philosophical commitments became obvious when he started talking in hypothetical terms, it is also the "ideal predictor" in this quote which reveals the philosophy of language that informs Attneave's perspective.