Wednesday, September 16, 2015

Billingsley: Ergodic Theory and Information (1965), pp. 12–14

In the process of proving the ergodic theorem, Billingsley also provides, in passing, a proof of a weaker theorem about mixing processes: for a class of processes satisfying a certain limiting-independence condition, visiting frequencies converge to probabilities.

Definitions

More precisely, a time shift $T$ is said to be mixing (p. 12) if $$P(A \cap T^{-n}B) \;\to\; P(A)P(B) \quad \text{as } n \to \infty$$ for all measurable sets $A$ and $B$. The process is thus mixing if information about the present is almost entirely independent of the far future. (A sample path $\omega$ belongs to the set $T^{-n}B$ if $\omega$ will visit $B$ after $n$ time shifts forward.) Billingsley doesn't comment on the choice of the term "mixing," but I assume it should be read as an ill-chosen synonym for "shuffling" in this context.
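
As a quick sanity check (my own illustration, not Billingsley's), the following sketch estimates $P(A \cap T^{-n}B)$ for an i.i.d. coin-flip process, taking $A = \{\omega : \omega_0 = 1\}$ and $B = \{\omega : \omega_0 = 0\}$; independence across coordinates makes the product form hold exactly for every $n \ge 1$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Many finite windows of an i.i.d. fair-coin process; coordinate k of a row
# plays the role of omega_k, and the shift T advances the index by one.
paths = rng.integers(0, 2, size=(200_000, 16))

# A = {omega : omega_0 = 1}, B = {omega : omega_0 = 0};
# omega lies in T^{-n}B exactly when its n-th coordinate lies in B.
for n in (1, 4, 8):
    joint = np.mean((paths[:, 0] == 1) & (paths[:, n] == 0))
    print(n, joint)  # each estimate is near P(A) * P(B) = 0.25
```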

We will now be interested in the behavior of the visiting frequencies $$P_n(A) := \frac{1}{n}\sum_{k=0}^{n-1} I_A(T^k\omega),$$ where $I_A$ is the indicator function of some bundle $A$ of sample paths. This quantity is a random variable whose distribution is induced by the distribution on the sample paths $\omega$.

Note that the traditional notion of visiting frequency (i.e., how often is the ace of spades on top of the deck?) is a special case of this concept of visiting frequency, in which the bundle $A$ is defined entirely in terms of the 0th coordinate of the sample path $\omega = (\ldots, \omega_{-2}, \omega_{-1}, \omega_0, \omega_1, \omega_2, \ldots)$. In the general case, the question of whether $\omega \in A$ could involve information stored at arbitrary coordinates of $\omega$ (today, tomorrow, yesterday, or any other day).
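
To make the definition concrete, here is a minimal sketch (assumptions mine: an i.i.d. coin-flip path and the coordinate-0 bundle $A = \{\omega : \omega_0 = 1\}$) that computes $P_n(A)$ along a single sample path:

```python
import numpy as np

rng = np.random.default_rng(1)

# One long sample path; T^k omega corresponds to starting the path at index k.
omega = rng.integers(0, 2, size=100_000)

def visiting_frequency(path, n):
    """P_n(A) for the bundle A = {omega : omega_0 = 1}: the fraction of the
    first n shifted paths T^k omega (k = 0, ..., n-1) that lie in A."""
    return np.mean(path[:n] == 1)

for n in (10, 1_000, 100_000):
    print(n, visiting_frequency(omega, n))  # tends toward P(A) = 0.5
```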

Theorem and Proof Strategy

The mixing theorem now says the following: Suppose that $P$ is a probability measure on the set of sample paths, and suppose that the time shift $T$ is mixing and measure-preserving with respect to this measure. Then $$P_n(A) \;\to\; P(A) \quad \text{in probability}.$$

The proof of this claim involves two things:
  1. Showing that $E[P_n(A)] = P(A)$;
  2. Showing that $\mathrm{Var}[P_n(A)] \to 0$ as $n \to \infty$.
These two facts collectively imply convergence in probability, by Chebyshev's inequality (that is, Markov's inequality applied to the squared deviation).
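
Both steps can be watched numerically. In the sketch below (again an i.i.d. fair-coin process with $A = \{\omega : \omega_0 = 1\}$, my choice for illustration), the sample mean of $P_n(A)$ stays at $P(A)$ while the sample variance shrinks:

```python
import numpy as np

rng = np.random.default_rng(2)

# Estimate E[P_n(A)] and Var[P_n(A)] over many independent sample paths,
# with A = {omega : omega_0 = 1} and an i.i.d. fair-coin measure.
for n in (10, 100, 1_000):
    frequencies = rng.integers(0, 2, size=(5_000, n)).mean(axis=1)
    print(n, frequencies.mean(), frequencies.var())
# The means hover around P(A) = 0.5; the variances decay like 1/(4n).
```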

Identity in Mean

The time shift $T$ is assumed to preserve the measure $P$. We thus have $$E[f(\omega)] = E[f(T\omega)] = E[f(T^2\omega)] = \cdots$$ for any measurable function $f$. It follows that $$E[I_A(\omega)] = E[I_A(T\omega)] = E[I_A(T^2\omega)] = \cdots$$ and that these are all equal to $P(A)$. By the linearity of expectations, we therefore get that $E[P_n(A)] = P(A)$.

This proves that $P_n(A)$ at least has the right mean. Convergence to any value other than $P(A)$ is therefore out of the question, and it now only remains to be seen that the variance of $P_n(A)$ also goes to 0 as $n$ goes to infinity.

Vanishing Variance: The Idea

Consider therefore the variance $$\mathrm{Var}[P_n(A)] = E\big[(P_n(A) - P(A))^2\big].$$ In order to expand the square under this expectation, it will be helpful to break the term $P(A)$ up into $n$ equal chunks and then write the whole thing as a sum: $$P_n(A) - P(A) = \frac{1}{n}\sum_{k=0}^{n-1}\big(I_A(T^k\omega) - P(A)\big).$$ If we square this sum of $n$ terms and take the expectation of the expansion, we will get a sum of $n^2$ cross-terms, each of which is an expected value of the form $$\mathrm{Cov}(i,j) := E\big[(I_A(T^i\omega) - P(A)) \times (I_A(T^j\omega) - P(A))\big].$$ By expanding this product and using that $E[I_A(T^k\omega)] = P(A)$, we can find that $$\mathrm{Cov}(i,j) = E\big[I_A(T^i\omega) \times I_A(T^j\omega)\big] - P(A)^2.$$ The random variable $I_A(T^i\omega) \times I_A(T^j\omega)$ takes the value 1 on $\omega$ if and only if $\omega$ will visit $A$ after both $i$ and $j$ time steps. Hence, $$\mathrm{Cov}(i,j) = P(T^{-i}A \cap T^{-j}A) - P(A)^2,$$ and we can write $$\mathrm{Var}[P_n(A)] = \frac{1}{n^2}\sum_{i=0}^{n-1}\sum_{j=0}^{n-1}\mathrm{Cov}(i,j).$$ The goal will now be to show that the sum of these $n \times n$ terms converges to 0 as we increase $n$.
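
A quick Monte Carlo check of the identity $\mathrm{Cov}(i,j) = P(T^{-i}A \cap T^{-j}A) - P(A)^2$ (again with my stand-in choices of an i.i.d. process and $A = \{\omega : \omega_0 = 1\}$):

```python
import numpy as np

rng = np.random.default_rng(3)

# Check Cov(i, j) = P(T^{-i}A ∩ T^{-j}A) - P(A)^2 by simulation,
# with A = {omega : omega_0 = 1} under an i.i.d. fair-coin measure.
paths = rng.integers(0, 2, size=(500_000, 8))
i, j, p = 2, 5, 0.5

lhs = np.mean((paths[:, i] - p) * (paths[:, j] - p))
rhs = np.mean((paths[:, i] == 1) & (paths[:, j] == 1)) - p**2
print(lhs, rhs)  # equal up to sampling noise (both near 0 for i != j here)
```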

Vanishing Variance: The Steps

Since $T$ is assumed to be measure-preserving, the value of $\mathrm{Cov}(i,j)$ is completely determined by the distance between $i$ and $j$, not their absolute values. We thus have $$\mathrm{Cov}(i,j) = \mathrm{Cov}(n+i,\, n+j)$$ for all $i$, $j$, and $n$.

By the mixing assumption, $\mathrm{Cov}(0,j) \to 0$ as $j \to \infty$. Together with the observation about $|i-j|$ above, this implies that $\mathrm{Cov}(i,j) \to 0$ whenever $|i-j| \to \infty$. If we imagine the values of $\mathrm{Cov}(i,j)$ tabulated in a huge, square grid, the numbers far away from the diagonal are thus tiny. We will now sum up the values of these $n^2$ numbers $\mathrm{Cov}(i,j)$ by breaking them into $n$ classes according to the value of $d = |i-j|$.

To do so, we upper-bound the number of members in each class by $2n$: there are, for instance, $n$ terms with $|i-j| = 0$, and $2(n-1)$ terms with $|i-j| = 1$. (This upper bound corresponds graphically to covering the $n \times n$ square with a tilted $n\sqrt{2} \times n\sqrt{2}$ square, except for some complications surrounding $d = 0$.)
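
The counting can be verified directly for a small $n$ (a throwaway check of my own, not from the book):

```python
from itertools import product

# Group the n^2 pairs (i, j) by d = |i - j|: there are n pairs with d = 0
# and 2(n - d) pairs with d >= 1, so every class has at most 2n members.
n = 8
counts = {}
for i, j in product(range(n), repeat=2):
    d = abs(i - j)
    counts[d] = counts.get(d, 0) + 1

print(counts)  # {0: 8, 1: 14, 2: 12, ...}, i.e. 2 * (n - d) for d >= 1
assert all(count <= 2 * n for count in counts.values())
```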

We thus have $$\mathrm{Var}[P_n(A)] = \frac{1}{n^2}\sum_{i=0}^{n-1}\sum_{j=0}^{n-1}\mathrm{Cov}(i,j) \;\le\; \frac{2}{n}\sum_{d=0}^{n-1}\big|\mathrm{Cov}(0,d)\big|.$$ This last quantity is, up to the factor of 2, an average of numbers that converge to 0; the average itself must therefore also converge to 0 as $n \to \infty$. This proves that $$\mathrm{Var}[P_n(A)] \to 0$$ for $n \to \infty$, and we can conclude that $P_n(A) \to P(A)$ in probability.
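
The final step leans on the fact that running averages of a sequence tending to 0 also tend to 0 (a Cesàro-mean argument); a quick numeric illustration, with $1/(d+1)$ standing in for $|\mathrm{Cov}(0,d)|$:

```python
import numpy as np

# If c_d -> 0, then the averages (1/n) * sum_{d=0}^{n-1} c_d -> 0 as well;
# here c_d = 1 / (d + 1) stands in for |Cov(0, d)|.
c = 1.0 / (1.0 + np.arange(1_000_000))
for n in (10, 1_000, 1_000_000):
    print(n, c[:n].mean())  # the running averages shrink toward 0
```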

Applicability

Note that this mixing assumption is very strong. A periodic Markov chain, for instance, will usually not be mixing: consider a process that alternates between odd and even values; the joint probability $P(\text{evens} \cap T^{-n}\,\text{odds})$ will then alternate between $0$ and $P(\text{evens})$ as $n$ runs through the natural numbers, and so it cannot converge to the product $P(\text{evens})P(\text{odds})$. In many cases of practical interest, we therefore do in fact need the increased generality of the classic ergodic theorem.
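
To see the failure concretely, here is a sketch of such a parity-alternating chain (deterministic alternation started uniformly, my choice of the simplest case):

```python
import numpy as np

rng = np.random.default_rng(4)

# A chain that alternates parity deterministically: X_{t+1} = X_t + 1 (mod 2),
# started uniformly, so that P(evens) = P(odds) = 1/2.
starts = rng.integers(0, 2, size=200_000)

for n in range(6):
    X0, Xn = starts, (starts + n) % 2
    joint = np.mean((X0 == 0) & (Xn == 1))  # P(evens ∩ T^{-n} odds)
    print(n, joint)
# Alternates between 0 and 1/2, never approaching P(evens) * P(odds) = 1/4.
```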
