## Thursday, September 3, 2015

### Billingsley: Ergodic Theory and Information (1965)

I'm reading this book (now for the second time) for its proof of Birkhoff's ergodic theorem. I was held up by lots of technical details the first time around, but I've gotten a bit further now.

### Definitions

The set-up is the following: A set $X$ is given along with a time shift $T: X\rightarrow X$. We further have a probability distribution on the set of sample paths $\Omega=X^{\infty}$. For each integrable function $f: \Omega\rightarrow \mathbb{R}$, we consider the time-average$$A_n f(\omega) \;=\; \frac{1}{n} \sum_{i=0}^{n-1} f(T^i \omega).$$We are interested in investigating whether such averages converge as $n\rightarrow\infty$. If so, we would also like to know whether the limit is the same on all sample paths.

Notice that $f$ is a function of the whole sample path, not just of a single coordinate. The function $f$ could for instance measure the difference between two specific coordinates, or it could return a limiting frequency associated with the sample path.

If $f$ has the same value on the entire orbit $$\omega,\, T\omega,\, T^2\omega,\, T^3\omega,\, \ldots$$we say that the function $f$ is time-invariant. Global properties like quantifications or tail events are time-invariant.

A set $A$ is similarly time-invariant if $TA=A$. (I like to call such invariants as "trapping sets.") This is a purely set-theoretic notion, and doesn't involve the probability measure on the sample paths. However, we say that an invariant set $A$ is a non-trivial if $0 < P(A) < 1$.

### Statement

One of the things that are confusing about the so-called ergodic theorem is that it actually doesn't involve ergodicity very centrally. Instead it makes the following statement:
1. Suppose that the time shift $T$ is measure-preserving in the sense that $P(A)=P(TA)$ for all measurable sets $A$. Then the time-averages of any integrable function converges almost everywhere:$$P\left\{\omega : \lim_{n\rightarrow \infty} A_n f(\omega) \textrm{ exists} \right\} \;=\; 1.$$
2. Suppose in addition that the time shift is ergodic in the sense that all its trapping are trivial with respect to $P$. Then the time averages are all the same on a set with $P$-probability 1:$$P \left\{(\omega_1, \omega_2) : \lim_{n\rightarrow\infty} A_n f(\omega_1) \,=\, \lim_{n\rightarrow\infty} A_n f(\omega_2) \right\} \;=\; 1.$$
Almost all the mathematical energy is spent proving the first part of this theorem. The second part is just a small addendum: Suppose a function $f$ has time averages which are not almost everywhere constant. Then the threshold set$$\{A_n f \leq \tau\}$$must have a probability strictly between 0 and 1 for some threshold $\tau$. (This is a general feature of random variables that are not a.e. constant.)

But since the limiting time-average is a time-invariant property (stating something about the right-hand tail of the sample path), this thresholding set is an invariant set. Thus, if the limiting time-averages are not a.e. unique, the time shift has a non-trivial trapping set.

So again, the heavy lifting lies in proving that the time-averages actually converge on all sample paths when the time shift is measure-preserving. Billingsley gives two proofs of this, one following idea by von Neumann, and one following Birkhoff.

### The Von Neumann Proof

Let's say that a function is "flat" if it has converging time-averages. We can then summarize von Neumann's proof along roughly these lines:
1. Time-invariant functions are always flat.
2. Increment functions of the form $$s(\omega) \;=\; g(\omega) - g(T\omega),$$ where $g$ is any function, are also flat.
3. Flatness is preserved by taking linear combinations.
4. Any square-integrable function can be obtained as the $L^2$ limit of (linear combinations of) invariant functions and increment functions.
5. Moreover, the $L^2$ limit of a sequence of flat $L^2$ functions is flat; in other words, flatness is preserved by limits.
6. Hence, all $L^2$ functions are flat.
This argument seems somewhat magical at first, but I think more familiarity with the concept of orthogonality and total sets in the $L^p$ Hilbert spaces could make it seem more natural.

Just as an example of how this might work, consider the coordinate function $f(\omega) = X_0(\omega)$. In order to obtain this function as a limit of our basis functions, fix a decay rate $r < 1$ and consider the function$$g(\omega) \;=\; X_0(\omega) + r X_1(\omega) + r^2 X_2(\omega) + r^3 X_3(\omega) + \cdots$$Then for $r \rightarrow 1$, we get a converging series of increment functions, $g(\omega) - g(T\omega) \;\rightarrow\; X_0(\omega)$. If the coordinates are bounded (as in a Bernoulli process), this convergence is uniform across all sample paths. Billingsley uses an abstract and non-constructive orthogonality argument to show that such decompositions are always possible.

In order to show that the limit of flat functions is also flat, suppose that $f_1, f_2, f_3, \ldots$ are all flat, and that $f_k \rightarrow f^*$. We can then prove that the averages $A_1f^*, A_2f^*, A_3f^*, \ldots$ form a Cauchy sequence by using the approximation $f^* \approx f_k$ to show that $$| A_n f^* - A_m f^* | \;\approx\; | A_n f_k - A_m f_k | \;\rightarrow\; 0.$$ Of course, the exact argument needs a bit more detail, but this is the idea.

After going through this $L^2$ proof, Billingsley explains how to extend the proof to all of $L^1$. Again, this is a limit argument, showing that the flatness is preserved as we obtain the $L^1$ functions as limits of $L^2$ functions.

### The Birkhoff Proof

The second proof follows Birkhoff's 1931 paper. The idea is here to derive the the ergodic theorem from the so-called maximal ergodic theorem (which Birkhoff just calls "the lemma"). This theorem states that if $T$ preserves the measure $P$, then$$P\left\{ \exists n: A_n f > \varepsilon \right\} \;\leq\; \frac{1}{\varepsilon} \int_{\{f>\varepsilon\}} f \;dP.$$I am not going to go through the proof, but here a few observations that might give a hint about how to interpret this theorem:
1. For any arbitrary function, we have the inequality $$P(f>\varepsilon) \;\leq\; \frac{1}{\varepsilon} \int_{\{f>\varepsilon\}}f\;dP.$$This is the generalization of Markov's inequality to random variables that are not necessarily positive.
2. If $f$ is time-invariant, then $A_n f = f$, because the average $A_n f$ then consists of $n$ copies of the number $$f(\omega) \;=\; f(T\omega) \;=\; f(T^2\omega) \;=\; \cdots \;=\; f(T^{n-1}\omega).$$ In this case, the theorem thus follows from the generalized Markov bound above.
3. Even though we do not assume that $f(\omega)=f(T\omega)$ for a fixed sample path $\omega$, all distributional statements we can make about $f(\omega)$ also apply to $f(T\omega)$ because $T$ is assumed to be measure-preserving; for instance, the two functions have the same mean when $\omega \sim P$. This fact of course plays a crucial part in the proof.
4. By putting $(f,\varepsilon) := (-f,-\varepsilon)$, we see that the maximal ergodic theorem is equivalent to$$P\left\{ \exists n: A_n f < \varepsilon \right\} \;\leq\; \frac{1}{\varepsilon} \int_{\{f<\varepsilon\}} f \;dP.$$
The maximal ergodic theorem can be used to prove the general (or "pointwise") ergodic theorem. The trick is to find both an upper and a lower bound on the non-convergence probability
$$q \;=\; P\{ \omega :\, \liminf A_n f(\omega) < a < b < \limsup A_n f(\omega) \},$$with the goal of showing that these bounds can only be satisfied for all $a<b$ when $q=0$.

One of the things I found confusing about the ergodic theorem is that it is often cited as a kind of uniqueness result: It states, in a certain sense, that all sample paths have the same time-averages.

However, the problem is that when the time shift has several trapping sets, there are multiple probability distributions that make the time shift ergodic. Suppose that $T$ is non-ergodic with respect to some measure $P$; and suppose further that $A$ is a minimal non-trivial trapping set in the sense that it does not itself contain any trapping sets $B$ with $0 < P(B) < P(A) < 1$. Then we can construct a measure with respect to which $T$ will be ergodic, namely, the conditional measure $P(\,\cdot\,|\,A)$.

This observation reflects the fact that if we sample $\omega \sim P$, the sample path we draw may lie in any of the minimal non-trivial trapping sets for $T$. Although it might not be clear based on a finite segment of $\omega$ (because such a segment might not reveal which trapping set we're in), such a sample path will stay inside this minimal trap set forever, and thus converge to the time-average characteristic of that set.

A small complication here is that even a minimal trapping set might in fact contain other trapping sets that are smaller: For instance, the set of coin flipping sequences whose averages converge to 0 are a trapping set, but so is the sequence of all 0s. However, this singleton set will either have positive probability (in which case the limit set is not minimal) or probability 0 (in which case it can never come up as a sample from $P$).

Lastly, suppose the sample path $\omega$ is not sampled from $P$, but from some other measure $Q$ which is not preserved by $T$. Can we say that $A_nQ\rightarrow P$?

Not always, but a sufficient condition for this two be the case is that $Q$ is absolutely continuous with respect to $P$ ($Q \ll P$). If that is the case, then the $Q$-probability of observing a non-converging average will be 0, since the $P$-probability of this event is 0, and $P$ dominates $Q$.

I am not sure whether this condition is also necessary. A degenerate measure $Q$ which places all mass on a single sample path, or even on countably many, can often be detected by some integrable function $f$ such as the indicator function of everything except the points on the countably many countable sample paths.