I'm reading this book (now for the second time) for its proof of Birkhoff's ergodic theorem. I was held up by lots of technical details the first time around, but I've gotten a bit further now.
Definitions
The set-up is the following: A set X is given, and we consider the space of sample paths \Omega = X^{\infty} together with a time shift T: \Omega \rightarrow \Omega (canonically, the shift (T\omega)_k = \omega_{k+1}). We further have a probability distribution P on \Omega. For each integrable function f: \Omega \rightarrow \mathbb{R}, we consider the time-average

A_n f(\omega) \;=\; \frac{1}{n} \sum_{i=0}^{n-1} f(T^i \omega).

We are interested in investigating whether such averages converge as n \rightarrow \infty. If so, we would also like to know whether the limit is the same on all sample paths.
Notice that f is a function of the whole sample path, not just of a single coordinate. The function f could for instance measure the difference between two specific coordinates, or it could return a limiting frequency associated with the sample path.
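As a minimal numerical sketch (my own, not from the book), here is how one can approximate A_n f for the fair-coin shift, with f(\omega) = |\omega_0 - \omega_1| playing the role of a "difference between two specific coordinates"; the choice of f and the parameters are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# A long finite segment standing in for an infinite fair-coin sample path.
path = rng.integers(0, 2, size=100_000)

def f(omega):
    # A function of the whole path that happens to look only at two coordinates.
    return abs(omega[0] - omega[1])

def time_average(path, n):
    # A_n f(omega) = (1/n) * sum_{i<n} f(T^i omega), where T^i omega = path[i:].
    return np.mean([f(path[i:]) for i in range(n)])

for n in [10, 100, 1_000, 10_000]:
    print(n, time_average(path, n))  # should settle near E|X_0 - X_1| = 1/2
```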
If f has the same value on the entire orbit \omega,\, T\omega,\, T^2\omega,\, T^3\omega,\, \ldots, we say that the function f is time-invariant. Global properties, such as statements quantified over all coordinates or tail events, are time-invariant.
A set A is similarly time-invariant if TA = A. (I like to call such invariant sets "trapping sets.") This is a purely set-theoretic notion, and doesn't involve the probability measure on the sample paths. However, we say that an invariant set A is non-trivial if 0 < P(A) < 1.
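As a concrete example (my own): for the fair-coin shift, the set of paths whose empirical frequency converges to one half,

A \;=\; \left\{ \omega : \lim_{n\rightarrow\infty} \frac{1}{n} \sum_{i=0}^{n-1} \omega_i \,=\, \frac{1}{2} \right\},

satisfies TA = A, since dropping the first coordinate does not change the limiting frequency. By the law of large numbers, P(A) = 1, so this particular trapping set is trivial.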
Statement
One of the things that is confusing about the so-called ergodic theorem is that it doesn't actually involve ergodicity very centrally. Instead, it makes the following statement:
- Suppose that the time shift T is measure-preserving in the sense that P(A) = P(TA) for all measurable sets A. Then the time-averages of any integrable function converge almost everywhere: P\left\{\omega : \lim_{n\rightarrow \infty} A_n f(\omega) \textrm{ exists} \right\} \;=\; 1.
- Suppose in addition that the time shift is ergodic in the sense that all its trapping sets are trivial with respect to P. Then the time averages agree on almost all pairs of sample paths: (P \times P) \left\{(\omega_1, \omega_2) : \lim_{n\rightarrow\infty} A_n f(\omega_1) \,=\, \lim_{n\rightarrow\infty} A_n f(\omega_2) \right\} \;=\; 1.
Almost all the mathematical energy is spent proving the first part of this theorem. The second part is just a small addendum: Suppose a function f has limiting time averages which are not almost everywhere constant. Then the threshold set

\{\omega : \lim_{n\rightarrow\infty} A_n f(\omega) \leq \tau\}

must have a probability strictly between 0 and 1 for some threshold \tau. (This is a general feature of random variables that are not a.e. constant.) But since the limiting time-average is a time-invariant property (it states something about the right-hand tail of the sample path), this threshold set is an invariant set. Thus, if the limiting time-averages are not a.e. unique, the time shift has a non-trivial trapping set.
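To spell out that parenthetical: if Z is any random variable that is not a.e. constant, and no threshold worked, then P(Z \leq \tau) \in \{0, 1\} for every \tau, and setting

c \;=\; \inf\{\tau : P(Z \leq \tau) = 1\}

would give Z = c almost everywhere, a contradiction.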
So again, the heavy lifting lies in proving that the time-averages actually converge on almost all sample paths when the time shift is measure-preserving. Billingsley gives two proofs of this, one following an idea of von Neumann's, and one following Birkhoff.
The Von Neumann Proof
Let's say that a function is "flat" if it has converging time-averages. We can then summarize von Neumann's proof along roughly these lines:
- Time-invariant functions are always flat.
- Increment functions of the form s(\omega) \;=\; g(\omega) - g(T\omega), where g is any function, are also flat (see the computation after this list).
- Flatness is preserved by taking linear combinations.
- Any square-integrable function can be obtained as the L^2 limit of (linear combinations of) invariant functions and increment functions.
- Moreover, the L^2 limit of a sequence of flat L^2 functions is flat; in other words, flatness is preserved by limits.
- Hence, all L^2 functions are flat.
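To see why increment functions are flat (my computation, with flatness read in the L^2 sense), note that the sum telescopes:

A_n s(\omega) \;=\; \frac{1}{n} \sum_{i=0}^{n-1} \big( g(T^i\omega) - g(T^{i+1}\omega) \big) \;=\; \frac{g(\omega) - g(T^n\omega)}{n},

and since T preserves P, we have \|g \circ T^n\|_2 = \|g\|_2, so \|A_n s\|_2 \leq 2\|g\|_2 / n \rightarrow 0.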
This argument seems somewhat magical at first, but I think more familiarity with the concepts of orthogonality and total sets in the Hilbert space L^2 could make it seem more natural.
Just as an example of how this might work, consider the coordinate function f(\omega) = X_0(\omega). In order to obtain this function as a limit of our basis functions, fix a decay rate r < 1 and consider the function

g(\omega) \;=\; X_0(\omega) + r X_1(\omega) + r^2 X_2(\omega) + r^3 X_3(\omega) + \cdots

Then as r \rightarrow 1, the increment functions g(\omega) - g(T\omega) converge to X_0(\omega). If the coordinates are independent, centered, and square-integrable (as in a centered Bernoulli process), this convergence takes place in L^2; the computation below makes this explicit. Billingsley uses an abstract and non-constructive orthogonality argument to show that such decompositions are always possible.
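Concretely (my computation, not Billingsley's), assuming the coordinates X_k are i.i.d. with mean zero and variance \sigma^2:

g(\omega) - g(T\omega) \;=\; X_0(\omega) - (1-r) \sum_{k \geq 1} r^{k-1} X_k(\omega),

so the L^2 error satisfies \| g - g \circ T - X_0 \|_2^2 \;=\; (1-r)^2 \sigma^2 \sum_{k \geq 1} r^{2(k-1)} \;=\; \frac{(1-r)\,\sigma^2}{1+r}, which vanishes as r \rightarrow 1.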
In order to show that the limit of flat functions is also flat, suppose that f_1, f_2, f_3, \ldots are all flat, and that f_k \rightarrow f^*. We can then prove that the averages A_1 f^*, A_2 f^*, A_3 f^*, \ldots form a Cauchy sequence by using the approximation f^* \approx f_k to show that

| A_n f^* - A_m f^* | \;\approx\; | A_n f_k - A_m f_k | \;\rightarrow\; 0.

Of course, the exact argument needs a bit more detail, but this is the idea.
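In the L^2 norm, that "bit more detail" might run as follows (my rendering): since T preserves P, \|A_n h\|_2 \leq \|h\|_2 for every h, and hence

\| A_n f^* - A_m f^* \|_2 \;\leq\; \| A_n (f^* - f_k) \|_2 + \| A_n f_k - A_m f_k \|_2 + \| A_m (f_k - f^*) \|_2 \;\leq\; 2 \| f^* - f_k \|_2 + \| A_n f_k - A_m f_k \|_2.

Choosing k large makes the first term small, after which the flatness of f_k kills the second.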
After going through this L^2 proof, Billingsley explains how to extend the proof to all of L^1. Again, this is a limit argument, showing that flatness is preserved as we obtain the L^1 functions as limits of L^2 functions.
The Birkhoff Proof
The second proof follows Birkhoff's 1931 paper. The idea here is to derive the ergodic theorem from the so-called maximal ergodic theorem (which Birkhoff just calls "the lemma"). This theorem states that if T preserves the measure P, then

P\left\{ \exists n: A_n f > \varepsilon \right\}
\;\leq\;
\frac{1}{\varepsilon} \int_{\{\exists n:\, A_n f > \varepsilon\}} f \;dP.
I am not going to go through the proof, but here are a few observations that might give a hint about how to interpret this theorem:
- For an arbitrary function, we have the inequality P(f > \varepsilon) \;\leq\; \frac{1}{\varepsilon} \int_{\{f>\varepsilon\}} f \;dP. This is the generalization of Markov's inequality to random variables that are not necessarily positive.
- If f is time-invariant, then A_n f = f, because the average A_n f then consists of n copies of the number f(\omega) \;=\; f(T\omega) \;=\; f(T^2\omega) \;=\; \cdots \;=\; f(T^{n-1}\omega). In this case \{\exists n: A_n f > \varepsilon\} = \{f > \varepsilon\}, and the theorem reduces to the generalized Markov bound above.
- Even though we do not assume that f(\omega)=f(T\omega) for a fixed sample path \omega, all distributional statements we can make about f(\omega) also apply to f(T\omega) because T is assumed to be measure-preserving; for instance, the two functions have the same mean when \omega \sim P. This fact of course plays a crucial part in the proof.
- By putting (f,\varepsilon) := (-f,-\varepsilon), we see that the maximal ergodic theorem is equivalent to P\left\{ \exists n: A_n f < \varepsilon \right\}
\;\leq\;
\frac{1}{\varepsilon} \int_{\{\exists n:\, A_n f < \varepsilon\}} f \;dP.
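The substitution works because averaging is linear (a quick check of my own): A_n(-f) = -A_n f, so

\{\exists n: A_n(-f) > -\varepsilon\} \;=\; \{\exists n: A_n f < \varepsilon\}, \qquad \frac{1}{-\varepsilon} \int (-f) \, dP \;=\; \frac{1}{\varepsilon} \int f \, dP.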
The maximal ergodic theorem can be used to prove the general (or "pointwise") ergodic theorem. The trick is to find both an upper and a lower bound on the non-convergence probability

q \;=\; P\{ \omega :\, \liminf A_n f(\omega) < a < b < \limsup A_n f(\omega) \},

with the goal of showing that these bounds can be satisfied for all a < b only if q = 0.
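Roughly, and in my words: the event B = \{\liminf A_n f < a < b < \limsup A_n f\} is time-invariant, so the maximal ergodic theorem can be applied with everything restricted to B. Since \limsup A_n f > b on B, and \liminf A_n f < a on B, the two versions of the theorem sandwich the integral:

b \, q \;\leq\; \int_B f \, dP \;\leq\; a \, q,

which forces (b - a) \, q \leq 0, and hence q = 0.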
Comments
One of the things I found confusing about the ergodic theorem is that it is often cited as a kind of uniqueness result: It states, in a certain sense, that all sample paths have the same time-averages.
However, the problem is that when the time shift has several trapping sets, there are multiple probability distributions that make the time shift ergodic. Suppose that T is non-ergodic with respect to some measure P, and suppose further that A is a minimal non-trivial trapping set in the sense that it does not itself contain any trapping sets B with 0 < P(B) < P(A) < 1. Then we can construct a measure with respect to which T will be ergodic, namely, the conditional measure P(\,\cdot\,|\,A).
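A concrete example (mine, not the book's): flip a fair coin once to pick a bias, then generate i.i.d. flips with that bias. Writing P_p for the law of i.i.d. Bernoulli(p) flips,

P \;=\; \tfrac{1}{2} P_{1/3} + \tfrac{1}{2} P_{2/3}, \qquad A \;=\; \Big\{\omega : \tfrac{1}{n} \textstyle\sum_{i<n} \omega_i \rightarrow \tfrac{1}{3}\Big\}, \qquad P(\,\cdot\,|\,A) \;=\; P_{1/3}.

The shift is non-ergodic under the mixture P, but A is a non-trivial trapping set with P(A) = 1/2, and conditioning on it recovers the ergodic measure P_{1/3}.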
This observation reflects the fact that if we sample \omega \sim P, the sample path we draw may lie in any of the minimal non-trivial trapping sets for T. Although a finite segment of \omega might not reveal which trapping set we're in, the sample path will stay inside its minimal trapping set forever, and thus converge to the time-average characteristic of that set.
A small complication here is that even a minimal trapping set might in fact contain other trapping sets that are smaller: for instance, the set of coin-flipping sequences whose averages converge to 0 is a trapping set, but so is the singleton containing the all-0 sequence. However, this singleton will either have positive probability (in which case the limit set is not minimal) or probability 0 (in which case it will almost surely never come up as a sample from P).
Lastly, suppose the sample path \omega is not sampled from P, but from some other measure Q which is not preserved by T. Can we still say that the time averages A_n f(\omega) converge, Q-almost surely, to the same limits they would have under P?
Not always, but a sufficient condition for this to be the case is that Q is absolutely continuous with respect to P (Q \ll P). If that is the case, then the Q-probability of observing a non-converging average will be 0, since the P-probability of this event is 0, and P dominates Q.
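As a sketch of the "not always" (my own example, with illustrative parameters): let Q be the point mass on a deterministic path built from blocks of 0s and 1s of doubling length. This Q is singular with respect to the fair-coin measure P, and its running averages oscillate forever instead of converging.

```python
import numpy as np

# A deterministic path: blocks of 0s and 1s whose lengths double each time.
# The point mass Q on this path is not absolutely continuous w.r.t. the
# Bernoulli(1/2) measure P, and its time averages never settle down.
blocks = []
bit, length = 0, 1
while sum(len(b) for b in blocks) < 2**20:
    blocks.append([bit] * length)
    bit, length = 1 - bit, 2 * length
path = np.concatenate(blocks)

# With f(omega) = omega_0, the time average A_n f is the running mean.
averages = np.cumsum(path) / np.arange(1, len(path) + 1)
for k in range(10, 21):
    n = 2**k
    print(n, averages[n - 1])  # alternates roughly between 1/3 and 2/3
```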
I am not sure whether this condition is also necessary. A degenerate measure Q which places all mass on a single sample path, or even on countably many, can often be detected by some integrable function f, such as the indicator function of the complement of those countably many sample paths.