Wednesday, September 16, 2015

Billingsley: Ergodic Theory and Information (1965), pp. 12–14

In the process of proving the ergodic theorem, Billingsley also provides, in passing, a proof of a weaker theorem about mixing processes: for a class of processes satisfying a certain limiting-independence condition, visiting frequencies converge to probabilities.

Definitions

More precisely, a time shift $T$ is said to be mixing (p. 12) if $$P(A \cap T^{-n}B) \;\to\; P(A)P(B) \quad \text{as } n \to \infty$$ for all measurable sets $A$ and $B$. The process is thus mixing if information about the present is almost entirely independent of the far future. (A sample path $\omega$ belongs to the set $T^{-n}B$ if $\omega$ will visit $B$ after $n$ time shifts forward.) Billingsley doesn't comment on the choice of the term "mixing," but I assume it should be read as an ill-chosen synonym for "shuffling" in this context.
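
As a quick sanity check (my own illustration, not Billingsley's), the following sketch estimates $P(A \cap T^{-n}B)$ for an i.i.d. coin-flip process, taking $A = \{\omega : \omega_0 = 1\}$ and $B = \{\omega : \omega_0 = 0\}$; independence across coordinates makes the product form hold exactly for every $n \ge 1$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Many finite windows of an i.i.d. fair-coin process; coordinate k of a row
# plays the role of omega_k, and the shift T advances the index by one.
paths = rng.integers(0, 2, size=(200_000, 16))

# A = {omega : omega_0 = 1}, B = {omega : omega_0 = 0};
# omega lies in T^{-n}B exactly when its n-th coordinate lies in B.
for n in (1, 4, 8):
    joint = np.mean((paths[:, 0] == 1) & (paths[:, n] == 0))
    print(n, joint)  # each estimate is near P(A) * P(B) = 0.25
```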

We will now be interested in the behavior of the visiting frequencies $$P_n(A) := \frac{1}{n}\sum_{k=0}^{n-1} I_A(T^k\omega),$$ where $I_A$ is the indicator function of some bundle $A$ of sample paths. This quantity is a random variable whose distribution is induced by the distribution on the sample paths $\omega$.

Note that the traditional notion of visiting frequency (i.e., how often is the ace of spades on top of the deck?) is a special case of this concept of visiting frequency, in which the bundle $A$ is defined entirely in terms of the 0th coordinate of the sample path $\omega = (\ldots, \omega_{-2}, \omega_{-1}, \omega_0, \omega_1, \omega_2, \ldots)$. In the general case, the question of whether $\omega \in A$ could involve information stored at arbitrary coordinates of $\omega$ (today, tomorrow, yesterday, or any other day).
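
To make the definition concrete, here is a minimal sketch (assumptions mine: an i.i.d. coin-flip path and the coordinate-0 bundle $A = \{\omega : \omega_0 = 1\}$) that computes $P_n(A)$ along a single sample path:

```python
import numpy as np

rng = np.random.default_rng(1)

# One long sample path; T^k omega corresponds to starting the path at index k.
omega = rng.integers(0, 2, size=100_000)

def visiting_frequency(path, n):
    """P_n(A) for the bundle A = {omega : omega_0 = 1}: the fraction of the
    first n shifted paths T^k omega (k = 0, ..., n-1) that lie in A."""
    return np.mean(path[:n] == 1)

for n in (10, 1_000, 100_000):
    print(n, visiting_frequency(omega, n))  # tends toward P(A) = 0.5
```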

Theorem and Proof Strategy

The mixing theorem now says the following: Suppose that $P$ is a probability measure on the set of sample paths, and suppose that the time shift $T$ is mixing and measure-preserving with respect to this measure. Then $$P_n(A) \;\to\; P(A) \quad \text{in probability}.$$

The proof of this claim involves two things:
  1. Showing that $E[P_n(A)] = P(A)$;
  2. Showing that $\mathrm{Var}[P_n(A)] \to 0$ as $n \to \infty$.
These two facts collectively imply convergence in probability, by Chebyshev's inequality (that is, Markov's inequality applied to the squared deviation).
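
Both steps can be watched numerically. In the sketch below (again an i.i.d. fair-coin process with $A = \{\omega : \omega_0 = 1\}$, my choice for illustration), the sample mean of $P_n(A)$ stays at $P(A)$ while the sample variance shrinks:

```python
import numpy as np

rng = np.random.default_rng(2)

# Estimate E[P_n(A)] and Var[P_n(A)] over many independent sample paths,
# with A = {omega : omega_0 = 1} and an i.i.d. fair-coin measure.
for n in (10, 100, 1_000):
    frequencies = rng.integers(0, 2, size=(5_000, n)).mean(axis=1)
    print(n, frequencies.mean(), frequencies.var())
# The means hover around P(A) = 0.5; the variances decay like 1/(4n).
```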

Identity in Mean

The time shift $T$ is assumed to preserve the measure $P$. We thus have $$E[f(\omega)] = E[f(T\omega)] = E[f(T^2\omega)] = \cdots$$ for any measurable function $f$. It follows that $$E[I_A(\omega)] = E[I_A(T\omega)] = E[I_A(T^2\omega)] = \cdots$$ and that these are all equal to $P(A)$. By the linearity of expectations, we therefore get that $E[P_n(A)] = P(A)$.

This proves that $P_n(A)$ at least has the right mean. Convergence to any value other than $P(A)$ is therefore out of the question, and it now only remains to be seen that the variance of $P_n(A)$ also goes to 0 as $n$ goes to infinity.

Vanishing Variance: The Idea

Consider therefore the variance $$\mathrm{Var}[P_n(A)] = E\big[(P_n(A) - P(A))^2\big].$$ In order to expand the square under this expectation, it will be helpful to break the term $P(A)$ up into $n$ equal chunks and then write the whole thing as a sum: $$P_n(A) - P(A) = \frac{1}{n}\sum_{k=0}^{n-1}\big(I_A(T^k\omega) - P(A)\big).$$ If we square this sum of $n$ terms and take the expectation of the expansion, we will get a sum of $n^2$ cross-terms, each of which is an expected value of the form $$\mathrm{Cov}(i,j) := E\big[(I_A(T^i\omega) - P(A)) \times (I_A(T^j\omega) - P(A))\big].$$ By expanding this product and using that $E[I_A(T^k\omega)] = P(A)$, we can find that $$\mathrm{Cov}(i,j) = E\big[I_A(T^i\omega) \times I_A(T^j\omega)\big] - P(A)^2.$$ The random variable $I_A(T^i\omega) \times I_A(T^j\omega)$ takes the value 1 on $\omega$ if and only if $\omega$ will visit $A$ after both $i$ and $j$ time steps. Hence, $$\mathrm{Cov}(i,j) = P(T^{-i}A \cap T^{-j}A) - P(A)^2,$$ and we can write $$\mathrm{Var}[P_n(A)] = \frac{1}{n^2}\sum_{i=0}^{n-1}\sum_{j=0}^{n-1}\mathrm{Cov}(i,j).$$ The goal will now be to show that the sum of these $n \times n$ terms converges to 0 as we increase $n$.
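
A quick Monte Carlo check of the identity $\mathrm{Cov}(i,j) = P(T^{-i}A \cap T^{-j}A) - P(A)^2$ (again with my stand-in choices of an i.i.d. process and $A = \{\omega : \omega_0 = 1\}$):

```python
import numpy as np

rng = np.random.default_rng(3)

# Check Cov(i, j) = P(T^{-i}A ∩ T^{-j}A) - P(A)^2 by simulation,
# with A = {omega : omega_0 = 1} under an i.i.d. fair-coin measure.
paths = rng.integers(0, 2, size=(500_000, 8))
i, j, p = 2, 5, 0.5

lhs = np.mean((paths[:, i] - p) * (paths[:, j] - p))
rhs = np.mean((paths[:, i] == 1) & (paths[:, j] == 1)) - p**2
print(lhs, rhs)  # equal up to sampling noise (both near 0 for i != j here)
```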

Vanishing Variance: The Steps

Since $T$ is assumed to be measure-preserving, the value of $\mathrm{Cov}(i,j)$ is completely determined by the distance between $i$ and $j$, not their absolute values. We thus have $$\mathrm{Cov}(i,j) = \mathrm{Cov}(n+i,\, n+j)$$ for all $i$, $j$, and $n$.

By the mixing assumption, $\mathrm{Cov}(0,j) \to 0$ as $j \to \infty$. Together with the observation about $|i-j|$ above, this implies that $\mathrm{Cov}(i,j) \to 0$ whenever $|i-j| \to \infty$. If we imagine the values of $\mathrm{Cov}(i,j)$ tabulated in a huge, square grid, the numbers far away from the diagonal are thus tiny. We will now sum up the values of these $n^2$ numbers $\mathrm{Cov}(i,j)$ by breaking them into $n$ classes according to the value of $d = |i-j|$.

To do so, we upper-bound the number of members in each class by $2n$: there are, for instance, $n$ terms with $|i-j| = 0$, and $2(n-1)$ terms with $|i-j| = 1$. (This upper bound corresponds graphically to covering the $n \times n$ square with a tilted $n\sqrt{2} \times n\sqrt{2}$ square, except for some complications surrounding $d = 0$.)
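
The counting can be verified directly for a small $n$ (a throwaway check of my own, not from the book):

```python
from itertools import product

# Group the n^2 pairs (i, j) by d = |i - j|: there are n pairs with d = 0
# and 2(n - d) pairs with d >= 1, so every class has at most 2n members.
n = 8
counts = {}
for i, j in product(range(n), repeat=2):
    d = abs(i - j)
    counts[d] = counts.get(d, 0) + 1

print(counts)  # {0: 8, 1: 14, 2: 12, ...}, i.e. 2 * (n - d) for d >= 1
assert all(count <= 2 * n for count in counts.values())
```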

We thus have $$\mathrm{Var}[P_n(A)] = \frac{1}{n^2}\sum_{i=0}^{n-1}\sum_{j=0}^{n-1}\mathrm{Cov}(i,j) \;\le\; \frac{2}{n}\sum_{d=0}^{n-1}\big|\mathrm{Cov}(0,d)\big|.$$ This last quantity is, up to the factor of 2, an average of numbers that converge to 0; the average itself must therefore also converge to 0 as $n \to \infty$. This proves that $$\mathrm{Var}[P_n(A)] \to 0$$ for $n \to \infty$, and we can conclude that $P_n(A) \to P(A)$ in probability.
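
The final step leans on the fact that running averages of a sequence tending to 0 also tend to 0 (a Cesàro-mean argument); a quick numeric illustration, with $1/(d+1)$ standing in for $|\mathrm{Cov}(0,d)|$:

```python
import numpy as np

# If c_d -> 0, then the averages (1/n) * sum_{d=0}^{n-1} c_d -> 0 as well;
# here c_d = 1 / (d + 1) stands in for |Cov(0, d)|.
c = 1.0 / (1.0 + np.arange(1_000_000))
for n in (10, 1_000, 1_000_000):
    print(n, c[:n].mean())  # the running averages shrink toward 0
```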

Applicability

Note that this mixing assumption is very strong. A periodic Markov chain, for instance, will usually not be mixing: consider a process that alternates between odd and even values; the joint probability $P(\text{evens} \cap T^{-n}\,\text{odds})$ will then alternate between $0$ and $P(\text{evens})$ as $n$ runs through the natural numbers, and so it cannot converge to the product $P(\text{evens})P(\text{odds})$. In many cases of practical interest, we therefore do in fact need the increased generality of the classic ergodic theorem.
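
To see the failure concretely, here is a sketch of such a parity-alternating chain (deterministic alternation started uniformly, my choice of the simplest case):

```python
import numpy as np

rng = np.random.default_rng(4)

# A chain that alternates parity deterministically: X_{t+1} = X_t + 1 (mod 2),
# started uniformly, so that P(evens) = P(odds) = 1/2.
starts = rng.integers(0, 2, size=200_000)

for n in range(6):
    X0, Xn = starts, (starts + n) % 2
    joint = np.mean((X0 == 0) & (Xn == 1))  # P(evens ∩ T^{-n} odds)
    print(n, joint)
# Alternates between 0 and 1/2, never approaching P(evens) * P(odds) = 1/4.
```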
