
Wednesday, September 9, 2015

Billingsley: Probability and Measure, Ch. 1.4

Prompted by what I found to be a quite confusing presentation of Kolmogorov's zero-one law in Patrick Billingsley's book, I've been reflecting a bit on the essence of the proof, and I think I've now come to a deeper understanding of the core of the issue.

The theorem can be stated as follows: Let $S$ be an infinite collection of events equipped with a probability measure $P$ (which, by Kolmogorov's extension theorem, can be defined exhaustively on the finite subsets of $S$). Suppose further that $A$ is some event in the $\sigma$-extension of $S$ which is independent of any finite selection of events from $S$. Then $A$ either has probability $P(A)=0$ or $P(A)=1$.

The proof is almost evident from this formulation: Since $A$ is independent of any finite selection of events from $S$, the measures $P(\,\cdot\,)$ and $P(\,\cdot\,|\,A)$ coincide on all events that can be defined in terms of a finite number of events from $S$. But by Kolmogorov's extension theorem, this means that the conditional and the unconditional measures extend in the same way to the infinite collection. Hence, if $P(A)>0$, this equality also applies to $A$ itself, and thus $P(A\,|\,A)=P(A)$. This implies that $P(A)^2=P(A)$, which, given $P(A)>0$, is only satisfied by $P(A)=1$.
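In symbols (my own restatement of this step, not Billingsley's notation): for every finite $F \subseteq S$ and every event $C\in\sigma(F)$, the independence assumption says$$
P(C \cap A) \;=\; P(C)\,P(A).
$$The two measures $C \mapsto P(C \cap A)$ and $C \mapsto P(C)\,P(A)$ therefore agree on all events generated by finitely many members of $S$, so by the uniqueness part of the extension theorem they agree on all of $\sigma(S)$. Taking $C = A$ then gives $P(A)=P(A)^2$ directly.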

So the core of the proof is that independence of finite selections implies independence in general. What makes Billingsley's discussion of the theorem appear a bit like black magic is that he first goes through a series of steps to define independence in the infinite case before he states the theorem. But this makes things more murky than they are in Kolmogorov's own statement of the theorem, and it hides the crucial limit argument at the heart of the proof.


An example of a "tail event" which is independent of all finite evidence is the occurrence of infinitely many of the events $A_1, A_2, A_3, \ldots$. The reason that this is independent of an event $B \in \sigma(A_1, A_2, \ldots, A_N)$ is that$$
\forall x\geq 1 \, \exists y\geq x \, : A_y
$$is logically equivalent to$$
\forall x\geq N \, \exists y\geq x \, : A_y.
$$The first statement trivially implies the second, and conversely any witness for $x=N$ also serves as a witness for every smaller $x$. Conditioning on an initial segment of the sequence thus does not change the probability of this event.

Note, however, that this is not generally the case for events of the type$$
\forall x\geq 1 \, \exists y\geq 1 \, : A_y.
$$It is only the "$y\geq x$" in the previous case that ensures the equivalence.

A statement like "For all $i$, $X_i$ will be even" is, for instance, not a tail event, since a finite segment can show a counterexample, e.g., $X_1=17$. Crucially, however, this example fails to be a tail event because the events "$X_i$ is even" (the inner quantification) can be written as a disjunction of finitely many simple events. We can thus give a counterexample to the outer quantification ($\forall i$) by exhibiting a single $i$ for which the negation of "$X_i$ is even" (which is a universal formula over those finitely many cases) is checkable in finite time.

Conversely, if this were not the case for any of the statements in the inner quantification, the event would be a tail event. That is, if the universal quantification were over a list of events which had no upper limit on the potential index of the verifier, then finite data could not falsify the statement. This happens when the existence of a single potential verifier implies the existence of infinitely many (as it does in the case of "infinitely often" statements, since any larger $y$ is an equally valid candidate verifier).

Events of the form $\exists y: A_y$ are also not tail events, since they are not independent of the counterexamples $\neg A_1, \neg A_2, \neg A_3\ldots$. They are, however, independent of any finite selection of positive events (which do not entail the negation of anything on the list).

We thus have a situation in which sets at the two lowest levels of the Borel hierarchy can have probabilities of any value in $[0,1]$, but as soon as we progress beyond that in logical complexity, only the values 0 and 1 are possible.


Oddly enough, this means that no probability theory is possible on complex formulas: When a probability measure is defined in terms of a set of simple events, then the $\Delta_2$ events always have probability 0 or 1. This property is conserved at higher orders in the hierarchy, since the quantifications that push us up the hierarchy are countable (and a countable union of null sets is a null set, by the union bound).

Note also that$$
\lim_{N\rightarrow \infty} \cup_{i=1}^{N} \cap_{j=1}^{N} A_{ij}
\;=\;  \cup_{i=1}^{\infty} \cap_{j=1}^{\infty} A_{ij},
$$and since $$
P\left( \cup_{i=1}^{\infty} \cap_{j=1}^{\infty} A_{ij} \right)  \;\in\;  \{0,1\},
$$the probabilities of the finite approximations must converge to these extremes. As we spend more computational power checking the truth of a tail event on a specific sample, we thus get estimates that approach 0 or 1 (although without any general guarantees of convergence speed).
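This convergence can also be seen numerically. Here is a small Monte Carlo sketch (my own toy example, using the strong-law event rather than the union-intersection expression above): for i.i.d. fair coin flips, the tail event "the running frequency of heads converges to $1/2$" has probability 1, and the probabilities of its finite approximations $\{|\mathrm{freq}_N - 1/2| < \varepsilon\}$ climb toward 1 as $N$ grows.

    import random

    # Monte Carlo sketch (assumes i.i.d. fair coin flips).  The tail event
    # "the running frequency of heads converges to 1/2" has probability 1;
    # the finite-dimensional approximations {|freq_N - 1/2| < eps} have
    # probabilities that approach 1 as N grows.
    def approx_prob(N, eps=0.05, trials=2000):
        hits = 0
        for _ in range(trials):
            heads = sum(random.random() < 0.5 for _ in range(N))
            if abs(heads / N - 0.5) < eps:
                hits += 1
        return hits / trials

    for N in (10, 100, 1000, 5000):
        print(N, approx_prob(N))   # the estimates increase toward 1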

This sheds some light on what the theorem actually means in practice, and how it relates to the zero-one theorem for finite models of first-order logic.

Tuesday, May 5, 2015

Billingsley: Probability and Measure (1979), ch. 2

In the chapter on probability measures, Patrick Billingsley points out a potential fallacy that I realize I have been making (pp. 24–25).

I have been thinking about the $\sigma$-field generated by a system $S$ as the limit of a series of $\sigma$-operations such as complements and countable unions. Such a recursive procedure for generating $\sigma(S)$ fails to generate all the required sets when, as I tacitly imagined, it is applied only countably many times.

This can be seen by means of a clever diagonal argument which involves constructing a surjection from the real unit interval onto the sets thus generated. The gist of the proof is that when we only use countably many recursive steps, we can index the resulting sets by sequences of integers, and this leaves the construction open to a diagonal counterexample.

The Playing Field

Let's first agree on a canonical definition of the Borel field $\mathcal{B}$ over $(0,1]$: For the present purposes, we will identify it as the $\sigma$-field generated by the intervals with rational endpoints, including the empty set.

This seed set $S$ is countable and can therefore be enumerated, for instance as
$\emptyset$, $(0,1]$, $(0,\frac{1}{2}]$, $(\frac{1}{2},1]$, $(0,\frac{1}{3}]$, $(\frac{1}{3},\frac{1}{2}]$, $(\frac{1}{3},1]$, $(0,\frac{1}{4}]$, $(\frac{1}{4},\frac{1}{3}]$, $(\frac{1}{4},\frac{1}{2}]$, …
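To make the countability of this seed set concrete, here is a small sketch (my own, and not exactly the ordering listed above) of a generator that eventually produces every interval $(a,b]$ with rational endpoints $0 \le a < b \le 1$:

    from fractions import Fraction
    from itertools import islice

    # Sketch: enumerate the intervals (a, b] with rational endpoints in (0, 1]
    # by sweeping through denominators; every such interval shows up eventually.
    def rational_intervals():
        yield None                        # stand-in for the empty set
        seen = set()
        q = 1
        while True:
            points = sorted({Fraction(p, r)
                             for r in range(1, q + 1)
                             for p in range(0, r + 1)})
            for a in points:
                for b in points:
                    if a < b and (a, b) not in seen:
                        seen.add((a, b))
                        yield (a, b)
            q += 1

    print(list(islice(rational_intervals(), 8)))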
For the purposes of this argument, let's further restrict the permitted $\sigma$-operations to
  1. complementation;
  2. countable unions.
These two operations can compactly be described as an application of the function
$$
\Psi(A_1,A_2,A_3,A_4,\ldots)\;=\; A_1^c \cup A_2 \cup A_3 \cup A_4 \cup \ldots
$$to a countable sequence of sets. We can then express complementation and countable union as
\begin{eqnarray}
A^c &=& \Psi(A, \emptyset, \emptyset, \emptyset, \ldots) \;=\; A^c \cup \emptyset \cup \emptyset \cup \emptyset \cup \ldots \\
\cup_{i=1}^{\infty}A_i &=& \Psi\big((0,1], A_1, A_2, A_3, \ldots\big) \;=\; \emptyset \cup A_1 \cup A_2 \cup A_3 \cup \ldots
\end{eqnarray}Any $\sigma$-operation is then a matter of applying $\Psi$ to some countable sequence of sets.
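As a toy illustration of this bookkeeping (my own sketch, not Billingsley's), one can represent subsets of $(0,1]$ by membership predicates and apply a finite truncation of $\Psi$:

    # Sketch: subsets of (0,1] represented as membership predicates, and a
    # finite truncation of Psi (the real Psi takes a countably infinite
    # sequence of sets).
    def interval(lo, hi):
        return lambda x: lo < x <= hi

    empty = lambda x: False
    full = interval(0, 1)

    def psi(sets):
        # Psi(A1, A2, A3, ...) = A1^c ∪ A2 ∪ A3 ∪ ...
        first, rest = sets[0], sets[1:]
        return lambda x: (not first(x)) or any(a(x) for a in rest)

    # Complementation: A^c = Psi(A, ∅, ∅, ...)
    a = interval(0, 0.5)
    a_comp = psi([a, empty, empty])
    print(a_comp(0.25), a_comp(0.75))               # False True

    # Union: A ∪ B = Psi((0,1], A, B, ∅, ...)
    b = interval(0.75, 1)
    a_or_b = psi([full, a, b])
    print(a_or_b(0.25), a_or_b(0.6), a_or_b(0.9))   # True False True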

The Elements of Construction

This allows us to describe our inductive procedure in terms of a recursive function $R$ that uses $\Psi$ as a subroutine. We define the base case as a projection:
$$
R_0(A_1,A_2,A_3,\ldots) \;=\; A_1.
$$In order to define the recursion step, we first need to recall that $\mathbb{N}$ and $\mathbb{N}\times \mathbb{N}$ are equipotent, as can be seen from the following array:
$$
\begin{array}{ccccc}
1 & 2 & 4 & 7 & \ldots \\
3 & 5 & 8 & 12 & \ldots \\
6 & 9 & 13 & 18 & \ldots \\
10 & 14 & 19 & 25 & \ldots \\
\vdots & \vdots & \vdots & \vdots &
\end{array}
$$This array can be used to map a countable sequence of sets onto a countable sequence of countable sequences. We use this unraveling trick for the recursive step:
$$
R_n(A_1,A_2,\ldots) \;=\; \Psi( R_{n-1}(A_1,A_2,A_4,\ldots), R_{n-1}(A_3,A_5,A_8,\ldots), \ldots )
$$This way, we can, by a careful selection of rational intervals $A_1,A_2,A_3,\ldots$, select any countable sequence of $R_{n-1}$-sets we like and feed that sequence into a new $\Psi$-construction.

This trick, then, transforms a sequence of rational intervals into an $R_n$-set. In other words, any level $n$ set can be obtained in a (nominally) single step by feeding an appropriate sequence of level 0 sets into the recursive function $R_n$.
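To make the unraveling concrete, here is a sketch (my own helper names) of the antidiagonal numbering shown in the array, together with the splitting of one sequence into the subsequences fed to $R_{n-1}$:

    # Sketch of the antidiagonal numbering in the array above: pair(i, j)
    # gives the single index assigned to row i, column j, and unpair inverts it.
    def pair(i, j):
        d = i + j - 1                    # index of the antidiagonal (1-based)
        return d * (d - 1) // 2 + i

    def unpair(n):
        d = 1
        while d * (d + 1) // 2 < n:      # find the antidiagonal containing n
            d += 1
        i = n - d * (d - 1) // 2
        return i, d - i + 1

    print(pair(2, 3), unpair(8))         # 8 (2, 3), matching the array

    def subsequence(seq, i, length):
        # the i-th subsequence handed to R_{n-1}: A_{pair(i,1)}, A_{pair(i,2)}, ...
        return [seq[pair(i, j) - 1] for j in range(1, length + 1)]

    print(subsequence(list(range(1, 26)), 1, 4))   # [1, 2, 4, 7]
    print(subsequence(list(range(1, 26)), 2, 4))   # [3, 5, 8, 12]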

From Sequences to Sets

We now finally put
$$
R(A_1,A_2,\ldots) \;=\; R_1(A_1,A_2,\ldots) \cup R_2(A_3,A_5,\ldots) \cup R_3(A_6,A_9,\ldots) \cup \ldots
$$This format covers all sets that can be obtained by any finite level of recursion.

For instance, if a set can be produced by level 17 recursion, we can choose the sequence $A_1,A_2,A_3,\ldots$ so that all $R_i$ except $R_{17}$ return the empty set, while $R_{17}$ returns the set we are interested in. This is achieved by constructing the sequence $A_1,A_2,A_3,\ldots$ so that it splits into just the right subsequences.

In other words, we now have a method for picking out any set that requires only finite recursion by specifying an appropriate sequence of rational intervals. Since there are only countably many rational intervals, the selection of a specific finite-level recursive set is therefore a matter of selecting an appropriate sequence of natural numbers.

Stone to Stone

The set of countably infinite sequences of positive integers is equipotent with the real unit interval.

Specifically, Billingsley suggests that we use the bijection $b$ which for its $k$th integer takes the distance between the $(k-1)$st and the $k$th occurrence of the digit 1 in the binary expansion of the number $\omega\in(0,1]$.

Under this bijection,
$$
b\left(\frac{1}{3}\right) \;=\; b\left(.010101\ldots\right) \;=\; (2,2,2,2,\ldots).
$$For this definition to work, there have to be infinitely many 1s in the binary expansion. This can be guaranteed by always choosing the non-terminating binary expansion of $\omega$, so that, for instance,
$$
b\left(\frac{1}{2}\right) \;=\; b\left(.0111\ldots\right) \;=\; (2,1,1,1,\ldots)
$$rather than $b(.1000\ldots)$, which has only a single 1 to work with. Incidentally, this choice also fixes the ambiguity problems we might otherwise have, making $b$ a true bijection.
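Here is a small sketch (my own implementation of the idea, using exact rational arithmetic, so it only handles rational inputs) that reads off the first few terms of $b(\omega)$ from the non-terminating binary expansion:

    from fractions import Fraction

    # Sketch: compute the first k gaps between successive 1s in the
    # non-terminating binary expansion of omega in (0,1].
    def b(omega, k):
        omega = Fraction(omega)
        assert 0 < omega <= 1
        gaps, last, pos = [], 0, 0
        while len(gaps) < k:
            pos += 1
            t = 2 * omega
            if t > 1:
                digit, omega = 1, t - 1
            elif t == 1:
                digit, omega = 0, Fraction(1)   # pick the non-terminating expansion
            else:
                digit, omega = 0, t
            if digit == 1:
                gaps.append(pos - last)
                last = pos
        return gaps

    print(b(Fraction(1, 3), 5))   # [2, 2, 2, 2, 2]
    print(b(Fraction(1, 2), 5))   # [2, 1, 1, 1, 1]
    print(b(1, 5))                # [1, 1, 1, 1, 1]

This reproduces the examples above, including the choice of the non-terminating expansion for $\omega = 1/2$.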

Because of the construction above, the output of this function can be fed into another function that translates an infinite sequence of integers into an infinite sequence of rational intervals, which is then translated into a recursive set. In other words, we can map any real number $\omega\in(0,1]$ onto a finite-recursion set $s(\omega)$. This establishes a surjection from the unit interval onto the system of sets that can be obtained by finitely many $\sigma$-operations on the rational intervals.

The Diagonal

The set we now need for our diagonal counterargument is
$$
B = \{\omega\in (0,1]: \omega \not\in s(\omega)\}.
$$This set cannot be obtained by finite application of $\sigma$-operations to the rational intervals.

If it could, there would be at least one $\omega_0\in (0,1]$ such that $B=s(\omega_0)$. But by construction, $\omega_0\in B$ holds if and only if $\omega_0 \not\in s(\omega_0)$; so $B$ and $s(\omega_0)$ must disagree at least about the element $\omega_0$. This contradicts the assertion that $B=s(\omega_0)$. Hence, $B$ cannot be obtained by a finite number of recursive operations.

On the other hand, $B$ is a Borel set. Billingsley gives a complicated argument for this which is a notational nightmare, but I think he is just expressing a proof by induction.

Proof Sketch

The first step of the argument introduces a building block called $D_k$. This is the set of real numbers that are members of $A_k(\omega)$, the $k$th interval on their own list. For instance, $\omega=1/3$ is mapped onto the constant interval list
$$
(0,1], (0,1], (0,1], (0,1], \ldots
$$since $b(1/3)=(2,2,2,\ldots)$, and $(0,1]$ is the second interval on the list. It thus follows that $1/3\in A_k(1/3)$ for all $k$, and thus $1/3 \in D_k$ for all $k$.

The sets $D_k$ are Borel sets. This follows from the fact that the set of binary numbers with a specific $k$th waiting time is a Borel set, and that any rational interval is a Borel set.

The second part of the argument establishes, for an arbitrary integer sequence $m_1, m_2, m_3, \ldots$, an identity between the set
$$
\{\omega :\; \omega \in R_n(A_{m_1}(\omega), A_{m_2}(\omega), A_{m_3}(\omega), \ldots)\}
$$and the set
$$
R_n(D_{m_1}, D_{m_2}, D_{m_3}, \ldots).
$$This equality is proven by induction on $n$.

Using this translation, the set $B^c = \{\omega : \omega\in s(\omega)\}$ can then be rewritten as a countable union of sets of this form. This shows that $B$ is indeed a Borel set, since the $D_k$ are.

Friday, April 10, 2015

A note on the Kolmogorov-Dynkin extension theorem

I've been somewhat confused about Kolmogorov's extension theorem, the uniqueness theorem for stochastic processes, but this note by Cosma Shalizi cleared things up a little bit.

Here's the set-up: We have a universe of values, $\Omega$, and this gives rise to a universe of sample paths, $\Omega^\mathbb{N}$. This universe of sample paths can be equipped with a probability measure. Such a measure maps bundles of sample paths, $B \subseteq \Omega^\mathbb{N}$, onto probabilities.

The measure has to respect the usual restrictions, including countable additivity. However, it doesn't have to (and in fact can't) accept all bundles as inputs.

The set of measurable bundles is defined inductively on the basis of one-dimensional base cases and some recursive operations. Specifically, a bundle is measurable if it is
  1. the preimage under a coordinate projection of some measurable set $A \subseteq \Omega$;
  2. constructed out of other measurable bundles by means of complementation, finite intersection, or countable union.
These rules define a system $S$.

We could also define a smaller system, $S^{-}$, by only allowing finite unions. A bundle in this smaller system corresponds to a proposition about a finite-dimensional projection of the stochastic process. If we take the $\sigma$-algebra generated by the smaller system, we recover the full system, $\sigma(S^{-})=S$.

The content of the Daniell-Kolmogorov extension theorem can now be stated as follows: Suppose that two measures $P$ and $Q$ assign the same values to all bundles in $S^{-}$; then they assign the same values to all bundles in $S$. This is equivalent to saying that if two stochastic processes have the same distribution under every finite-dimensional projection, then they have the same distribution.
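In symbols (my paraphrase, writing $\pi_i$ for the $i$th coordinate projection, which is not notation used in the post): the hypothesis amounts to saying that for every finite set of coordinates $i_1 < i_2 < \ldots < i_n$ and all measurable sets $A_1, \ldots, A_n \subseteq \Omega$,$$
P\big(\pi_{i_1}\in A_1, \ldots, \pi_{i_n}\in A_n\big) \;=\; Q\big(\pi_{i_1}\in A_1, \ldots, \pi_{i_n}\in A_n\big),
$$and the conclusion is that $P$ and $Q$ agree on every bundle in $S$.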

Not surprisingly, the proof involves some facts about the limit behavior of measures on increasing sets. Specifically, when a series of sets $B_1, B_2, B_3, \ldots$ converges to some limit set $B$ from below, then the measures $\mu(B_1), \mu(B_2), \mu(B_3), \ldots$ must converge to the limit $\mu(B)$. This implies the uniqueness of the infinite-dimensional extension of a set of finite-dimensional projections.

Here is the proof as given by Shalizi: Let $W$ be the set of bundles under which $P$ and $Q$ agree. Then by assumption, $S^{-} \subseteq W \subseteq S$. We want to show that $\sigma(S^{-}) = S  \subseteq W$. We do this by showing that $W$ is a Dynkin system, and that $S^{-}$ is closed under finite intersection, since this will allow us to use Dynkin's $\lambda$-$\pi$ theorem.

A Dynkin system is a system of sets which contains the whole universe, is closed under complementation, and closed under disjoint, countable unions (or equivalently: increasing, countable unions). The set agreement system $W$ satisfies these three properties, since
  1. $P(\Omega^{\mathbb{N}})=Q(\Omega^{\mathbb{N}})=1$, so $P$ and $Q$ agree on the bundle consisting of all sample paths;
  2. For any bundle $B\in W$, $$P(B^c)=1-P(B)=1-Q(B)=Q(B^c),$$ so if $P$ and $Q$ agree on a bundle, they also agree on its complement;
  3. Finally, if $P$ and $Q$ agree on every bundle in an increasing series of sets $B_1, B_2, B_3, \ldots $, then the two series $$P(B_1), P(B_2), P(B_3), \ldots$$ $$Q(B_1), Q(B_2), Q(B_3), \ldots$$must have the same limit; by continuity from below, these limits are $P(B)$ and $Q(B)$, where $B$ is the union of the bundles. Hence, whenever $P$ and $Q$ agree on an increasing list of bundles, they also agree on the set-theoretic limit of that series.
Thus, $W$ is a Dynkin system. This means that it is large enough to contain all the additional sets we put into the system $S^{-}$ by closing it under $\sigma$-operations. In other words, $\sigma(S^{-}) \subseteq W$. We thus have both $S \subseteq W$ and $W \subseteq S$ (which is true by definition), and thus $W=S$.

Here's an intuitive version of that proof:

The system $S$ on which $P$ and $Q$ are defined must itself be defined by certain recursive operations: Every set in $S$ is constructed from a finite-dimensional base case using only well-behaved operations.

The properties of probability measures guarantee that we can match each of these constructions, operation for operation and step by step, by a corresponding axiom which enforces a certain unique result. Because of this parallelism between the construction of the sample space and the propagation of necessity, no measurable set in $S$ actually has any wiggle room when the base cases are identical.

Thursday, March 27, 2014

Walters: An Introduction to Ergodic Theory (1982), p. 26

Section 1.4 of Peter Walters' textbook on ergodic theory contains a proof of Poincaré's recurrence theorem. I found it a little difficult to read, so I'll try to paraphrase the proof here using a vocabulary that might be a bit more intuitive.

The Possible is the Necessarily Possible

The theorem states the following: If
  1. $X$ is a probability space,
  2. $T: X \to X$ is a measure-preserving transformation of $X$,
  3. $E$ is an event with positive probability,
  4. and $x$ is a point in $E$,
then the series
$$
x, \; Tx, \; T^2x, \; T^3x, \; T^4x, \; \ldots
$$will almost certainly pass through $E$ infinitely often. Or: if it happens once, it will happen again.

The idea behind the proof is to describe the set $R$ of points that visit $E$ infinitely often as the superior limit of a series of sets. This description can then be used to show that $E \cap R$ has the same measure as $E$. This will imply that almost all points in $E$ revisit $E$ infinitely often.

Statement and proof of the theorem; scan from page 26 in Walters' book.

I'll try to spell this proof out in more detail below. My proof is much, much longer than Walters', but hopefully this means that it's much, much more readable.

Late Visitors

Let $R_i$ be the set of points in $X$ that visit $E$ after $i$ or more applications of $T$. We can then make two observations about the series $R_0, R_1, R_2, R_3, \ldots$:

First, if $j > i$ and you visit $E$ at time $j$ or later, then you also visit $E$ at time $i$ or later. The $R_i$'s are consequently nested inside each other:
$$
R_0 \supset R_1 \supset R_2 \supset R_3 \supset \ldots
$$Let's use the name $R$ for the limit of this series (that is, the intersection of the sets). $R$ then consists of all the points in $X$ that visit $E$ infinitely often.

The series of sets is downward converging.

Second, $R_i$ contains the points that visit $E$ at time $i$ or later, and the transformation $T^{-1}$ takes us one step back in time. The set $T^{-1}R_i$ consequently contains the points in $X$ that visit $E$ at time $i + 1$ or later. Thus
$$
T^{-1}R_i \;=\; R_{i+1}.
$$But since we have assumed that $T$ is measure-preserving, this implies that
$$
m(R_i) \;=\; m(T^{-1}R_i) \;=\; m(R_{i+1}).
$$By induction, every set in the series thus has the same measure:
$$
m(R_0) \;=\; m(R_1) \;=\; m(R_2) \;=\; m(R_3) \;=\; \ldots
$$Or to put it differently, the discarded parts $R_0\setminus R_1$, $R_1\setminus R_2$, $R_2\setminus R_3$, etc., are all null sets.

Intersection by Intersection

So we have that
  1. in set-theoretic terms, the $R_i$'s converge to a limit $R$ from above;
  2. but all the $R_i$'s have the same measure.
Let's use these facts to show that $m(E \cap R) = m(E)$, that is, that we only throw away a null set by intersecting $E$ with $R$.

The event $E$ and the set $R$ of points that visit $E$ infinitely often.

To prove this, notice first that every point in $E$ visits $E$ after zero applications of $T$. Thus, $R_0 \supset E$, or in other words, $E \cap R_0 = E$. Consequently,
$$
m(E \cap R_0) \;=\; m(E).
$$We now need to extend this base case by a limit argument to show that
$$
m(E \cap R) \;=\; m(E).
$$But, as we have seen above, the difference between $R_0$ and $R_1$ is a null set. Hence, the difference between $E \cap R_0$ and $E \cap R_1$ is also a null set, so
$$
m(E \cap R_0) \;=\; m(E \cap R_1).
$$This argument holds for any $i$ and $i + 1$. By induction, we thus get
$$
m(E \cap R_0) \;=\; m(E \cap R_1) \;=\; m(E \cap R_2) \;=\; m(E \cap R_3) \;=\; \ldots
$$

A visit to $E$ before time $i$ but never after has probability 0.

Since measures respect limits (here: continuity from above along the decreasing sequence $E \cap R_i$), this implies that
$$
m(E \cap R_0) \;=\; m(E \cap R).
$$And since we have already seen that $m(E \cap R_0) = m(E)$, it follows that
$$
m(E \cap R) \;=\; m(E).
$$
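As a sanity check of the conclusion (my own toy example, not from Walters), one can simulate a measure-preserving map and count returns. The irrational rotation $Tx = (x + \alpha) \bmod 1$ preserves Lebesgue measure on $[0,1)$, and orbits started in $E = [0, 0.1)$ keep coming back to $E$:

    import math

    # Simulation sketch: the irrational rotation T(x) = (x + alpha) mod 1
    # preserves Lebesgue measure; count how often an orbit started in
    # E = [0, 0.1) returns to E.
    alpha = math.sqrt(2) - 1

    def in_E(x):
        return 0.0 <= x < 0.1

    def visits(x, steps):
        count = 0
        for _ in range(steps):
            if in_E(x):
                count += 1
            x = (x + alpha) % 1.0
        return count

    for x0 in (0.0, 0.05, 0.099):
        print(x0, visits(x0, 100_000))   # each orbit visits E roughly 10% of the time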

The Wherefore

An informal explanation of what's going on in this proof might be the following:

We are interested in the conditional probability of visiting $E$ infinitely often given that we have visited it once, that is, $\Pr(R \mid E)$. In order to compute this probability, we divide the possible sample paths into an infinite number of cases and discover that, conditional on $E$, all but one of them have probability 0.

If you imagine yourself walking along a sample path, your route will fall into one of the following categories:
  • you never visit $E$;
  • you visit $E$ for the last time at time $i = 0$;
  • you visit $E$ for the last time at time $i = 1$;
  • you visit $E$ for the last time at time $i = 2$;
  • you visit $E$ for the last time at time $i = 3$;
  • there is no last time — i.e., you visit $E$ infinitely often.
When we condition on $E$, the first of these cases has probability 0.

In general, the fact that $T$ is measure-preserving guarantees that visiting $E$ at some time $\geq i$ but never at any time $\geq i+1$ has probability 0 (this is just the statement that $R_i \setminus R_{i+1}$ is a null set); consequently, each of the "last visit at time $i$" cases also has probability 0.

We thus have to conclude that the last option — infinitely many visits to E — has the same probability as visiting E once, and thus a conditional probability of 1.