## Wednesday, September 9, 2015

### Billingsley: Probability and Measure, Ch. 1.4

Prompted by what I found to be quite confusing about Patrick Billingsley's presentation of Kolmogorov's zero-one law, I've been reflecting a bit on the essence of the proof, and I think I've come to a deeper understand of the core of the issue now.

The theorem can be stated as follows: Let $S$ be an infinite collection of events equipped with a probability measure $P$ (which, by Kolmogorov's extension theorem, can be defined exhaustively on the finite subsets of $S$). Suppose further that $A$ is some event in the $\sigma$-extension of $S$ which is independent of any finite selection of events from $S$. Then $A$ either has probability $P(A)=0$ or $P(A)=1$.

The proof is almost evident from this formulation: Since $A$ is independent of any finite selection of events from $S$, the measures $P(\,\cdot\,)$ and $P(\,\cdot\,|\,A)$ coincide on all events that can be defined in terms of a finite number of events from $S$. But by Kolmogorov's extension theorem, this means that the conditional and the unconditional measures extend the same way to the infinite collection. Hence, if P$(A)>0$, this equality also applies to $A$, and thus $P(A\,|\,A)=P(A)$. This implies that $P(A)^2=P(A)$, which is only satisfied by $P(A)=1$.

So the core of the proof is that independence of finite selections implies independence in general. What makes Billingsley's discussion of the theorem appear a bit like black magic is that he first goes through a series of steps to define independence in the infinite case before he states the theorem. But this makes things more murky than they are in Kolmogorov's own statement of the theorem, and it hides the crucial limit argument at the heart of the proof.

An example of a "tail event" which is independent of all finite evidence is the occurrence of infinitely many of the events $A_1, A_2, A_3, \ldots$. The reasons that this is independent of an event $B \in \sigma(A_1, A_2, \ldots, A_N)$ is that$$\forall x\geq 1 \, \exists y\geq x \, : A_y$$is logically equivalent to$$\forall x\geq N \, \exists y\geq x \, : A_y.$$Conditioning on this some initial segment thus does not change the probability of this event.

Note, however, that this is not generally the case for events of the type$$\forall x\geq 1 \, \exists y\geq 1 \, : A_y.$$It is only the "$y\geq x$" in the previous case that ensures the equivalence.

A statement like "For all $i$, $X_i$ will be even" is for instance not a tail event, since a finite segment can show a counterexample, e.g., $X_1=17$. Crucially, however, this example fails to be a tail event because the events "$X_i is even$" (the inner quantification) can be written as a disjunctions of finitely many simple events. We can thus give a counterexample to the outer quantification ($\forall i$) by exhibiting a single $i$ for which the negation of "$X_i$ is even" (which is a universal formula) is checkable in finite time.

Reversely, if this were not the case for any of the statements in the inner loop, the event would be a tail event. That is, if the universal quantification were over a list of events which had no upper limit on the potential index of the verifier, then finite data could not falsify the statement. This happens when the existence of a single potential verifier implies the existence of infinitely many (as it does in the case of "infinitely often" statements, since any larger $y$ is an equally valid candidate verifier).

Events of the form $\exists y: A_y$ are also not tail events, since they are not independent of the counterexamples $\neg A_1, \neg A_2, \neg A_3\ldots$. They are, however, independent of any finite selection of positive events (which do not entail the negation of anything on the list).

We thus have a situation in which sets at the two lowest levels of the Borel hierarchy can have probabilities of any value in $[0,1]$, but as soon as we progress beyond that in logical complexity, only the values 0 and 1 are possible.

 Illustration by Yoni Rozenshein.

Oddly enough, this means that no probability theory is possible on complex formulas: When a probability measure is defined in terms of a set of simple events, then the $\Delta_2$ events always have probability 0 or 1. This property is conserved at higher orders in the hierarchy, since the quantifications that push us up the hierarchy are countable (and a countable union of null sets is a null set, by the union bound).

Note also that$$\lim_{N\rightarrow \infty} \cup_{i=1}^{N} \cap_{j=1}^{N} A_{ij} \;=\; \cup_{i=1}^{\infty} \cap_{j=1}^{\infty} A_{ij},$$and since $$P\left( \cup_{i=1}^{\infty} \cap_{j=1}^{\infty} A_{ij} \right) \;\in\; \{0,1\},$$the probabilities of the finite approximations must converge to these extremes. As we spend more computational power checking the truth of a tail event on a specific sample, we thus get estimates that approach 0 or 1 (although without any general guarantees of convergence speed).

This sheds some light on what the theorem actually means in practice, and how it relates to the zero-one theorem for finite models of first-order logic.