Thursday, October 22, 2015

Bell: "Bertlmann's Socks and the Nature of Reality" (1980)

There seems to be a lot of mystery surrounding Bell's theorem. Scholarpedia has a whole section devoted to "controversy and common misunderstandings" surrounding the argument, and a recent xkcd comic took up the topic (without mentioning any specific misunderstanding).

Source: xkcd.com/1591

I've also been told in person that I understood the theorem wrong. So it seems about time for some studying.

This time, rather than pick up Bell's original article, I read this more popular account of the argument, which covers more or less the same ground. If I understand it correctly, the argument is actually simpler than I first thought, although my hazy understanding of the physics stood in the way of extracting its purely statistical part.

Background

Here's what I take to be the issue: We have a certain experiment in which two binary observables, $A$ and $B$, follow conditional distributions that depend on two control variables, $a$ and $b$:\begin{eqnarray}
a &\longrightarrow& A \\
b &\longrightarrow& B
\end{eqnarray}Although the experiment is designed to prevent statistical dependencies between $A$ and $B$, we still observe a marked correlation between them for many settings of $a$ and $b$. This has to be explained somehow, either by postulating
  • an unobserved common cause: $\lambda\rightarrow A,B$;
  • an observed common effect: $A,B \rightarrow \gamma$ (i.e., a sampling bias);
  • or a direct causal link: $A \leftrightarrow B$.
The purpose of Bell's paper is to rule out the most plausible and attractive of these three options, the hidden common cause. This explanation is ruled out by showing that the observed dependence exceeds a bound which is logically necessary under any explanation of that type.

Measurable Consequences

The measure in question is the following:
\begin{eqnarray}
C(a,b) &=& +P(A=1,B=1\,|\,a,b) \\
           &   & +P(A=0,B=0\,|\,a,b) \\
           &  &  -P(A=1,B=0\,|\,a,b) \\
           &  &  -P(A=0,B=1\,|\,a,b).
\end{eqnarray}This statistic is related to the correlation coefficient between $A$ and $B$, but differs from it in that the marginal probabilities $P(A)$ and $P(B)$ are not subtracted out. It evaluates to $+1$ if and only if the two are perfectly correlated, and to $-1$ if and only if they are perfectly anti-correlated.

Contours of $C(a,b)$ when $A$ and $B$ are independent with $x=P(A)$ and $y=P(B)$.
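
To make the statistic concrete, here is a minimal sketch in Python (NumPy assumed; the function name corr_stat is mine) that computes $C(a,b)$ from a 2×2 joint table and checks the independent case against the product form behind the contour plot above:

```python
import numpy as np

def corr_stat(joint):
    """C = P(A = B) - P(A != B) for a 2x2 table with joint[i, j] = P(A=i, B=j)."""
    return joint[0, 0] + joint[1, 1] - joint[0, 1] - joint[1, 0]

# Independent case: C factors as (2x - 1)(2y - 1), the form behind the contours.
x, y = 0.8, 0.3                            # x = P(A=1), y = P(B=1)
joint = np.outer([1 - x, x], [1 - y, y])   # joint[i, j] = P(A=i) * P(B=j)
print(corr_stat(joint), (2 * x - 1) * (2 * y - 1))   # both print -0.24
```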

In a certain type of experiment, where $a$ and $b$ are angles of two magnets used to reveal something about the spin of a particle, quantum mechanics predicts that
$$
C(a,b) \;=\; -\cos(a-b).
$$When the control variables differ only a little, $A$ and $B$ are thus strongly anti-correlated; when they differ by close to $\pi$ (i.e., point to opposite sides of the unit circle), $A$ and $B$ are strongly correlated. This is a prediction based on physical considerations.

Bounds on Joint Correlations

However, let's stick with the pure statistics a bit longer. Suppose again $A$ depends only on $a$, and $B$ depends only on $b$, possibly given some fixed, shared background information which is independent of the control variables.

The statistical situation when the background information is held constant.

Then $C(a,b)$ can be expanded to
\begin{eqnarray}
C(a,b) &=& +P(A=1\,|\,a) \, P(B=1\,|\,b) \\
           &   & +P(A=0\,|\,a) \, P(B=0\,|\,b) \\
           &   & - P(A=1\,|\,a) \, P(B=0\,|\,b) \\
           &   & - P(A=0\,|\,a) \, P(B=1\,|\,b) \\
           &=& [P(A=1\,|\,a) - P(A=0\,|\,a)] \times [P(B=1\,|\,b) - P(B=0\,|\,b)],
\end{eqnarray}that is, the product of two statistics which measure how stochastic the variables $A$ and $B$ are given the control parameter settings. Using obvious abbreviations,
$$
C(a,b) \; = \; (A_1 - A_0) (B_1 - B_0),
$$and thus
\begin{eqnarray}
C(a,b) + C(a,b^\prime) &=&
     (A_1 - A_0) (B_1 - B_0 + B_1^\prime - B_0^\prime)
     \;\leq\; B_1 - B_0 + B_1^\prime - B_0^\prime; \\
C(a^\prime,b) - C(a^\prime,b^\prime) &=& (A_1^\prime - A_0^\prime) (B_1 - B_0 - B_1^\prime + B_0^\prime)
     \;\leq\; B_1 - B_0 - B_1^\prime + B_0^\prime,
\end{eqnarray}both times using that the first factor lies in $[-1,1]$. It follows that
$$
C(a,b) + C(a,b^\prime) + C(a^\prime,b) - C(a^\prime,b^\prime) \;\leq\; 2(B_1 - B_0) \;\leq\; 2.
$$Since $(B_1 - B_0)\geq-1$, a similar derivation shows that
$$
| C(a,b) + C(a,b^\prime) + C(a^\prime,b) - C(a^\prime,b^\prime) | \;\leq\; 2|B_1 - B_0| \;\leq\; 2.
$$In fact, all 16 variants of this inequality, with the signs alternating in all possible ways, can be derived using the same idea.
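
As a brute-force sanity check on this family of inequalities, here is a small sketch (my own, assuming the factorized form $C(a,b) = (A_1 - A_0)(B_1 - B_0)$) that draws random response probabilities for the four settings and evaluates every sign pattern with an odd number of minus signs:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def C(pA1, pB1):
    """C(a,b) = (A_1 - A_0)(B_1 - B_0) for conditionally independent responses."""
    return (2 * pA1 - 1) * (2 * pB1 - 1)

worst = 0.0
for _ in range(10_000):
    pa, pa2, pb, pb2 = rng.random(4)          # response probs for a, a', b, b'
    terms = [C(pa, pb), C(pa, pb2), C(pa2, pb), C(pa2, pb2)]
    for signs in itertools.product([1, -1], repeat=4):
        if np.prod(signs) == -1:              # the CHSH-type sign patterns
            worst = max(worst, abs(sum(s * t for s, t in zip(signs, terms))))

print(worst)   # stays below 2
```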

Violations of Those Bounds

But now look again at
$$
C(a,b) \;=\; -\cos(a-b).
$$We then have, for $(a,b,a^\prime,b^\prime)=(0,\pi/4,\pi/2,-\pi/4)$,
$$
\left| C\left(0, \frac{\pi}{4}\right) + C\left(0, -\frac{\pi}{4}\right) + C\left(\frac{\pi}{2}, \frac{\pi}{4}\right) - C\left(\frac{\pi}{2}, -\frac{\pi}{4}\right) \right|  \;=\;  2\sqrt{2},
$$which is indeed outside the interval $[-2,2]$. $C$ can thus not be of the predicted functional form and at the same time satisfy the bound on the correlation statistics. Something's gotta give.
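
Plugging in the numbers is a two-line check (a sketch using the quantum-mechanical form of $C$):

```python
import numpy as np

def C(a, b):
    return -np.cos(a - b)

a, b, a2, b2 = 0, np.pi / 4, np.pi / 2, -np.pi / 4
S = C(a, b) + C(a, b2) + C(a2, b) - C(a2, b2)
print(S, 2 * np.sqrt(2))   # S = -2.828..., so |S| = 2*sqrt(2) > 2
```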

Introducing Hidden Variables

This entire derivation relied on $A$ and $B$ depending on nothing other than their own private control variables, $a$ and $b$.

However, suppose that a clever physicist proposes to explain the dependence between $A$ and $B$ by postulating some unobserved hidden cause influencing them both. There is then some stochastic variable $\lambda$ which is independent of the control variables, yet causally influences both $A$ and $B$.

The statistical situation when the background information varies stochastically.

However, even if this is the case, we can go through the entire derivation above, adding "given $\lambda$" to every single step. As long as we condition on a fixed value of $\lambda$, each of the steps still holds. But since the inequality is then valid for every single value of $\lambda$, it is also valid in expectation, and we can integrate $\lambda$ out; the result is that even under such a "hidden variable theory," the inequality still holds.

Hence, the statistical dependency cannot be explained by a shared cause alone: no such model can reproduce the predicted form of the correlation statistic $C(a,b)$. We will therefore need to postulate either a direct causal link between $A$ and $B$ or an observed downstream variable (sampling bias) instead.

Note that the only thing we really need to prove this result is the assumption that the probability $P(A,B \, | \, a,b,\lambda)$ factors into the product $P(A \, | \, a,b,\lambda)\, P(B \, | \, a,b,\lambda)$. This corresponds to the assumption that there is no direct causal connection between $A$ and $B$.
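
To see the conditioning argument in action, here is a Monte Carlo sketch of one made-up local hidden-variable model: $\lambda$ is a uniformly random angle, and each side's response probability depends only on its own setting and $\lambda$. The model reproduces the $-\cos(a-b)$ shape, but only at half the amplitude, and its CHSH combination stays inside $[-2,2]$:

```python
import numpy as np

rng = np.random.default_rng(1)

def pA(a, lam):
    """P(A=1 | a, lambda): an arbitrary local response curve (my invention)."""
    return 0.5 * (1 + np.cos(a - lam))

def pB(b, lam):
    """P(B=1 | b, lambda)."""
    return 0.5 * (1 - np.cos(b - lam))

def C(a, b, lams):
    # Conditional independence given lambda: the statistic factorizes pointwise
    # in lambda, and we then average the product over lambda.
    return np.mean((2 * pA(a, lams) - 1) * (2 * pB(b, lams) - 1))

lams = rng.uniform(0, 2 * np.pi, 1_000_000)   # the hidden common cause
a, b, a2, b2 = 0, np.pi / 4, np.pi / 2, -np.pi / 4
S = C(a, b, lams) + C(a, b2, lams) + C(a2, b, lams) - C(a2, b2, lams)
print(S)   # about -1.41: this model yields -cos(a-b)/2, safely inside [-2, 2]
```

This is of course just one model; the point of the derivation above is that no choice of response curves and no distribution for $\lambda$ can push the combination past 2.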

Wednesday, September 16, 2015

Billingsley: Ergodic Theory and Information (1965), pp. 12–14

In the process of proving the ergodic theorem, Billingsley also provides, in passing, a proof of a weaker theorem about mixing transformations: for a class of processes that satisfy a certain limiting independence condition, visiting frequencies converge to probabilities.

Definitions

More precisely, a time shift $T$ is said to be mixing (p. 12) if$$
P(A \cap T^{-n}B)  \;  \rightarrow  \;  P(A)P(B)
$$for all measurable sets $A$ and $B$. The process is thus mixing if information about the present is almost entirely independent of the far future. (A sample path $\omega$ belongs to the set $T^{-n}B$ if $\omega$ will visit $B$ after $n$ time shifts forward.) Billingsley doesn't comment on the choice of the term "mixing," but I assume it should be read as an ill-chosen synonym for "shuffling" in this context.

We will now be interested in the behavior of the visiting frequencies$$
P_n(A) \; := \;  \frac{1}{n} \sum_{k=0}^{n-1} \mathbb{I}_A(T^k\omega),
$$where $\mathbb{I}_A$ is the indicator function of some bundle $A$ of sample paths. This quantity is a random variable with a distribution defined by the distribution on the sample paths $\omega$.

Note that the traditional notion of visiting frequency (i.e., how often is the ace of spades on top of the deck?) is a special case of this concept of visiting frequency, in which the bundle $A$ is defined entirely in terms of the 0th coordinate of the sample path $\omega = (\ldots, \omega_{-2}, \omega_{-1}, \omega_{0}, \omega_{1}, \omega_{2}, \ldots)$. In the general case, the question of whether $\omega\in A$ could involve information stored at arbitrary coordinates of $\omega$ (today, tomorrow, yesterday, or any other day).

Theorem and Proof Strategy

The mixing theorem now says the following: Suppose that $P$ is a probability measure on the set of sample paths, and suppose that the time shift $T$ is mixing and measure-preserving with respect to this measure. Then$$
P_n(A)  \;  \rightarrow  \;  P(A)
$$in probability.

The proof of this claim involves two things:
  1. Showing that $E[P_n(A)] = P(A)$;
  2. Showing that $Var[P_n(A)] \rightarrow 0$ for $n \rightarrow \infty$.
These two facts together imply convergence in probability (by Chebyshev's inequality, i.e., Markov's inequality applied to the squared deviation).
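
Before going through the two steps, here is a quick simulation sketch (mine, not Billingsley's) with an i.i.d. coin-flip process, which is certainly mixing, and a bundle $A$ defined in terms of two coordinates of the sample path:

```python
import numpy as np

rng = np.random.default_rng(0)

# A = "the current flip is heads and the next one is tails"; P(A) = 1/4.
n = 100_000
omega = rng.integers(0, 2, n + 1)              # one long sample path
hits = (omega[:-1] == 1) & (omega[1:] == 0)    # I_A(T^k omega) for k = 0..n-1
P_n = np.cumsum(hits) / np.arange(1, n + 1)    # the visiting frequencies
print(P_n[-1])   # close to 0.25
```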

Identity in Mean

The time shift $T$ is assumed to preserve the measure $P$. We thus have $$
E[f(\omega)] \; = \; E[f(T\omega)] \; = \; E[f(T^2\omega)] \; = \; \cdots
$$for any measurable function $f$. It follows that$$
E[\mathbb{I}_A(\omega)] \; = \;
E[\mathbb{I}_A(T\omega)] \; = \;
E[\mathbb{I}_A(T^2\omega)] \; = \; \cdots
$$and that these are all equal to $P(A)$. By the linearity of expectations, we therefore get that $E[P_n(A)] = P(A)$.

This proves that $P_n(A)$ at least has the right mean. Convergence to any other value than $P(A)$ is therefore out of the question, and it now only remains to be seen that the variance of $P_n(A)$ also goes to 0 as $n$ goes to infinity.

Vanishing Variance: The Idea

Consider therefore the variance$$
Var[P_n(A)]  \; = \;  E\left[ \left(P_n(A) - P(A)\right)^2 \right].
$$In order to expand the square under this expectation, it will be helpful to break the term $P(A)$ up into $n$ equal chunks and then write the whole thing as a sum:$$
P_n(A) - P(A)  \; = \;  \frac{1}{n} \sum_{k=0}^{n-1} (\mathbb{I}_A(T^k\omega) - P(A)).
$$If we square this sum of $n$ terms and take the expectation of the expansion, we will get a sum of $n^2$ cross-terms, each of which is an expected value of the form$$
Cov(i,j)  \; := \;  E\left[ \left(\mathbb{I}_A(T^i\omega) - P(A)\right) \times \left (\mathbb{I}_A(T^j\omega) - P(A)\right) \right].
$$By expanding this product and using that $E[\mathbb{I}_A(T^k\omega)]=P(A)$, we can find that$$
Cov(i,j)  \; = \;  E\left[\mathbb{I}_A(T^i\omega) \times \mathbb{I}_A(T^j\omega)\right] - P(A)^2.
$$The random variable $\mathbb{I}_A(T^i\omega) \times \mathbb{I}_A(T^j\omega)$ takes the value 1 on $\omega$ if and only if $\omega$ will visit $A$ after both $i$ and $j$ time steps. Hence,$$
Cov(i,j)  \; = \;  P(T^{-i}A \cap T^{-j}A)  -  P(A)^2,
$$and we can write$$
Var[P_n(A)]  \; = \;  \frac{1}{n^2} \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} Cov(i,j).
$$The goal will now be to show that this sum of $n \times n$ terms, scaled by $1/n^2$, converges to $0$ as we increase $n$.

Vanishing Variance: The Steps

Since $T$ is assumed to be measure-preserving, the value of $Cov(i,j)$ is completely determined by the distance between $i$ and $j$, not their absolute values. We thus have$$
Cov(i,j)  \; = \;  Cov(n+i, n+j)
$$for all $i$, $j$, and $n$.

By the mixing assumption, $Cov(0,j) \rightarrow 0$ for $j \rightarrow \infty$. Together with the observation about $|i-j|$, this implies that $$Cov(i,j) \;\rightarrow\; 0$$ whenever $|i-j| \rightarrow \infty$. If we imagine the values of  $Cov(i,j)$ tabulated in a huge, square grid, the numbers far away from the diagonal are thus tiny. We will now sum up the value of these $n^2$ numbers $Cov(i,j)$ by breaking them into $n$ classes according to the value of $d = |i-j|$.

To do so, we upper-bound the number of members in each class by $2n$: There are for instance $n$ terms with $|i-j|=0$, and $2(n-1)$ terms with $|i-j|=1$. (This upper-bound corresponds graphically to covering the $n\times n$ square with a tilted $n\sqrt{2} \times n\sqrt{2}$ square, except for some complications surrounding $d=0$.)

We thus have$$
Var[P_n(A)]
\; = \;
\frac{1}{n^2} \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} Cov(i,j)
\;  \leq  \;
\frac{2}{n} \sum_{d=0}^{n-1} |Cov(0,d)|.
$$This last quantity is (twice) the average of a list of numbers that converges to $0$, and such an average must itself converge to $0$ as $n\rightarrow \infty$ (a Cesàro argument). This proves that$$
Var[P_n(A)]  \; \rightarrow \;  0
$$for $n \rightarrow \infty$, and we can conclude that $P_n(A) \rightarrow P(A)$ in probability.
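
As an illustration of the covariance decay driving the proof, here is a sketch for a two-state Markov chain with switching probability $q$, started from its uniform stationary distribution; for this chain, $Cov(0,d)$ is known to decay like $(1-2q)^d$:

```python
import numpy as np

rng = np.random.default_rng(0)

q, n = 0.3, 200_000                       # switching probability, path length
flips = rng.random(n - 1) < q             # whether the state switches each step
x = (rng.integers(2) + np.concatenate(([0], np.cumsum(flips)))) % 2

# Empirical Cov(0, d) for A = {x = 1}, against the exact value 0.25 * (1-2q)^d.
for d in [0, 1, 2, 5, 10]:
    cov = np.mean(x[:n - d] * x[d:]) - 0.25
    print(d, cov, 0.25 * (1 - 2 * q) ** d)
```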

Applicability

Note that this mixing assumption is very strong. A periodic Markov chain, for instance, will usually not be mixing: Consider a process that alternates between odd and even values; the joint probability $P(evens \cap T^{-n} odds)$ will then alternate between $0$ and $P(evens)$ as $n$ runs through the natural numbers, rather than converge to the product $P(evens)P(odds)$. In many cases of practical interest, we therefore do in fact need the increased generality of the classic ergodic theorem.
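
A tiny simulation of this alternating example (with a uniformly random starting parity) shows the failure of mixing directly:

```python
import numpy as np

rng = np.random.default_rng(0)

# The path alternates deterministically between an even and an odd value.
starts = rng.integers(0, 2, 100_000)   # 0 = begin on an even value
for n in range(6):
    joint = np.mean((starts == 0) & ((starts + n) % 2 == 1))
    print(n, joint)   # alternates ~0 and ~0.5 instead of converging to 0.25
```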

Wednesday, September 9, 2015

Billingsley: Probability and Measure, Ch. 1.4

Prompted by what I found to be quite confusing about Patrick Billingsley's presentation of Kolmogorov's zero-one law, I've been reflecting a bit on the essence of the proof, and I think I've come to a deeper understanding of the core of the issue now.

The theorem can be stated as follows: Let $S$ be an infinite collection of events equipped with a probability measure $P$ (which, by Kolmogorov's extension theorem, is determined by its values on the finite subsets of $S$). Suppose further that $A$ is some event in the $\sigma$-extension of $S$ which is independent of any finite selection of events from $S$. Then $A$ either has probability $P(A)=0$ or $P(A)=1$.

The proof is almost evident from this formulation: Since $A$ is independent of any finite selection of events from $S$, the measures $P(\,\cdot\,)$ and $P(\,\cdot\,|\,A)$ coincide on all events that can be defined in terms of a finite number of events from $S$. But by Kolmogorov's extension theorem, this means that the conditional and the unconditional measures extend the same way to the infinite collection. Hence, if $P(A)>0$, this equality also applies to $A$, and thus $P(A\,|\,A)=P(A)$. This implies that $P(A)^2=P(A)$, which is only satisfied by $P(A)=1$.

So the core of the proof is that independence of finite selections implies independence in general. What makes Billingsley's discussion of the theorem appear a bit like black magic is that he first goes through a series of steps to define independence in the infinite case before he states the theorem. But this makes things more murky than they are in Kolmogorov's own statement of the theorem, and it hides the crucial limit argument at the heart of the proof.


An example of a "tail event" which is independent of all finite evidence is the occurrence of infinitely many of the events $A_1, A_2, A_3, \ldots$. The reason that this is independent of an event $B \in \sigma(A_1, A_2, \ldots, A_N)$ is that$$
\forall x\geq 1 \, \exists y\geq x \, : A_y
$$is logically equivalent to$$
\forall x\geq N \, \exists y\geq x \, : A_y.
$$Conditioning on some initial segment thus does not change the probability of this event.

Note, however, that this is not generally the case for events of the type$$
\forall x\geq 1 \, \exists y\geq 1 \, : A_y.
$$It is only the "$y\geq x$" in the previous case that ensures the equivalence.

A statement like "For all $i$, $X_i$ will be even" is for instance not a tail event, since a finite segment can show a counterexample, e.g., $X_1=17$. Crucially, however, this example fails to be a tail event because each of the events "$X_i$ is even" (the inner quantification) can be written as a disjunction of finitely many simple events. We can thus give a counterexample to the outer quantification ($\forall i$) by exhibiting a single $i$ for which the negation of "$X_i$ is even" (which is a universal formula) is checkable in finite time.

Conversely, if this were not the case for any of the statements in the inner loop, the event would be a tail event. That is, if the universal quantification were over a list of events which had no upper limit on the potential index of the verifier, then finite data could not falsify the statement. This happens when the existence of a single potential verifier implies the existence of infinitely many (as it does in the case of "infinitely often" statements, since any larger $y$ is an equally valid candidate verifier).

Events of the form $\exists y: A_y$ are also not tail events, since they are not independent of the counterexamples $\neg A_1, \neg A_2, \neg A_3\ldots$. They are, however, independent of any finite selection of positive events (which do not entail the negation of anything on the list).

We thus have a situation in which sets at the two lowest levels of the Borel hierarchy can have probabilities of any value in $[0,1]$, but as soon as we progress beyond that in logical complexity, only the values 0 and 1 are possible.

Illustration by Yoni Rozenshein.

Oddly enough, this means that no probability theory is possible on complex formulas: When a probability measure is defined in terms of a set of simple events, then the $\Delta_2$ events always have probability 0 or 1. This property is conserved at higher orders in the hierarchy, since the quantifications that push us up the hierarchy are countable (and a countable union of null sets is a null set, by the union bound).

Note also that$$
\lim_{N\rightarrow \infty} \cup_{i=1}^{N} \cap_{j=1}^{N} A_{ij}
\;=\;  \cup_{i=1}^{\infty} \cap_{j=1}^{\infty} A_{ij},
$$and since $$
P\left( \cup_{i=1}^{\infty} \cap_{j=1}^{\infty} A_{ij} \right)  \;\in\;  \{0,1\},
$$the probabilities of the finite approximations must converge to these extremes. As we spend more computational power checking the truth of a tail event on a specific sample, we thus get estimates that approach 0 or 1 (although without any general guarantees of convergence speed).
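
As a concrete sketch, take independent events $A_n$ with $P(A_n)=1/n^2$: by Borel-Cantelli, the tail event "infinitely many $A_n$ occur" has probability 0, and the finite approximations $P(\exists n \geq x: A_n)$ indeed sink toward 0 as the cutoff $x$ grows:

```python
import numpy as np

N = 1_000_000                              # truncation of the infinite product
p = 1.0 / np.arange(1, N + 1) ** 2         # P(A_n) = 1/n^2
for x in [2, 10, 100, 1000]:
    tail = 1 - np.exp(np.sum(np.log1p(-p[x - 1:])))
    print(x, tail)   # the product telescopes to (x-1)/x, so this is about 1/x
```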

This sheds some light on what the theorem actually means in practice, and how it relates to the zero-one theorem for finite models of first-order logic.

Thursday, September 3, 2015

Billingsley: Ergodic Theory and Information (1965)

I'm reading this book (now for the second time) for its proof of Birkhoff's ergodic theorem. I was held up by lots of technical details the first time around, but I've gotten a bit further now.

Definitions

The set-up is the following: A state space $X$ is given, and on the set of sample paths $\Omega=X^{\infty}$ we have both a probability distribution and the time shift $T: \Omega\rightarrow \Omega$. For each integrable function $f: \Omega\rightarrow \mathbb{R}$, we consider the time-average$$
 A_n f(\omega) \;=\; \frac{1}{n} \sum_{i=0}^{n-1} f(T^i \omega).
$$We are interested in investigating whether such averages converge as $n\rightarrow\infty$. If so, we would also like to know whether the limit is the same on all sample paths.

Notice that $f$ is a function of the whole sample path, not just of a single coordinate. The function $f$ could for instance measure the difference between two specific coordinates, or it could return a limiting frequency associated with the sample path.

If $f$ has the same value on the entire orbit $$\omega,\, T\omega,\, T^2\omega,\, T^3\omega,\, \ldots$$we say that the function $f$ is time-invariant. Global properties, such as tail events or properties defined by quantification over all coordinates, are time-invariant.

A set $A$ is similarly time-invariant if $TA=A$. (I like to call such invariant sets "trapping sets.") This is a purely set-theoretic notion and doesn't involve the probability measure on the sample paths. However, we say that an invariant set $A$ is non-trivial if $0 < P(A) < 1$.

Statement

One of the things that is confusing about the so-called ergodic theorem is that it actually doesn't involve ergodicity very centrally. Instead it makes the following statement:
  1. Suppose that the time shift $T$ is measure-preserving in the sense that $P(A)=P(TA)$ for all measurable sets $A$. Then the time-averages of any integrable function converges almost everywhere:$$P\left\{\omega : \lim_{n\rightarrow \infty} A_n f(\omega) \textrm{ exists} \right\} \;=\; 1.$$
  2. Suppose in addition that the time shift is ergodic in the sense that all its trapping sets are trivial with respect to $P$. Then the time averages are all the same on a set of pairs with $(P\times P)$-probability 1:$$(P\times P) \left\{(\omega_1, \omega_2) : \lim_{n\rightarrow\infty} A_n f(\omega_1) \,=\, \lim_{n\rightarrow\infty} A_n f(\omega_2) \right\} \;=\; 1.$$
Almost all the mathematical energy is spent proving the first part of this theorem. The second part is just a small addendum: Suppose a function $f$ has a limiting time average which is not almost everywhere constant. Then the threshold set$$
\{\lim_{n\rightarrow\infty} A_n f \leq \tau\}
$$must have a probability strictly between 0 and 1 for some threshold $\tau$. (This is a general feature of random variables that are not a.e. constant.)

But since the limiting time-average is a time-invariant property (stating something about the right-hand tail of the sample path), this thresholding set is an invariant set. Thus, if the limiting time-averages are not a.e. unique, the time shift has a non-trivial trapping set.

So again, the heavy lifting lies in proving that the time-averages actually converge on almost all sample paths when the time shift is measure-preserving. Billingsley gives two proofs of this, one following an idea by von Neumann, and one following Birkhoff.

The Von Neumann Proof

Let's say that a function is "flat" if it has converging time-averages. We can then summarize von Neumann's proof along roughly these lines:
  1. Time-invariant functions are always flat.
  2. Increment functions of the form $$s(\omega) \;=\; g(\omega) - g(T\omega),$$ where $g$ is any function, are also flat.
  3. Flatness is preserved by taking linear combinations.
  4. Any square-integrable function can be obtained as the $L^2$ limit of (linear combinations of) invariant functions and increment functions.
  5. Moreover, the $L^2$ limit of a sequence of flat $L^2$ functions is flat; in other words, flatness is preserved by limits.
  6. Hence, all $L^2$ functions are flat.
This argument seems somewhat magical at first, but I think more familiarity with the concept of orthogonality and total sets in the $L^p$ Hilbert spaces could make it seem more natural.

Just as an example of how this might work, consider the coordinate function $f(\omega) = X_0(\omega)$. In order to obtain this function as a limit of our basis functions, fix a decay rate $r < 1$ and consider the function$$
g(\omega) \;=\; X_0(\omega) + r X_1(\omega) + r^2 X_2(\omega) + r^3 X_3(\omega) + \cdots
$$Then for $r \rightarrow 1$, we get a converging sequence of increment functions, $g(\omega) - g(T\omega) \;\rightarrow\; X_0(\omega)$. If the coordinates are centered and uncorrelated (as for a centered Bernoulli process), this convergence holds in the $L^2$ sense. Billingsley uses an abstract and non-constructive orthogonality argument to show that such decompositions are always possible.
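
Here is a numerical check of this trick (my own sketch, with centered $\pm 1$ coin flips): the $L^2$ error of the approximation $g(\omega) - g(T\omega) \approx X_0(\omega)$ comes out as $\sqrt{(1-r)/(1+r)}$, up to truncation, which vanishes as $r \rightarrow 1$:

```python
import numpy as np

rng = np.random.default_rng(0)

K, paths = 2000, 2000
X = rng.choice([-1.0, 1.0], size=(paths, K))   # centered coordinates

for r in [0.5, 0.9, 0.99, 0.999]:
    w = r ** np.arange(K)
    g = X @ w                     # g(omega)   = sum_k r^k X_k
    g_shift = X[:, 1:] @ w[:-1]   # g(T omega) = the path shifted one step
    residual = g - g_shift - X[:, 0]
    print(r, np.sqrt(np.mean(residual ** 2)),   # empirical L^2 error
          np.sqrt((1 - r) / (1 + r)))           # exact value for the full series
```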

In order to show that the limit of flat functions is also flat, suppose that $f_1, f_2, f_3, \ldots$ are all flat, and that $f_k \rightarrow f^*$. We can then prove that the averages $A_1f^*, A_2f^*, A_3f^*, \ldots$ form a Cauchy sequence by using the approximation $f^* \approx f_k$ to show that $$| A_n f^* - A_m f^* | \;\approx\; | A_n f_k - A_m f_k | \;\rightarrow\; 0.$$ Of course, the exact argument needs a bit more detail, but this is the idea.

After going through this $L^2$ proof, Billingsley explains how to extend the proof to all of $L^1$. Again, this is a limit argument, showing that the flatness is preserved as we obtain the $L^1$ functions as limits of $L^2$ functions.

The Birkhoff Proof

The second proof follows Birkhoff's 1931 paper. The idea here is to derive the ergodic theorem from the so-called maximal ergodic theorem (which Birkhoff just calls "the lemma"). This theorem states that if $T$ preserves the measure $P$, then$$
P\left\{ \exists n: A_n f > \varepsilon \right\}
\;\leq\;
\frac{1}{\varepsilon} \int_{\{f>\varepsilon\}} f \;dP.
$$I am not going to go through the proof, but here are a few observations that might give a hint about how to interpret this theorem:
  1. For any arbitrary function, we have the inequality $$P(f>\varepsilon) \;\leq\; \frac{1}{\varepsilon} \int_{\{f>\varepsilon\}}f\;dP.$$This is the generalization of Markov's inequality to random variables that are not necessarily positive.
  2. If $f$ is time-invariant, then $A_n f = f$, because the average $A_n f$ then consists of $n$ copies of the number $$f(\omega) \;=\; f(T\omega) \;=\; f(T^2\omega) \;=\; \cdots \;=\; f(T^{n-1}\omega).$$ In this case, the theorem thus follows from the generalized Markov bound above.
  3. Even though we do not assume that $f(\omega)=f(T\omega)$ for a fixed sample path $\omega$, all distributional statements we can make about $f(\omega)$ also apply to $f(T\omega)$ because $T$ is assumed to be measure-preserving; for instance, the two functions have the same mean when $\omega \sim P$. This fact of course plays a crucial part in the proof.
  4. By putting $(f,\varepsilon) := (-f,-\varepsilon)$, we see that the maximal ergodic theorem is equivalent to$$P\left\{ \exists n: A_n f < \varepsilon \right\}
    \;\leq\;
    \frac{1}{\varepsilon} \int_{\{f<\varepsilon\}} f \;dP.
    $$
The maximal ergodic theorem can be used to prove the general (or "pointwise") ergodic theorem. The trick is to find both an upper and a lower bound on the non-convergence probability
$$
q \;=\; P\{ \omega :\, \liminf A_n f(\omega) < a < b < \limsup A_n f(\omega) \},
$$with the goal of showing that these bounds can only be satisfied for all $a<b$ when $q=0$.

Comments

One of the things I found confusing about the ergodic theorem is that it is often cited as a kind of uniqueness result: It states, in a certain sense, that all sample paths have the same time-averages.

However, the problem is that when the time shift has several trapping sets, there are multiple probability distributions that make the time shift ergodic. Suppose that $T$ is non-ergodic with respect to some measure $P$; and suppose further that $A$ is a minimal non-trivial trapping set in the sense that it does not itself contain any trapping sets $B$ with $0 < P(B) < P(A) < 1$. Then we can construct a measure with respect to which $T$ will be ergodic, namely, the conditional measure $P(\,\cdot\,|\,A)$.

This observation reflects the fact that if we sample $\omega \sim P$, the sample path we draw may lie in any of the minimal non-trivial trapping sets for $T$. Although it might not be clear based on a finite segment of $\omega$ (because such a segment might not reveal which trapping set we're in), such a sample path will stay inside this minimal trap set forever, and thus converge to the time-average characteristic of that set.

A small complication here is that even a minimal trapping set might in fact contain other trapping sets that are smaller: For instance, the set of coin-flipping sequences whose averages converge to 0 is a trapping set, but so is the singleton containing the all-0 sequence. However, this singleton will either have positive probability (in which case the larger set is not minimal) or probability 0 (in which case it can never come up as a sample from $P$).
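
A quick simulation sketch of such a decomposition, using a mixture of two i.i.d. coins: the shift preserves the mixture measure but is not ergodic with respect to it, each bias class is a trapping set, and the time average of a sampled path converges to the bias of its own component rather than to the overall mean of 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)

for _ in range(5):
    bias = rng.choice([0.2, 0.8])        # which trapping set we land in
    path = rng.random(100_000) < bias    # i.i.d. coin with that bias
    print(bias, path.mean())             # matches the component's bias, not 0.5
```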

Lastly, suppose the sample path $\omega$ is not sampled from $P$, but from some other measure $Q$ which is not preserved by $T$. Can we say that $A_n Q \rightarrow P$, in the sense that time averages along a $Q$-sampled path converge to the corresponding $P$-expectations?

Not always, but a sufficient condition for this to be the case is that $Q$ is absolutely continuous with respect to $P$ ($Q \ll P$). If that is the case, then the $Q$-probability of observing a non-converging average is 0, since the $P$-probability of this event is 0, and $P$ dominates $Q$.

I am not sure whether this condition is also necessary. A degenerate measure $Q$ which places all its mass on a single sample path, or even on countably many, can often be detected by some integrable function $f$, such as the indicator function of everything except the points on those countably many sample paths.

Monday, May 25, 2015

Newman et al.: "An Event-Related fMRI Study of Syntactic and Semantic Violations" (2001)

This paper reports on a brain imaging study in which people were given either ordinary English sentences or sentences with one of two types of error (p. 349):
  • Yesterday I sailed Todd's hotel to China. (semantic violation)
  • Yesterday I cut Max's with apple. (syntactic violation)
The sentences were presented one word at a time. They don't seem to say when recordings started, but they do mention "critical" words (p. 139).

The results were the following:

The syntactic errors lit up Brodmann's area 6, which includes the premotor cortex. They also activated a spot in the superior temporal gyrus which was either Brodmann's area 21 or 22.

The semantic errors activated a number of regions, including several quite frontal regions. The strongest activation was inside the fissure between the two halves of the brain, that is, in the medial section of the cortex.

Here is a table:


And here is a picture (Fig. 1, p. 352):


The picture is ugly as hell, but I like that they are so infatuated with this beautiful brain: "The structural image used is from one subject, chosen for its particularly high quality … the choice of one anatomical image over another does not alter the localization of the results—the differences are purely aesthetic." (From the caption of Fig. 1.)

Regarding the syntactic observations, they comment:
Our findings suggest that the superior frontal gyri, including not only premotor cortex, but those portions of it which likely correspond to the supplementary motor area, are involved in the processing of syntactic phrase structure violations. While this is an unusual finding, … it is not unique: Ni et al. (2000, Experiment 2) found superior frontal gyrus activation in response to morphosyntactic violations. (p. 355)
What do we make of this? Well, it would seem that what you do when you read I cut with cake knife is quite literally comparable to pulling an emergency brake: You use the same inhibitory resources as those that are involved in holding yourself back from pushing the wrong button in a stop-go experiment, or training yourself to perform a complex series of movements with your fingers.

Lau, Phillips, and Poeppel: "A cortical network for semantics" (2008)

This paper contrasts two competing hypotheses about what the N400 signals:
  1. an "integration" theory which alleges that "the N400 effect … reflects the process of semantic integration of the critical word with the working context" (p. 921);
  2. a "lexical" theory which alleges that "the difference between the effects of anomalous and predictable endings arises not because of the anomaly but because predictable words in context are easier to access from memory" (p. 921).
The authors eventually decide that the lexical theory is more likely in view of the data. This conclusion is mostly based on a consideration of results from semantic priming experiments, which are supposed to isolate the task of retrieving a word from memory.

They use the following model of the anatomical distribution of linguistic processing (Fig. 2, p. 923):


Their conclusion is thus that the substrate responsible for the absence of an N400 effect is the pinkish rectangle at the top. It's the brain region which sits underneath your left ear, give or take.

In addition to this discussion, the paper contains a number of references to non-linguistic stimuli that can also produce an N400 effect:
  1. Barrett, S. E., Rugg, M. D. & Perrett, D. I. Event-related potentials and the matching of familiar and unfamiliar faces. Neuropsychologia 26, 105–117 (1988).
  2. Barrett, S. E. & Rugg, M. D. Event-related potentials and the semantic matching of faces. Neuropsychologia 27, 913–922 (1989). (Yes, that is a different title.)
  3. Barrett, S. E. & Rugg, M. D. Event-related potentials and the semantic matching of pictures. Brain Cogn. 14, 201–212 (1990).
  4. Holcomb, P. J. & McPherson, W. B. Event-related brain potentials reflect semantic priming in an object decision task. Brain Cogn. 24, 259–276 (1994).
  5. Ganis, G., Kutas, M. & Sereno, M. I. The search for “common sense”: an electrophysiological study of the comprehension of words and pictures in reading. J. Cogn. Neurosci. 8, 89–106 (1996).
  6. Van Petten, C. & Rheinfelder, H. Conceptual relationships between spoken words and environmental sounds: event-related brain potential measures. Neuropsychologia 33, 485–508 (1995).
  7. Ganis, G. & Kutas, M. An electrophysiological study of scene effects on object identification. Cogn. Brain Res. 16, 123–144 (2003).
  8. Sitnikova, T., Kuperberg, G. & Holcomb, P. J. Semantic integration in videos of real-world events: an electrophysiological investigation. Psychophysiology 40, 160–164 (2003).
  9. Sitnikova, T., Holcomb, P. J., Kiyonaga, K. A. & Kuperberg, G. R. Two neurocognitive mechanisms of semantic integration during the comprehension of visual real-world events. J. Cogn. Neurosci. 20, 2037–2057 (2008).
  10. Plante, E., Petten, C. V. & Senkfor, A. J. Electrophysiological dissociation between verbal and nonverbal semantic processing in learning disabled adults. Neuropsychologia 38, 1669–1684 (2000).
  11. Orgs, G., Lange, K., Dombrowski, J. H. & Heil, M. N400-effects to task-irrelevant environmental sounds: further evidence for obligatory conceptual processing. Neurosci. Lett. 436, 133–137 (2008).
That is a lot of reading.

Baggio and Hagoort: "The balance between memory and unification in semantics" (2011)

This paper contains some interesting anatomical details about how the N400 component might be produced.

Once Upon a Time

The theory proposed is essentially this:

First, the temporal regions of the brain get their cues from the visual cortex about 200ms after reading a word. Then, they go on to pull the right semantic building blocks out of the drawer. After that, they send a message to the inferior frontal parts of the cortex, which then help construct the right sentence-level meaning out of the building blocks.

The idea is then that if this construction process runs into trouble, an excess of negative charge is observed after another 200ms.

Cycles and Loops

The theory also spells out the directionality of the communication: The visual cortex can talk to the temple, but the temple can't talk back; the temple can talk to itself as well as to the forehead, but the forehead can only talk back to the temple, not to itself. This allegedly explains the timeline of activation in these two areas.

In schematic form, the feedback loops thus look as follows (their Fig. 3, p. 18):


In terms of the anatomy involved, these are the underlying brain structures (Fig. 2, p. 16):


The different little blobs have names, but they aren't much differentiated in the larger narrative.

Wheres, Whats, and Whens

It's a little surprising that the N400 should be attributed to the construction of meaning rather than simply to the matching of data with expectations.

The argument for this relies crucially on the assumption that the frontal brain regions are instrumental in provoking the N400: Much of the evidence cited on page 13 assumes that this is true and uses this to link the timeline constructed on the basis of EEG studies to the topological map constructed on the basis of fMRI studies.

However, if the increased activation in the frontal areas isn't in fact related to the N400, this data doesn't bear on the question: although ambiguity does indeed activate the frontal cortex, that does not necessarily imply that ambiguity is related to the N400.

Contents

Rather than go through the article step by step, I thought it might be good to provide an annotated table of contents of its six sections. Now that I see how long my comments became, I'm not so sure. At any rate, here it is:
  1. INTRODUCTION: We can understand "combinatorics" like syntactic analysis as a last resort for making sense of inputs that have not been seen before in the exact same form (p. 2).
    1. Memory, unification, and control: Memory is located in the temple, "unification" (i.e., the construction of complex forms out of simpler ones) in the frontal areas (p. 3).
    2. Unification the formal way: Think of lexical-functional grammar representations and the like. "This morning, I worked all night" does not unify.
    3. Semantics processing as constraint satisfaction: Memory supplies the building blocks, executive control stacks them together (p. 4).
    4. What a theory of semantics processing should do: It should account both for the ability to comprehend novelty (e.g., "The ship sailed into the living room") and for the aversive reactions to it.
  2. MEMORY AND TEMPORAL CORTEX: Simple recall is not enough even for cliches, because context matters too (p. 7).
    1. Memory and the N400: Based on what you've understood so far, you've built up an expectation; if that expectation is violated, you have an N400 response.
    2. The N400 and temporal cortex: "The N400 priming effect has been shown to have primary neuronal generators in temporal cortex" (pp. 8–9). The reference is to a paper by Ellen Lau et al.: "A cortical network for semantics" (2008).
  3. UNIFICATION AND FRONTAL CORTEX: In order to get "as much processing mileage as possible" out of recall routines, the brain looks for "the nearest attractor state at each word-processing step" (pp. 9–10). A footnote explains that this can either be explained with reference to Hopfield networks or predictive coding, but "The theory presented in Sections 4 and 5 is more consistent with the former framework, which is also a more standard approach to language in the brain than Bayes" (note 3, p. 10).
    1. Unification and Integration: Integration basically means selection from a menu based on converging evidence; unification means constructing new menu items in accordance with stated constraints (p. 10).
    2. Unification and the N400: The N400 cannot be tracking expectation or retrieval alone, since it is also evoked by sentences like "The journalist began the article," which require a post-hoc reconstrual of the verb "began" (pp. 12–13). It is, they say, "hard to see how a noncombinatorial account could explain these data" (p. 13), which are attributed to Baggio et al.: "Coercion and compositionality" (2010) and Kuperberg et al.: "Electrophysiological correlates of complement coercion" (2010). An additional discussion takes up the role of the inferior frontal gyrus in "unification."
    3. Unification and selection: Conventional wisdom has it that the frontal cortex is involved in "controlled retrieval" (p. 14). But word-association tasks show that it is more active when you try to find a word related to "wheel" than when you are looking for a word related to "scissors" (which is much easier and less ambiguous). Moreover, the voice experiments by van Berkum et al. (2008) show that other types of hard-to-combine stimuli activate the inferior frontal gyrus (pp. 14–15).
  4. THE BALANCE BETWEEN MEMORY AND UNIFICATION: The temporal areas do retrieval, integration, and prediction; the frontal areas do unification (p. 15).
    1. Connectivity matters: There are several pathways between the temple and the forehead, including some we had forgotten about (p. 16).
    2. Neurotransmitter dynamics: The temple talks to the forehead using the rapidly decaying transmitters AMPA and GABA; the forehead talks back using a much more slowly decaying transmitter called NMDA (pp. 17–18).
    3. A typical processing cycle: After about 200ms, the temple knows what the visual cortex sees; it then talks to itself to build predictions and to the forehead to trigger competitive selection (p. 19). The forehead cannot talk to itself, but only talk back to the temple (pp. 19–20).
  5. A DYNAMIC ACCOUNT OF THE N400: A first sweep retrieves relevant representations, and a second sweep back checks consistency.
    1. The N400 component: The wave itself is always present and reflects the feedback within the temple (p. 21).
    2. The N400 effect: If a word is unexpected in the sense of having lower "semantic relatedness" to the preceding context, the wave is higher (p. 22). This should allegedly be the result of the forehead talking back to the temple.
    3. Testability and falsifiability: This is all very scientific and stuff. Specifically, "patients with focal lesions in [the frontal areas] BA 45 or BA 47 are expected to show at most the onset of an N400 response [not a full-fledged one,] corresponding, in the theory, to the feed-forward spread of activation from sensory areas and inferior temporal cortex to MTG/STG." (p. 23).
  6. CONCLUSIONS: Thus, the N400 is a reaction to a message from the front of the class: "The theory explains the N400 as the result of the summation of currents injected by frontal cortex due to the local spread of activation to neighbouring neuronal populations (pre-activation). In our theory, pre-activation and unification are not independent step-like processes, suggesting mutually exclusive accounts of the N400" (p. 24).

Friday, May 22, 2015

Brizendine: The Female Brain (2006)

This is a bio-fundamentalist tract insisting that all gender issues can be explained in terms of hormones and brain wiring.

It's got a bunch of references, but they are given in unmarked endnotes which refer, in turn, to a bibliography. This means that it's quite hard to find the reference that's supposed to back any specific claim.

So what I want to do here is to make the connection explicit. I'll go through a little snippet of the text, pointing out directly where the information comes from. I've also looked at the underlying texts, and perhaps not surprisingly, it often turns out that Brizendine refers to things that are irrelevant or have conclusions opposite to what she is claiming.

The part of the text that I focus on comes from chapter six, "Emotion: The Feeling Brain." I include only quotes from the first three subsections, "The Biology of Gut Feelings," "Getting Through to the Male Brain," and "When He Doesn't Respond the Way She Wants Him To." This corresponds to pages 117–126 in the .pdf copy that I got my hands on.

I don't reproduce all references. It's too long anyway.

Claim:
As he starts to speak, her brain carefully searches to see if what he says is congruent with his tone of voice. If the tone and meaning do not match, her brain will activate wildly. Her cortex, the place for analytical thinking, would try to make sense of this mismatch. She detects a subtle incongruence in his tone of voice—it is a little too over-the-top for his protestations of innocence and devotion. His eyes are darting a bit too much for her to believe what he is saying. The meaning of his words, the tone of his voice, and the expression in his eyes do not match. She knows: he is lying. (pp. 118–119)
Sources:
  • Schirmer, A., S. A. Kotz, et al. (2002). “Sex differentiates the role of emotional prosody during word processing.” Brain Res Cogn Brain Res 14 (2): 228–33.
  • Schirmer, A., S. A. Kotz, et al. (2005). “On the role of attention for the processing of emotions in speech: Sex differences revisited.” Brain Res Cogn Brain Res 24 (3): 442–52.
  • Schirmer, A., T. Striano, et al. (2005). “Sex differences in the preattentive processing of vocal emotional expressions.” Neuroreport 16 (6): 635–39.
The first study found that "women make an earlier use of emotional prosody during word processing as compared to men," while the second found that "the presence of sex differences in emotional-prosodic priming depends on whether or not participants are instructed to take emotional prosody into account."

The third one seems to have investigated the detection of meaning/tone-of-voice mismatches and the abstract states: "Independent of the listeners' sex, deviants elicited a mismatch negativity in the scalp-recorded event-related potential as an indicator of preattentive acoustic change detection. Only women, however, showed a larger mismatch negativity to emotional than to neutral deviants."

Claim:
Maneuvering like an F-15, Sarah’s female brain is a high-performance emotion machine—geared to tracking, moment by moment, the non-verbal signals of the innermost feelings of others. (p. 119)
Source:
  • Brody, L. R. (1985). “Gender differences in emotional development: A review of theories and research.” J Pers 53:102–49.
From the abstract: "Studies suggest that with development, boys increasingly inhibit the expression and attribution of most emotions, whereas girls increasingly inhibit the expression and recognition of socially unacceptable emotions, e g, anger. These differences may be a function of different socialization processes for males and females, which may be adaptations to innate gender differences in temperament, or adaptations to existing sociocultural pressures."

Claim:
By contrast, Nick, like most males, according to scientists, is not as adept at reading facial expressions and emotional nuance—especially signs of despair and distress. (p. 119)
Source:
  • Hall, L. A., A. R. Peden, et al. (2004). “Parental bonding: A key factor for mental health of college women.” Issues Ment Health Nurs 25 (3): 277–91.
This paper is about the kinds of parent-child relationships that are predictive of depression in college-aged women. This appears to have no relation to the claim made above.

Claim:
Women know things about the people around them—they feel a teenage child’s distress, a husband’s flickering thoughts about his career, a friend’s happiness in achieving a goal, or a spouse’s infidelity at a gut level. (p. 120)
Source:
  • Naliboff, B. D., S. Berman, et al. (2003). “Sex-related differences in IBS patients: Central processing of visceral stimuli.” Gastroenterology 124 (7): 1738–47.
IBS is "irritable bowel syndrome," the stomach disease. From what I understand, the researchers did a PET scan of patients that either currently had a "moderate rectal inflation" or were merely asked to think about one.

Their conclusions were that among the people with inflamed intestines, "women showed greater activation in the ventromedial prefrontal cortex, right anterior cingulate cortex, and left amygdala, whereas men showed greater activation of the right dorsolateral prefrontal cortex, insula, and dorsal pons/periaqueductal gray. Similar differences were observed during the anticipation condition. Men also reported higher arousal and lower fatigue."

So it seems that there are sex differences in the way we react to diarrhea. I'm not sure that tells us much about "gut feelings."

Claim:
Some of this increased gut feeling may have to do with the number of cells available in a woman’s brain to track body sensations. After puberty, they increase. (p. 120)
Source:
  • Leresche, L., L. A. Mancl, et al. (2005). “Relationship of pain and symptoms to pubertal development in adolescents.” Pain 118 (1–2): 201–9.
This study showed, based on questionnaires, that an array of different pain symptoms become more common as girls and boys reach puberty. From the abstract:
Prevalence of back pain, headache and [jaw] pain increased significantly … and stomach pain increased marginally with increasing pubertal development in girls. Rates of somatization, depression and probability of experiencing multiple pains also increased with pubertal development in girls (P<0.0001). For boys, prevalence of back … and facial pain … increased, stomach pain decreased somewhat and headache prevalence was virtually unchanged with increasing maturity. For both sexes, pubertal development was a better predictor of pain than was age.
This does not prove anything about the underlying brain substrates. It only tracks who hurt in what way, according to their own reports.

Claim:
The estrogen increase means that girls feel gut sensations and physical pain more than boys do. (p. 120)
Sources:
  • Lawal, A., M. Kern, et al. (2005). “Cingulate cortex: A closer look at its gut-related functional topography.” Am J Physiol Gastrointest Liver Physiol 289(4): G722–30.
  • Derbyshire, S. W., T. E. Nichols, et al. (2002). “Gender differences in patterns of cerebral activation during equal experience of painful laser stimulation.” J Pain 3 (5): 401–11.
The first of these studies is an MRI study of where in the brain the two genders predominantly show activity when they feel or think about pain in the rectum.

They showed some similarities, but also differences in terms of the activity in the front part of the limbic cortex, which according to Wikipedia "appears to play a role in a wide variety of autonomic functions, such as regulating blood pressure and heart rate." There was also a difference depending on whether the subjects were consciously thinking about their gut: "In contrast to male subjects, females exhibit increased activity in response to liminal nonpainful stimulation compared with subliminal stimulation suggesting differences in cognition-related recruitment."

I'd have to check the details to be sure of how the subjects were actually stimulated, but as far as I can see, the paper is not actually about differences in how the genders feel, only about what their brains look like in the scanner.

The second study, Derbyshire et al, zapped men and women with an equal amount of laser energy to the back of their hand and recorded where their brains lit up:
The female subjects required less laser energy before reporting pain, but the difference was not significant. … There was significantly greater activation in the left, contralateral, prefrontal, primary and secondary somatosensory, parietal, and insula cortices in the male subjects compared with the female subjects and greater response in the perigenual cingulate cortex in the female subjects.
So, "girls feel … physical pain more than boys do"? Perhaps, but this study doesn't conclude that.

Claim:
Some scientists speculate that this greater body sensation in women punches up the brain’s ability to track and feel painful emotions, too, as they register in the body. (p. 120)
Source:
  • Lawal, op. cit.

Claim:
The areas of the brain that track gut feelings are larger and more sensitive in the female brain, according to brain scan studies. (p. 120)
Source:
  • Butler, T., H. Pan, et al. (2005). “Fear-related activity in subgenual anterior cingulate differs between men and women.” Neuroreport 16 (11): 1233–36.
This study trained subjects to have a fear response by giving them mild electric shocks, and then recorded what their brain activity looked like. The women in the study showed more activity in the "deep" brain structures, and the authors concluded that this suggested a "greater susceptibility of women to anxiety."

I'm not sure the paper said anything about the gut, or about the size of the relevant brain structures, though. But perhaps there is a remark about that somewhere in there.

Claim:
Therefore, the relationship between a woman’s gut feelings and her intuitive hunches is grounded in biology. (p. 120)
Source:
  • Levenson, R. W. (2003). “Blood, sweat, and fears: The autonomic architecture of emotion.” Ann NY Acad Sci 1000:348–66.
I don't know exactly what she's referring to here. The paper does have one short section (9 sentences) on the organ responses that follow from emotional reactions though. Its conclusion:
Although the primary role of the [autonomous nervous system] in emotion is usually thought to be providing physiological support for action, many of these autonomic adjustments [e.g., erection, dilation of pupils, sweating] create appearance changes that have strong signal value. Most prominent are those that produce visible changes in color, moisture, protrusion, and in the appearance of the eyes. … That humans make decisions, plan strategies, and regulate their behavior in response to these signs of underlying autonomic activity in others underscores the utility and value of these signs as indicators of emotional states. (pp. 356-357)
So are "a woman’s gut feelings … grounded in biology"? This paper rather seems to say that your literal inner organs have a certain reaction for a certain evolutionary purposes, although the claim is far from fleshed out.

Claim:
When a woman begins receiving emotional data through butterflies in her stomach or a clench in the gut—as Sarah did when she finally asked Nick if he was seeing someone else—her body sends a message back to the insula and anterior cingulate cortex. The insula is an area in an old part of the brain where gut feelings are first processed. The anterior cingulate cortex, which is larger and more easily activated in females, is a critical area for anticipating, judging, controlling, and integrating negative emotions. A woman’s pulse rate jumps, a knot forms in her stomach—and the brain interprets it as an intense emotion. (p. 120)
Sources:
  • Butler, op. cit. (the electric shock study)
  • Pujol, J., A. Lopez, et al. (2002). “Anatomical variability of the anterior cingulate gyrus and basic dimensions of human personality.” Neuroimage 15 (4): 847–55.
The study by Pujol et al. actually does have some relevance here. They measured the size and symmetry of the anterior limbic cortex in 50 men and 50 women, and then saw if they could find any correlations between asymmetries and personality traits. It turns out they could, as women tended to have a more asymmetric cingulate cortices, and be more fearful:
Anatomical data revealed that … a prominent right anterior cingulate was more frequent in women than in men. … Both women and men with larger right anterior cingulate described themselves as experiencing greater worry about possible problems, fearfulness in the face of uncertainty, shyness with strangers, and fatigability. … Our observations suggest that a large right anterior cingulate is related to a temperamental disposition to fear and anticipatory worry in both genders and that a higher prevalence of these traits in women may be coupled with a greater expansion of this brain region.
Note that the personality test they used is based on self-reported character traits. 31 out of the 50 women had a larger right anterior cingulate, while this was true for 21 of the men. About 24% of the variance on the so-called Harm Avoidance personality trait was accounted for by the larger size of the right anterior cingulate.

This study gets bonus points for including images of two brains that look different (p. 848):


Always worth reminding oneself that the goop inside your head or my head may only resemble the standard textbook brain very loosely.

Let's take a step back: What did Brizendine actually claim? First, that the anterior cingulate cortex is "larger and more easily activated in females," and second, that when "a woman" experiences physiological responses like an increased pulse, "the brain interprets it as an intense emotion."

Both of these claims seem to be substantiated by the studies, although the gender differences are quite small. For instance, the average surface area of the anterior cingulate gyrus in women was 946 mm² (s.d. = 204), while the corresponding number for the men was 918 mm² (s.d. = 256). Clearly, these two populations do not differ significantly (see also Pujol et al.'s own statistical analysis, p. 850).

Claim:
Jane’s observations were so minute that to Evan she appeared to be reading his mind. This often unnerved him. Jane had watched Evan’s eyes and facial expression and correctly inferred what was going on in his brain. (p. 121)
Source:
  • Rotter, N. G. (1988). “Sex differences in the encoding and decoding of negative facial emotions.” Journal of Nonverbal Behavior 12:139–48.
From their abstract: "Overall, females exceeded males in their ability to recognize emotions whether expressed by males or by females. As an exception, males were superior to females in recognizing male anger. The findings are discussed in terms of social sex-roles."

Claim:
Men don’t seem to have the same innate ability to read faces and tone of voice for emotional nuance. (p. 121)
Sources:
  • Campbell, A. (2005). “Aggression.” In Handbook of Evolutionary Psychology, ed. D. Buss, 628–52. New York: Wiley.
  • Rosip, J. C., J. A. Hall (2004). “Knowledge of nonverbal cues, gender, and non-verbal decoding accuracy.” Journal of Nonverbal Behavior, Special Interpersonal Sensitivity, Pt. 2. 28 (4): 267–86.
  • Weinberg, M. K. (1999). “Gender differences in emotional expressivity and self-regulation during early infancy.” Dev Psychol 35 (1): 175–88.
Strange that she would use the word "innate" here without providing any documentation.

Claim:
A study at California State University, Sacramento, of psychotherapists’ success with their clients showed that therapists who got the best results had the most emotional congruence with their patients at meaningful junctures in the therapy. These mirroring behaviors showed up simultaneously as the therapists comfortably settled into the climate of the clients’ worlds by establishing good rapport. All of the therapists who showed these responses happened to be women. Girls are years ahead of boys in their ability to judge how they might avoid hurting someone else’s feelings or how a character in a story might be feeling. (pp. 121–122)
Sources:
  • Raingruber, B. J. (2001). “Settling into and moving in a climate of care: Styles and patterns of interaction between nurse psychotherapists and clients.” Perspect Psychiatr Care 37 (1): 15–27.
  • McClure, E. B. (2000). “A meta-analytic review of sex differences in facial expression processing and their development in infants, children, and adolescents.” Psychol Bull 126 (3): 424–53.
  • Hall, J. A. (1978). “Gender effects in decoding nonverbal cues.” Psychol Bull 85: 845–57.
  • Hall, J. A. (1984). Nonverbal sex differences: Communication accuracy and expressive style. Baltimore: Johns Hopkins University Press.

Claim:
This ability might be the result of the mirror neurons firing away, allowing girls not only to observe but also to imitate or mirror the hand gestures, body postures, breathing rates, gazes, and facial expressions of other people as a way of intuiting what they are feeling. (p. 122)
Source:
  • None provided.

Claim:
Sometimes, other people’s feelings can overwhelm a woman. My patient Roxy, for example, gasped every time she saw a loved one hurt him- or herself—even when they did something as minor as stub a toe—as if she were feeling their pain. Her mirror neurons were overreacting, but she was demonstrating an extreme form of what the female brain does naturally from childhood and even more in adulthood—experience the pain of another person. (p. 122)
Sources:
  • Singer, T., B. Seymour, et al. (2004). “Empathy for pain involves the affective but not sensory components of pain.” Science 303 (5661): 1157–62.
  • Idiaka, T. (2001). “Age-related differences in the medial temporal lobe responses to emotional faces.” Society for Neuroscience, New Orleans.
  • Zahn-Waxler, C., B. Klimes-Dougan, et al. (2000). “Internalizing problems of childhood and adolescence: Prospects, pitfalls, and progress in understanding the development of anxiety and depression.” Dev Psychopathol 12 (3): 443–66.
From Singer et al.: "We assessed brain activity in the female partner while painful stimulation was applied to her or to her partner's right hand through an electrode attached to the back of the hand." They did not perform the reverse experiment, with the men in the scanner while their partners were shocked.

Idiaka et al. had their subjects perform a gender discrimination task while looking at happy and sad faces, but the study did not actually check for any gender differences in performance.

I haven't been able to get a hold of the last paper, but its abstract makes no reference to a gender dimension.

Claim:
At the Institute of Neurology at University College, London, researchers placed women in an MRI machine while they delivered brief electric shocks, some weak and some strong, to their hands. Next, the hands of the women’s romantic partners were hooked up for the same treatment. The women were signaled as to whether the electric shock to their beloveds’ hands were weak or strong. The female subjects couldn’t see their lovers’ faces or bodies, but even so, the same pain areas of their brains that had activated when they themselves were shocked lit up when they learned their partners were being strongly shocked. The women were feeling their partners’ pain. Like walking in another’s brain, not just his shoes. Researchers have been unable to elicit similar brain responses from men. (pp. 122–123)
Sources:
  • Singer, op. cit.
  • Singer, T., B. Seymour, et al. (2006). “Empathic neural responses are modulated by the perceived fairness of others.” Nature 439 (7075): 466–69.
These remarks mix up the conclusions of two different studies. This is seriously misleading and not defensible scholarship.

As mentioned before, the first study did not test the men and therefore naturally did not "elicit similar brain responses" from them.

The second study was not a your-lover-is-in-pain study, but an MRI study that investigated how men and women differed in brain response when they were playing a game against a cheater. It concluded that men "empathize with fair opponents while favouring the physical punishment of unfair opponents."

Claim:
In a study on the aftereffects of frightening films, women were more likely to lose sleep than men.
Source:
  • Harrison, K., ed. (1999). “Tales from the screen: Enduring fright reactions to scary movies.” Media Psychology, Spring: 15–22.
Sounds reasonable, but is plucked out of thin air: "Sex was not a significant predictor" (p. 108); "we did not find sex differences" (p. 113).

This is clearly a blatant lie, and quite a dumb one at that.

Claim:
In the male brain, most emotions trigger less gut sensation and more rational thought.
Sources:
  • Naliboff, op. cit. (the PET scan of people with inflamed bowels)
  • Wrase, J., S. Klein, et al. (2003). “Gender differences in the processing of standardized emotional visual stimuli in humans: A functional magnetic resonance imaging study.” Neurosci Lett 348 (1): 41–45.
Another startlingly literal reading of "gut feeling."

The second study showed pictures to 10 men and 10 women and scanned their brains:
Men and women showed no significant difference in valence, arousal, skin conductance response and startle modulation. Only in men was amygdala activation observed in the pleasant condition. Furthermore, men showed a stronger brain activity for positive visual stimuli than women in the frontal lobe (inferior and medial frontal gyrus). In women, stronger brain activation for affectively negative pictures was observed in the anterior and medial cingulate gyrus.
So now the amygdala counts as being rational? That's news to me.

Claim:
A woman, because of her expert ability to read faces, will recognize the pursed lips, the squeezing around the eyes, and the quivering corners of the mouth as preludes to crying. A man will not have seen this buildup, so his response is usually “Why are you crying? Please don’t make such a big deal out of nothing. Being upset is a waste of time.” Researchers conclude that this typical scenario means the male brain must go through a longer process to interpret emotional meaning. (p. 124)
Sources:
  • McClure, E. B., C. S. Monk, et al. (2004). “A developmental examination of gender differences in brain engagement during evaluation of threat.” Biol Psychiatry 55 (11): 1047–55.
  • Lynam, D. (2004). “Personality pathways to impulsive behavior and their relations to deviance: Results from three samples.” Journal of Quantitative Criminology 20:319–41.
  • Dahlen, E. (2004). “Boredom proneness in anger and aggression: Effects of impulsiveness and sensation seeking.” Personality and Individual Differences 37:1615–27.
  • Hall, J. A., J. D. Carter, and T. G. Horgan (2000). “Gender differences in the nonverbal communication of emotion.” In A. H. Fischer, ed., Gender and Emotion: Social Psychological Perspectives, 97–117. London: Cambridge University Press.
I couldn't find the first of these studies. The second seems to be, as the title says, about "impulsive behavior," and I would be surprised if it concluded that "the male brain must go through a longer process to interpret emotional meaning." The third shows that bored people are more aggressive.

The fourth paper is a book chapter. Hall et al. report that there are "relatively large gender differences," but comment:
The word "relatively" is important here. In absolute terms psychological gender differences tend to be rather small. However, the nonverbal differences are larger than many other psychological gender differences (including cognitive skills, attitudes, personality, and other social behaviors) (Hall, 1998). (p. 98)
They are also cautious about biological determinism, reminding the reader that
it is not necessary to posit that differences between males' and females' nonverbal behaviors and skills have evolved biologically (pp. 98–99)
What exactly do "researchers conclude"?

Well, Hall and colleagues conclude that women smile more and are better at it, as well as better at recognizing smiles (p. 112). Had Brizendine only said that, the reference would have provided support for her assertion. Hall et al. do not, however, unambiguously attribute this to any specific anatomical or biological cause, so Brizendine is on her own with that claim.

Claim:
Tears in a woman may evoke brain pain in men. The male brain registers helplessness in the face of pain, and such a moment can be extremely difficult for them to tolerate. (p. 124)
Sources:
  • Campbell, A. (1993). Out of Control: Men, Women and Aggression. New York: Basic Books.
  • Campbell, A. (2005). “Aggression.” (see above)
  • Levenson (2003).
  • Frey, W. (1985). “Crying: The mystery of tears.” Winston Pr (September 1985).

Claim:
One study showed that newborn girls, less than twenty-four hours old, respond more to the cries of another baby—and to human faces—than do boys. (p. 125)
Source:
  • McClure, E. B. (2000). “A meta-analytic review of sex differences in facial expression processing and their development in infants, children, and adolescents.” Psychol Bull 126 (3): 424–53.
I haven't looked this up, but shouldn't a true biofundamentalist predict that women would get better rather than worse at this when they came of child-bearing age? I mean, just for the sake of argument.

Claim:
Men pick up the subtle signs of sadness in a female face only 40 percent of the time, whereas women can pick up these signs 90 percent of the time. (p. 125)
Source:
  • Erwin, R. J., R. C. Gur, et al. (1992). “Facial emotion discrimination: I. Task construction and behavioral findings in normal subjects.” Psychiatry Res 42(3): 231–40.
According to the abstract, 24 males and 10 females were asked to look at pictures of male and female faces and judge whether they were happy or sad. The findings were that "males had higher sensitivity scores for the detection of sad emotion. …  Compared with female subjects, male subjects (n = 10) were selectively less sensitive to sad emotion in female faces. Female subjects (n = 10) were more sensitive overall to emotional expression in male faces than in female faces."

10 out of 24 is 41.7%, so I guess that's where that number comes from. 10 out of 10 is more than 90%, though, so how did she get that number?

While we're at it, how about some confidence intervals? Using Chebyshev's inequality at a 95% confidence level, I find that the first number comes with an uncertainty of up to about 45 percentage points, while the second has an uncertainty of up to about 71 percentage points. So let's not build an empire on that result.
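Here is the computation I have in mind, as a minimal sketch (the exact figure for the first group depends on whether one plugs in the observed proportion or the worst case $p = 0.5$):

```python
from math import sqrt

def chebyshev_halfwidth(n, conf=0.95, p=0.5):
    """Chebyshev bound on |p_hat - p| at the given confidence level.

    From P(|X - mu| >= k * sigma) <= 1/k^2 we get k = sqrt(1 / (1 - conf)).
    The default p = 0.5 gives the worst-case (largest) binomial s.d.
    """
    k = sqrt(1 / (1 - conf))
    sigma = sqrt(p * (1 - p) / n)  # s.d. of a proportion estimated from n trials
    return k * sigma

print(chebyshev_halfwidth(24))           # ~0.46: worst case, n = 24 men
print(chebyshev_halfwidth(24, p=10/24))  # ~0.45: plugging in the observed rate
print(chebyshev_halfwidth(10))           # ~0.71: worst case, n = 10 women
```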

A final point: why "subtle"? I haven't seen the experimental materials, but knowing what I know about experimental psychology, I'm sure they were anything but subtle.

Claim:
And while men and women are both comfortable being physically close to a happy person, only women report that they feel equally comfortable being close to someone sad.
Source:
  • Mandal, M. K. (1985). “Perception of facial affect and physical proximity.” Percept Mot Skills 60 (3): 782.
When she says "physically close to," she means literally physically close to: "The experimental procedure required each S to stand 12 ft away from a life-size facial emotion projected on a screen and to step forward as close to that expression as possible to feel comfortable for interaction." In terms of the mean number of inches, "men preferred to be closer to an expression of happiness (M = 28.0 in.) than to one of sadness (M = 41.9 in.), while women approached both almost equally (sad: M = 23.5 in., happiness: M = 24.5 in.)."

So this claim seems to be warranted by the paper, quite literally.

Claim:
Men are used to avoiding contact with others when they themselves are going through an emotionally rough time. They process their troubles alone and think women would want to do the same. (pp. 125–126)
Source:
  • Cross, S. E., and L. Madson (1997). “Models of the self: Self-construals and gender.” Psychol Bull 122 (1): 5–37.
I don't have access to this paper, so I can't evaluate this claim fully. But judging from the abstract, I think the paper would support the statement that men are expected to act independently, but not necessarily that they believe women want to do the same.

Saturday, May 16, 2015

Neyman and Pearson: "On the Problem of the Most Efficient Tests of Statistical Hypotheses" (1933)

The Wikipedia page on the Neyman-Pearson lemma provides a statement and proof of the theorem which deviates quite a lot from the corresponding formulations in the article in which it originally appeared (pp. 300–301). I've been thinking a bit about how best to state and prove this theorem, and here's what I've come up with.

Dichotomies and Tests

Suppose two probability measures $P_0$ and $P_1$ on a set $\Omega$ are given, and that we are given a data set $X=x$ drawn from $\Omega$ according to one of these two distributions. Our goal is now to define a function $T: \Omega\rightarrow \{0,1\}$ which will return, for any data set, a guess at which of the two hypotheses the data came from.

Such a test will thus itself be a random variable $T=T(X)$. The quality of this test can be measured in terms of two error probabilities,
$$
P_0(T=1) \qquad \textrm{and} \qquad P_1(T=0).
$$Minimizing these two sources of error will generally be conflicting goals, as can be seen by considering the constant tests $T\equiv 0$ and $T\equiv 1$: each achieves zero error of one kind at the price of maximal error of the other.

Likelihood Ratio Tests and the Neyman-Pearson Lemma

One family of tests we always have at our disposal is the family of likelihood ratio tests. A likelihood ratio test at threshold $r$ is the indicator variable that returns the value 1 when the evidence supports hypothesis 1 by a factor of at least $r$:
$$
R(x) \; = \; \mathbb{I}\left( \frac{P_1(x)}{P_0(x)} \geq r \right).
$$The content of the Neyman-Pearson lemma is that these likelihood ratio tests are optimal, in a certain sense.

Specifically, suppose any other test $T$ is given. The lemma then states that
$$
P_0(T=1) \; \leq \; P_0(R=1) \qquad \Longrightarrow \qquad P_1(T=0) \; \geq \; P_1(R=0).
$$In other words, when the probability of an error of the first kind is held below a fixed level, we achieve the lowest rate of errors of the second kind by choosing a likelihood ratio test. You cannot achieve a lower error rate under $P_0$ without increasing your error rate under $P_1$.
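To make the trade-off concrete, here is a small Monte Carlo sketch (my own toy example, not anything from the paper): take $P_0 = N(0,1)$ and $P_1 = N(1,1)$, whose likelihood ratio is monotone in $x$, and watch the two error rates move in opposite directions as the threshold $r$ varies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Likelihood ratio for P0 = N(0,1) vs P1 = N(1,1): P1(x)/P0(x) = exp(x - 1/2)
def likelihood_ratio(x):
    return np.exp(x - 0.5)

x0 = rng.normal(0.0, 1.0, 100_000)  # samples drawn under P0
x1 = rng.normal(1.0, 1.0, 100_000)  # samples drawn under P1

for r in [0.5, 1.0, 2.0]:
    err0 = np.mean(likelihood_ratio(x0) >= r)  # P0(R=1), error of the first kind
    err1 = np.mean(likelihood_ratio(x1) < r)   # P1(R=0), error of the second kind
    print(f"r = {r}: P0(R=1) = {err0:.3f}, P1(R=0) = {err1:.3f}")
```

Raising $r$ shrinks $P_0(R=1)$ while inflating $P_1(R=0)$; the lemma says no other test can beat this frontier.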

From $1\times 2$ to $2 \times 2$

The best way of proving this is to directly compare the regions of $\Omega$ on which the two tests disagree. There are two such regions, $T=1 \wedge R=0$ and $T=0 \wedge R=1$. The difference in the measures of these two regions under the two hypotheses determines the difference in error rates between the two tests.

A paraphrase of the lemma could then be that if
$$
P_0(T=1 \wedge R=0)  \;  \leq  \; P_0(T=0 \wedge R=1),
$$then we also have
$$
P_1(T=0 \wedge R=1)  \;  \geq  \; P_1(T=1 \wedge R=0).
$$This alternative formulation can be translated back into the original form by adding back in the corner regions $T=1 \wedge R=1$ and $T=0 \wedge R=0$ on which the two tests give the same result.

Ratio Conversions

Using this formulation, a proof strategy suggests itself: Looking at cases according to the value of $R$. Specifically, if an event $A\subseteq \Omega$ satisfies the condition
$$
A \; \subseteq \; \{x:\;R(x)=0\} \; = \; \left\{ x:\; \frac{P_1(x)}{P_0(x)}  \; <  \;  r\right\},
then $\frac{1}{r}P_1 < P_0$ pointwise on $A$. We can therefore obtain a lower bound on the $P_0$-measure of $A$ by integrating the smaller measure $\frac{1}{r}P_1$ instead:
$$
\frac{1}{r} P_1(A) \; \leq  \; P_0(A).
$$Similarly, if
$$
A \; \subseteq \;  \{ R=1 \} \; = \; \left\{ \frac{P_1}{P_0}  \; \geq  \;  r\right\},
$$then $\frac{1}{r}P_1 \geq P_0$ on $A$, and
$$
P_0(A) \; \leq \; \frac{1}{r} P_1(A).
$$Since the two sets $R=0$ and $R=1$ form a partition of $\Omega$, we can split any set $A$ up according to its overlap with these cells and thus translate bounds on $P_0$ into bounds on $P_1$.

The Extended Sandwich

Now, by applying these considerations to the condition
$$
P_0(T=1 \wedge R=0)  \;  \leq  \; P_0(T=0 \wedge R=1),
$$we obtain the extended sandwich
$$
\frac{1}{r} P_1(T=1 \wedge R=0)  \;  \leq  \; P_0(T=1 \wedge R=0)  \;  \leq  \; P_0(T=0 \wedge R=1)  \;  \leq  \; \frac{1}{r} P_1(T=0 \wedge R=1),
$$where the two outer inequalities are the ratio conversions from the previous section. Dropping the middle terms and cancelling the $\frac{1}{r}$, we thus get the result.

I like this formulation of the lemma because it makes vivid what is going on: When somebody splits $\Omega$ up into $\{T=0\}$ and $\{T=1\}$, we use our ratio test $R$ to split these further into two subregions each. This allows us to translate between the two measures, and thus to compare the $P_0$ error rates with the $P_1$ error rates.

It also gives better intuitions about the potential differences between the tests $R$ and $T$: the more the set $T=1$ overlaps with the set $R=1$, the more alike the two tests are, and the more tightly squeezed the sandwich of inequalities will be.
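As a sanity check on the two-region formulation, here is a brute-force verification on a five-point sample space (my own hand-picked example): enumerate every possible test $T$ and confirm that whenever the $P_0$-inequality holds, the $P_1$-inequality holds as well.

```python
import itertools

# Two hand-picked distributions on a five-point sample space
P0 = [0.30, 0.25, 0.20, 0.15, 0.10]
P1 = [0.05, 0.10, 0.20, 0.25, 0.40]
r = 1.0
R = [int(p1 / p0 >= r) for p0, p1 in zip(P0, P1)]  # the likelihood ratio test

def measure(P, region):
    """Total P-probability of the points flagged by the region indicator."""
    return sum(p for p, inside in zip(P, region) if inside)

# Enumerate every possible test T: Omega -> {0, 1}
for T in itertools.product([0, 1], repeat=5):
    disagree_10 = [t == 1 and rr == 0 for t, rr in zip(T, R)]  # T=1 and R=0
    disagree_01 = [t == 0 and rr == 1 for t, rr in zip(T, R)]  # T=0 and R=1
    if measure(P0, disagree_10) <= measure(P0, disagree_01):
        assert measure(P1, disagree_01) >= measure(P1, disagree_10)

print("No counterexample found.")
```

Of course five points and one threshold prove nothing in general, but it is reassuring to see the sandwich hold for all $2^5$ tests at once.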

Wednesday, May 13, 2015

Brynjolfsson, McAfee, and Spence: "New World Order" (2014)

In this article, two MIT management researchers and the economist Michael Spence argue that the increasing use of automation in production benefits capital, consumers, and the top of the economic pyramid, but tends to leave the vast majority with stagnating wages.

Here are some quotes:
Turn over your iPhone and you can read an eight-word business plan that has served Apple well: "Designed by Apple in California. Assembled in China." … More and more companies have been riding the two great forces of our era—technology and globalization—to profits. (p. 45)
Even as the globalization story continues, however, an even bigger one is starting to unfold: the story of automation, including artificial intelligence, robotics, 3-D printing, and so on. And this second story is surpassing the first, with some of its greatest effects destined to hit relatively unskilled workers in developing nations.
Visit a factory in China's Guangdong Province, for example, and you will see thousands of young people working day in and day out on routine, repetitive tasks, such as connecting two parts of a keyboard. Such jobs are rarely, if ever, seen anymore in the United States or the rest of the rich world. But they may not exist for long in China and the rest of the developing world either, for they involve exactly the type of tasks that are easy for robots to do. As intelligent machines become cheaper and more capable, they will increasingly replace human labor, especially in relatively structured environments such as factories and especially for the most routine and repetitive tasks. To put it another way, offshoring is often only a way station on the road to automation. (p. 46)
The growing capabilities of automation threaten one of the most reliable strategies that poor countries have used to attract outside investment: offering low wages to compensate for low productivity and skill levels. And the trend will extend beyond manufacturing. Interactive voice response systems, for example, are reducing the requirements for direct person-to-person interaction, spelling trouble for call centers in the developing world. Similarly, increasingly reliable computer programs will cut into transcription work now often done in the developing world. In more and more domains, the most cost-effective source of "labor" is becoming intelligent and flexible machines as opposed to low-wage humans in other countries. (pp. 46–47)
Network effects, whereby a product becomes more valuable the more users it has, can also generate these kinds of winner-take-all or winner-take-most markets. Consider Instagram, the photo-sharing platform, as an example of the economics of the digital, networked economy. The 14 people who created the company didn't need a lot of unskilled human helpers to do so, nor did they need much physical capital. They built a digital product that benefited from network effects, and when it caught on quickly, they were able to sell it after only a year and a half for nearly three-quarters of a billion dollars—ironically, months after the bankruptcy of another photography company, Kodak, that at its peak had employed some 145,000 people and held billions of dollars in capital assets. (p. 50)
But as research by one of us (Brynjolfsson) and Heekyung Kim has shown, a portion of the growth [in wages for top executives] is linked to the greater use of information technology. Technology expands the potential reach, scale, and monitoring capacity of a decision-maker by magnifying the potential consequences of his or her choices. Direct management via digital technologies makes a good manager more valuable than in earlier times, when executives had to share control with long chains of subordinates and could affect only a smaller range of activities. (pp. 50–51)

Tuesday, May 12, 2015

Miller: "The Magical Number Seven, Plus or Minus Two" (1956)

This classic paper was based on a lecture George Miller gave in 1955. It explains his thoughts on the limits of human information processing and the mnemonic techniques we can use to overcome them.

There are several .html and .doc transcriptions of the text available online, as well as scans of the original as it appeared in the Psychological Review.

The Rubbery Transmission Rate

Miller opens the paper by considering a number of experiments that suggest that people can distinguish about four to ten objects when the objects vary along only a single dimension. He reports, for instance, on experiments with sounds that differ in volume, chips that differ in color, and glasses of water that differ in salt content.

Miller's Figures 1, 2, 3, and 4 (pp. 83, 83, 85, and 85)

These results suggest a deeply rooted human trait, he submits:
There seems to be some limitation built into us either by learning or by the design of our nervous systems, a limit that keeps our channel capacities in this general range. On the basis of the present evidence it seems safe to say that we possess a finite and rather small capacity for making such unidimensional judgments and that this capacity does not vary a great deal from one simple sensory attribute to another. (p. 86)
The problem with this claim is that the actual information content differs hugely depending on what kinds of items you remember: five English words add up to 50 bits, seven decimal digits amount to 23 bits, and, naturally, eight binary digits are 8 bits. So something is clearly going wrong with this hypothesis:
For example, decimal digits are worth 3.3 bits apiece. We can recall about seven of them, for a total of 23 bits of information. Isolated English words are worth about 10 bits apiece. If the total amount of information is to remain constant at 23 bits, then we should be able to remember only two or three words chosen at random. (p. 91)
But this uniformity is of course not what we observe.
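The arithmetic behind these numbers is just $\log_2$ of the number of equally likely alternatives per item; Miller's figure of roughly 10 bits per isolated word corresponds to an effective vocabulary of about $2^{10} = 1024$ words. A quick check:

```python
from math import log2

# Information per item = log2(number of equally likely alternatives)
bits_per_decimal_digit = log2(10)  # ~3.32 bits
bits_per_binary_digit = log2(2)    # exactly 1 bit
bits_per_word = 10                 # Miller's estimate, ~2**10 = 1024 words

print(7 * bits_per_decimal_digit)  # ~23 bits in seven decimal digits
print(8 * bits_per_binary_digit)   # 8 bits in eight binary digits
print(5 * bits_per_word)           # 50 bits in five isolated words
```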

Dits and Dots

To work around the aporia, Miller introduces an ad hoc concept:
In order to capture this distinction in somewhat picturesque terms, I have fallen into the custom of distinguishing between bits of information and chunks of information. Then I can say that the number of bits of information is constant for absolute judgment and the number of chunks of information is constant for immediate memory. The span of immediate memory seems to be almost independent of the number of bits per chunk, at least over the range that has been examined to date. (pp. 92–93)
For example:
A man just beginning to learn radio-telegraphic code hears each dit and dah as a separate chunk. Soon he is able to organize these sounds into letters and then he can deal with the letters as chunks. Then the letters organize themselves as words, which are still larger chunks, and he begins to hear whole phrases. … In the terms I am proposing to use, the operator learns to increase the bits per chunk. (p. 93)

Something for Nothing

Miller goes on to report on an experiment carried out by someone named Sidney Smith (not included in the bibliography). By teaching himself to recode binary sequences into digits of a larger base, Smith managed to push his memory span up to about 40 binary digits. Miller comments:
It is a little dramatic to watch a person get 40 binary digits in a row and then repeat them back without error. However, if you think of this merely as a mnemonic trick for extending the memory span, you will miss the more important point that is implicit in nearly all such mnemonic devices. The point is that recoding is an extremely powerful weapon for increasing the amount of information that we can deal with. (pp. 94–95)
That's clearly true, but Miller gives no hint as to what the relationship is between this chunking technique and the information-theoretical concepts with which he started.

Mathematically speaking, an encoding is a probability distribution, and a recoding is just a different probability distribution. Learning something does not magically increase your probability budget: by the Kraft inequality, the only way you can make some codewords shorter is to make others longer. Simply mapping random binary expansions onto equally random decimal expansions should not make a difference. So it's hard to see what the connection between bits and chunks is supposed to be.
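To see the recoding trick in isolation, here is a toy version of what Smith presumably practiced (my own sketch; Miller describes several such schemes): read a 40-digit binary string three digits at a time as octal digits. The information content stays fixed at 40 bits, but the number of chunks drops from 40 to 14, which is exactly the bits-versus-chunks discrepancy left unexplained.

```python
import random

random.seed(0)
bits = "".join(random.choice("01") for _ in range(40))

# Recode: read the binary string three digits at a time as one octal digit.
# (The final chunk carries only a single leftover binary digit.)
octal = "".join(str(int(bits[i:i + 3], 2)) for i in range(0, len(bits), 3))

print(bits, len(bits))    # 40 chunks if every binary digit is its own chunk
print(octal, len(octal))  # the same 40 bits of information in 14 chunks
```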