
Tuesday, September 17, 2013

Gaskell and Marslen-Wilson: "Lexical Ambiguity Resolution and Spoken Word Recognition" (2001)

If you pronounce the phrase
  • worn building
and you do this relatively quickly, there is a good chance that it comes out as something close to
  • ['wɔɹm'bɪldɪŋ]
instead of
  • ['wɔɹn'bɪldɪŋ].
This phenomenon is known as consonant assimilation. In this particular case, it happens because the peripheral (non-coronal) consonant /b/ at the beginning of building makes a preceding coronal consonant such as the /n/ in worn harder to pronounce than a peripheral consonant such as the /m/ in warm. Your lips simply have to move less to say something like em-beh than to say something like en-beh.

The lip position used for an [m] is close to that used for a [b]
(picture from an online book by Michael Gasser)

In principle, this means that the sound string ['wɔɹm'bɪldɪŋ] presents the hearer with an ambiguity: Was the intended original message worn building or warm building? Both are plausible given the observed signal, because both are occasionally pronounced this way.

Assimilation effects can thus — in certain, relatively rare, cases — add another decoding problem to the already quite substantial ambiguity of words like warm.

Noisy Channel Hearing

This raises a question about the psycholinguistics of hearing: What would happen if we plugged a sound ambiguity like this into an experimental paradigm designed to track the process of word sense selection? Would people perhaps show traces of an active inference from sound to word, and competition between various hypotheses?

This is the question Gareth Gaskell and William Marslen-Wilson investigate in this paper. Their idea is to play back an ambiguous sound string to their subjects and then check whether the subjects are faster at recognizing a word on a computer screen if that word could have been the source of the sound string before consonant assimilation. For instance:
  • Voice: The ceremony was held in June and the sunny weather added to the air of celebration. An article about the bribe made the [Screen: bride] local paper.
  • Voice: The conditions in the outback were difficult for driving. In the intense heat, the mug cracked up [Screen: mud] completely.
  • Voice: We were impressed by her stylish delivery and intonation. Jane finished off the seam beautifully. [Screen: scene]
These test sentences are then compared to another condition in which the phonetics of the sentences do not warrant any backwards inference to a different sound form:
  • Voice: The ceremony was held in June and the sunny weather added to the air of celebration. An article about the bribe turned up [Screen: bride] in the local paper.
  • Voice: The conditions in the outback were difficult for driving. In the intense heat, the mug turned to [Screen: mud] dust.
  • Voice: We were impressed by her stylish delivery and intonation. Jane finished off the seam deftly. [Screen: scene]
These sentences cannot have come about by assimilation effects, so there is no basis for a reconstructive inference. For instance, bribe turned is not easier to pronounce than bride turned, so there is no reason to hypothesize that bribe is a distorted form of bride.

In the Face of Overwhelming Evidence

The main result of the whole paper is that there is indeed a significant difference between the cases where the phonological context supports an inference (e.g., mug cracked) and the cases where it doesn't (e.g., mug turned). This is, however, only the case if the discursive context also strongly suggests the same reconstructive inference (Experiment 3).

It's also worth noting that the effects are tiny. On average, subjects took 522 milliseconds to recognize the phonologically warranted form (e.g., mud from mug cracked) and 537 milliseconds to recognize the phonologically unwarranted form (e.g., mud from mug turned). This is a difference of 15 milliseconds, or a drop of 2.8% in decision time. It's statistically significant, but it's not big.

They also found that if you remove the discursive bias from the materials, this effect disappears (Experiments 1 and 2). There is, for instance, no priming effect in the following sentence:
  • Voice: An article about the bribe turned up [Screen: bride] in the local paper.
It is thus only when both the discursive and the phonological context support the inference that it leaves a measurable trace.

It is conceivable that there is an activation effect in the other case as well, but that it simply is so minuscule that we can't see it. But at any rate, this finding makes sense if we think about the inference as a kind of naive Bayes accumulation of evidence in favor of a hypothesis.
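
To make that idea concrete, here is a toy naive Bayes combination of evidence for the mug/mud example. The numbers are entirely made up by me for illustration; they have nothing to do with the paper's materials or results.

    words = ["mud", "mug"]

    prior = {"mud": 0.5, "mug": 0.5}      # baseline word plausibility (made up)

    # Evidence source 1: discursive context ("outback", "intense heat").
    p_context = {"mud": 0.7, "mug": 0.3}  # P(context | intended word)

    # Evidence source 2: phonological context. A following /k/ licenses the
    # assimilation /d/ -> [g], so the surface [g] is compatible with both
    # words; without a following velar it would only fit "mug".
    p_surface = {"mud": 0.3, "mug": 1.0}  # P(surface [g] | word, before /k/)

    # Naive Bayes: multiply the independent evidence terms and renormalize.
    score = {w: prior[w] * p_context[w] * p_surface[w] for w in words}
    z = sum(score.values())
    posterior = {w: score[w] / z for w in words}
    print(posterior)   # roughly {'mud': 0.41, 'mug': 0.59}

The point is just that each source of evidence shifts the posterior a bit; a weak phonological cue on its own may not move it enough to show up as priming.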

Friday, May 17, 2013

Ravi and Knight: "Bayesian Inference for Zodiac and Other Homophonic Ciphers" (2011)

A homophonic cipher is a one-to-many encryption scheme — that is, it can substitute a single plaintext letter with several different cipher symbols. The advantage of using such one-to-many mappings (rather than one-to-one substitution ciphers) is that they give a flatter output distribution when the number of cipher symbols per plaintext letter is proportional to the frequency of that letter.
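
As a rough illustration (my own toy example, not anything from the paper), here is how a homophonic cipher can allocate cipher symbols in proportion to letter frequency and thereby flatten the ciphertext distribution:

    import random
    from collections import Counter

    # Toy homophonic key: frequent letters get more homophones, so the
    # ciphertext symbol counts come out much flatter than the letter counts.
    key = {
        "e": ["01", "02", "03", "04"],   # frequent letter: many homophones
        "t": ["05", "06", "07"],
        "a": ["08", "09"],
        "h": ["10"],
        "l": ["11"],
        "s": ["12"],
    }

    def encrypt(plaintext):
        return [random.choice(key[c]) for c in plaintext]

    ciphertext = encrypt("thelateseat")
    print(ciphertext)
    print(Counter(ciphertext))   # close to one occurrence per symbol

Even though e and t dominate the plaintext, no single cipher symbol dominates the output.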

Yet, they are not perfect codes; although the monogram frequencies in the ciphertext can be close to uniform, the skewed distribution of various cipher bigrams and trigrams constrains the possible encryption schemes that are likely to have been used. If you can find an intelligent way of navigating through the space of possible encryption schemes, such ciphers may thus still be cracked.

Bayesian Deciphering

This is what Sujith Ravi and Kevin Knight do in this paper. They have a quite straightforward model of the English language and a not quite so straightforward model of homophonic encryption, and they use these two models to compute the posterior probability of a plaintext string given the cipher string. However, since the space of possible encryption schemes is astronomically large, they need to select their hypotheses in a clever way.
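
In noisy-channel terms, the quantity being evaluated is the usual Bayesian product of the language model and the encryption (channel) model:

    \[ P(\text{plaintext} \mid \text{cipher}) \;\propto\; P(\text{plaintext}) \cdot P(\text{cipher} \mid \text{plaintext}) \]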

The technique they apply to accomplish this is Gibbs sampling — that is, they repeatedly resample one dimension of their current position at a time. In the current context, this means conjecturing a new plaintext letter as the preimage of a specific cipher symbol while keeping all other hypotheses constant.

Because the surrounding text is decrypted according to the current assumptions, different conjectures will have different posterior probabilities, determined by the prior probability of the plaintext strings they correspond to. Walking around the space of encryption schemes in this way, the model will spend most of its time at places where the plaintext hypotheses have high probability (i.e., good "Englishness").
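
A bare-bones sketch of such a resampling loop might look as follows. This is my own sketch, not Ravi and Knight's implementation: the scoring function stands in for their language model, and the channel model is left out for simplicity.

    import math
    import random

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"

    def gibbs_decipher(cipher, log_p_plaintext, iterations=1000):
        # Start from an arbitrary key: every cipher symbol maps to some letter.
        key = {sym: random.choice(ALPHABET) for sym in set(cipher)}
        for _ in range(iterations):
            sym = random.choice(list(key))      # pick one cipher symbol
            # Score every possible preimage for it, keeping the rest fixed.
            log_scores = []
            for letter in ALPHABET:
                key[sym] = letter
                plaintext = "".join(key[c] for c in cipher)
                log_scores.append(log_p_plaintext(plaintext))
            # Sample a new preimage in proportion to the resulting scores.
            m = max(log_scores)
            weights = [math.exp(s - m) for s in log_scores]
            key[sym] = random.choices(ALPHABET, weights=weights, k=1)[0]
        return key

The expensive part is rescoring the whole plaintext for every candidate letter, which is exactly what the cut-and-paste trick below is meant to avoid.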

Cut-and-Paste Resampling

There is a further technical quirk of their model which I'm not quite sure how they implemented: They state (p. 244) that when they resample the preimage of some cipher symbol, they snip out all the windows containing the relevant cipher symbol and glue them onto the end of the cipher instead.

If I understand this correctly, it means this:

Suppose your cipher is IFMMPSPQME and your current decryption hypothesis is hellewerld. You then resample the cipher symbol P, perhaps selecting o as its preimage (instead of e).

Since P occurs in the two overlapping contexts MPS and SPQ within IFMMPSPQME, you thus snip these out of the cryptogram, leaving IFM ME (whitespace only included for readability).

Pasting the two cut-outs at the end, you obtain IFM ME MPS SPQ. You then evaluate the posterior probability of this hypothesis by asking your language model for the probability of helldlowwor.
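
If I have read the description right, a toy reconstruction of this simple version (my own sketch with a window of one symbol on each side, not their code) could look like this:

    cipher = "IFMMPSPQME"
    mapping = {"I": "h", "F": "e", "M": "l", "P": "e", "S": "w", "Q": "r", "E": "d"}

    def rearrange(cipher, symbol, width=1):
        """Snip out every window of +/- `width` around `symbol` and append the windows."""
        positions = [i for i, c in enumerate(cipher) if c == symbol]
        windows = [cipher[max(0, i - width): i + width + 1] for i in positions]
        covered = set()
        for i in positions:
            covered.update(range(max(0, i - width), min(len(cipher), i + width + 1)))
        remainder = "".join(c for i, c in enumerate(cipher) if i not in covered)
        return remainder + "".join(windows)

    rearranged = rearrange(cipher, "P")      # 'IFMME' + 'MPS' + 'SPQ'
    mapping["P"] = "o"                       # conjecture a new preimage for P
    print("".join(mapping[c] for c in rearranged))   # -> 'helldlowwor'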

In fact, the idea is a little more complicated than this (and not quite as unreasonable), as the window sizes are determined by possible word boundaries. In the example above, a much larger window might in fact be snipped out, since the algorithm would plausibly recognize some word boundaries around helle and werld, or around hello and world (I don't know exactly how they decide where the word boundaries go).

The Language and Channel Models

The language model that Ravi and Knight use is slightly unusual in two ways:
  1. It gradually adapts to the bigrams already seen in the plaintext. Letters towards the end of the hypothesized plaintext are thus selected partly on the basis of what happened at the beginning of the text (under the current plaintext hypothesis).
  2. It combines a model based on word frequencies (with 90% weight) and a model based on n-gram frequencies (with 10% weight).
With respect to the latter point, I suppose they must be using some kind of quite generous smoothing; the Zodiac cipher they crack has several spelling mistakes and contains 8 non-existent words out of 100.
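
Reading the 90/10 combination as a simple linear interpolation (which is my guess at what it amounts to; the paper may define the combination differently), the language model score of a plaintext hypothesis p would be something like

    \[ P(p) \;=\; 0.9 \cdot P_{\text{word}}(p) \;+\; 0.1 \cdot P_{\text{n-gram}}(p) \]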

I also don't know how they decide where to put the word boundaries, but this is a problem that can be solved efficiently with dynamic programming techniques. As they comment (p. 245), the n-gram model is going to do most of the discriminatory work in the beginning (when the encryption hypothesis is still largely random), but as the hypotheses get more and more accurate, the word-based model will start to drive the sampling to a greater extent.
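
For the record, a word segmentation of an unsegmented letter string can be found with a standard Viterbi-style dynamic program along these lines (my own sketch of the generic technique, not necessarily what Ravi and Knight do):

    import math

    def segment(text, log_p_word, max_word_len=20):
        """Best split of `text` into words under a unigram word model."""
        n = len(text)
        best = [0.0] + [-math.inf] * n    # best[i]: best log-prob of text[:i]
        back = [0] * (n + 1)              # back[i]: start of the last word
        for i in range(1, n + 1):
            for j in range(max(0, i - max_word_len), i):
                score = best[j] + log_p_word(text[j:i])
                if score > best[i]:
                    best[i] = score
                    back[i] = j
        # Recover the segmentation by walking the backpointers.
        words, i = [], n
        while i > 0:
            words.append(text[back[i]:i])
            i = back[i]
        return list(reversed(words))

    # Example with a tiny toy vocabulary (unknown strings get a penalty):
    vocab = {"hello": -3.0, "world": -4.0, "hell": -5.0, "o": -8.0}
    print(segment("helloworld", lambda w: vocab.get(w, -20.0)))  # ['hello', 'world']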

Internal Training

The adaptive part of the language model is expressed by the fraction on the bottom of page 242. This fraction can be split into two parts,
  1. a bigram model trained on an external corpus;
  2. a bigram model trained on the left-hand part of the plaintext hypothesis.
These two models are given weights a/(a + k) and k/(a + k), respectively, where k counts the occurrences of the preceding letter in the hypothesized plaintext string to the left of the current position; a is a hyperparameter which Ravi and Knight set to a = 10,000.
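
If the fraction has the familiar cache-model form, this split is just the algebraic identity below (my notation, not the paper's), where P_ext is the external corpus bigram model, c(xy) counts the bigram xy in the plaintext hypothesis to the left of the current position, and k = c(x):

    \[ \frac{a \, P_{\text{ext}}(y \mid x) + c(xy)}{a + k} \;=\; \frac{a}{a + k} \, P_{\text{ext}}(y \mid x) \;+\; \frac{k}{a + k} \cdot \frac{c(xy)}{k} \]

The second term's factor c(xy)/k is then the internal bigram model estimated from the hypothesis so far.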

Thus, as k increases, the corpus model is given progressively less weight and the internal model progressively more. In other words, the more decrypted data we have, the more we trust the statistics of our own plaintext hypothesis. Since a is as high as it is, though, the internal model never gets anywhere near outweighing the external one.

I'm actually not quite sure whether the "left of" part in k makes any difference, since they glue the windows around the resampled cipher symbol onto the end of the cipher anyway. But maybe I'm missing something.