Notebooks on Language: Ravi and Knight: "Bayesian Inference for Zodiac and Other Homophonic Ciphers" (2011)

Friday, May 17, 2013

Ravi and Knight: "Bayesian Inference for Zodiac and Other Homophonic Ciphers" (2011)

A homophonic cipher is a one-to-many encryption scheme — that is, it can substitute a single plaintext letter with several different cipher symbols. The advantage of using such one-to-many mappings (rather than one-to-one substitution ciphers) is that they give a flatter output distribution when the number of cypher symbols per plaintext letter is proportional to the frequency of that letter.

Yet, they are not perfect codes; although the monogram frequencies in the ciphertext can be close to uniform, the skewed distribution of various cipher bigrams and trigrams constrain the possible encryption schemes that are likely to have been used. If you can find an intelligent way of navigating through the space of possible encryption schemes, such schemes may thus still be cracked.

Bayesian Deciphering

This is what Sujith Ravi and Kevin Knight do in this paper. They have a quite straightforward model of the English language and a not quite so straightforward model of homophonic encryption, and they use these two models to compute the posterior probability of a plaintext string given the cipher string. However, since the space of possible encryption schemes is astronomical, they need to select their hypotheses in a clever way.

The technique they apply to accomplish this is Gibbs sampling — that is, they repeatedly resample one dimension of their current position at a time. In the current context, this means conjecturing a new plaintext letter as the preimage of a specific cipher symbol while keeping all other hypotheses constant.

Because the surrounding text decrypted per assumption, different conjectures will have different posterior probabilities, determined by the prior probability of the plaintext strings that they correspond to. Walking around on the space of encryption schemes this way, the model will spend most of its time at places where the plaintext hypotheses have high probability (i.e., good "Englishness").

Cut-and-Paste Resampling

There is a further technical quirk of their model which I'm not quite sure how they implemented: They state (p. 244) that when they resample the preimage of some cipher symbol, they snip out all the windows containing the relevant cipher symbol and glue it to the end of the cipher instead.

If I understand this correctly, it means this:

Suppose your cipher is IFMMPSPQME and your current decryption hypothesis is hellewerld. You then resample the cipher symbol P, perhaps selecting o as its preimage (instead of e).

Since P occurs in the two contexts IFMMPSPQME and IFMMPSPQME, you thus snip these out of the cryptogram, leaving IFM ME (whitespace only included for readability).

Pasting the two cut-outs at the end, you obtain IFM ME MPS SPQ. You then evaluate the posterior probability of this hypothesis by asking your language model for the probability of helldlowwor.

In fact, the idea is a little more complicated than this (and not quite as unreasonable), as the window sizes are determined by possible word boundaries. In the example above, a much larger window might in fact be snipped out, since the algorithm would plausibly recognize some word boundaries around helle and werld, or around hello and world (I don't know exactly when they decide on where the word boundaries go).

The Language and Channel Models

The language model that Ravi and Knight use is slightly unusual in two ways:

It gradually adapts to the bigrams already seen in the plaintext. Letters towards the end of the hypothesized plaintext source are thus selected according to what happened in the beginning of the text (according to the plaintext hypothesis).
It combines a model based on word frequencies (with 90% weight) and a model based on n-gram frequencies (with 10% weight).

With respect to the latter point, I suppose they must be using some kind of quite generous smoothing; the Zodiac cipher they crack has several spelling mistakes and contains 8 non-existing words out of 100.

I also don't know how they decide where to put the word boundaries, but this is a problem that can be solved efficiently with dynamic programming techniques. As they comment (p. 245), the n-gram model is going to do most of the discriminatory work in the beginning (when the encryption hypothesis is still largely random), but as the hypotheses get more and more accurate, the word-based model will start to drive the sampling to a higher extent.

Internal Training

The adaptive part of the language model is expressed by the fraction on the bottom of page 242. This fraction can be split into two parts,

a bigram model trained on an external corpus;
a bigram model trained on the left-hand part of the plaintext hypothesis.

These two models are given weights a/(a + k) and k/(a + k), respectively, where k counts the occurrences of the preceding letter in the hypothesized plaintext string left of the current position; a is a hyperparamter which Ravi and Knight set to a = 10,000.

Thus, as k increases the corpus model is given progressively less weight, and the cipher model is given progressively more. In other words, the more decrypted data we have, the more we trust the cipher model. Since a is as high as it is, the internal model never gets anywhere near outweighing the external.

I'm actually not quite sure whether the "left of" part in k makes any difference, since they glue the windows around the resampled cipher symbol onto the end of the cipher anyway. But maybe I'm missing something.

Notebooks on Language