I haven't written here in a while, but I've recently learned a few things I want to write down for later reference. The goal of the current post is to sketch a proof given by Khalil Sima'an for the NP-completeness of the probability-assignment problem under data-oriented parsing.
Background: Data-Oriented Parsing
Data-oriented parsing (DOP) is a type of probabilistic language model that can be extracted from a treebank. Unlike a probabilistic context-free grammar, the language model does not consist of non-terminal expansion probabilities, but rather of a distribution over incomplete parse trees. These probabilities are estimated by snipping out every subtree of every tree in a data set of manually parsed sentences.
Because the distribution created by the model can assign probability to very large parse trees, this scheme can encode a lot of memory and cross-dependency. Whereas probabilistic context-free grammars are Markovian (the expansion of a non-terminal depends only on the type of that non-terminal), data-oriented parsing models can encode arbitrarily long dependencies within the tree.
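To make the notion of "snipping out subtrees" concrete, here is a small sketch (my own illustration, not Sima'an's construction) of fragment extraction. Trees are encoded as `(label, children)` tuples; a fragment keeps the root's children but may cut any non-terminal descendant off at its label, leaving an open substitution site.

```python
from itertools import product

# A parse tree is (label, children); a bare string is a terminal leaf.
def fragments(tree):
    """Yield every DOP fragment rooted at the root of `tree`."""
    label, children = tree
    options = []
    for child in children:
        if isinstance(child, str):            # terminal: always kept
            options.append([child])
        else:
            # either cut here (frontier non-terminal) or recurse deeper
            options.append([(child[0], ())] + list(fragments(child)))
    for combo in product(*options):
        yield (label, combo)

tree = ("S", [("NP", ["John"]), ("VP", [("V", ["sleeps"])])])
frags = list(fragments(tree))        # 6 fragments rooted at S
```

Collecting the fragments rooted at every internal node of the treebank, with counts, gives the DOP distribution.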
The tree-joining operation used to derive parse trees in DOP.
DOP models derive parse trees by joining incomplete trees together until all of their leaves are terminals. A single parse tree can therefore have several derivations. DOP models thus have one more source of ambiguity in addition to the structural ambiguity present in context-free grammars. This is the reason why many data-oriented parsing problems are harder than the corresponding parsing problems in formalisms in which parse trees are primitives.
A One-Dimensional Analogy
The difference between a DOP model and a context-free grammar is analogous to the difference between a Markov model and a language model that can output words of any length.
We can fit such a distribution to a corpus by setting the probability of a word proportional to its frequency in the corpus, regardless of its length. The corpus ABC then results in a uniform distribution over
{A, B, C, AB, BC, ABC}.
Such a language model can obviously produce lots of long sequences of reasonable-sounding text, at the cost of having enormous memory requirements. The number of substrings is quadratic in the length of the corpus.
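The one-dimensional model is simple enough to write out in a few lines. A sketch of fitting it to the corpus ABC (names are mine):

```python
def substrings(corpus):
    """All non-empty contiguous substrings -- the 'memorized words'."""
    return {corpus[i:j] for i in range(len(corpus))
            for j in range(i + 1, len(corpus) + 1)}

vocab = substrings("ABC")
# vocab == {"A", "B", "C", "AB", "BC", "ABC"}
prob = {w: 1 / len(vocab) for w in vocab}   # uniform fit, as in the text
```

For a corpus of length n with no repeated substrings, the vocabulary has n(n + 1)/2 entries, which is the quadratic growth mentioned above.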
Data-oriented parsing similarly encodes long-distance memory within the parse tree by simply memorizing large, ready-made subtrees. An inductive argument shows that the number of subtrees of a tree is exponential in the depth of the tree.
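The inductive argument can be checked numerically. Assuming a fragment keeps at least one full level of children, the number of fragments rooted at the root of a complete binary tree satisfies a simple recurrence: each child is either cut off at its label or replaced by one of its own fragments.

```python
def root_fragments(depth):
    """Fragments rooted at the root of a complete binary tree of the
    given depth (internal nodes non-terminal, leaves terminal)."""
    if depth == 1:                       # both children are terminals
        return 1
    # per child: 1 way to cut it off, plus all of its own fragments
    return (1 + root_fragments(depth - 1)) ** 2

counts = [root_fragments(d) for d in range(1, 6)]
# counts == [1, 4, 25, 676, 458329]
```

The growth is in fact even faster than exponential in the depth, which is the point: memorizing all fragments is expensive.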
A very large DOP grammar derived from just two sentences.
In both cases, it is possible to introduce rules that bias the distributions to favor shorter or longer segments if that seems desirable, but this makes no difference to the issues of computational complexity.
Background: The 3SAT Problem
Before we go into the mechanics of the complexity proof, a few pieces of specialized logical notation and terminology are necessary.
Sima'an shows that if one can decide whether a given sentence has a high probability under a DOP model, then one can also decide whether a logical formula in three variables is satisfiable (the 3SAT problem). I will start by reformulating the 3SAT problem in a form that lends itself more easily to the complexity proof.
Suppose we are given the logical formula
(p ∨ q ∨ ~r) ∧ (p ∨ ~q ∨ r) ∧ ... ∧ (~q ∨ s ∨ u)
consisting of m groups of three literals each, mentioning n ≤ 3m different variables. We are asked whether there is an assignment of truth values to the n logical variables that makes this formula true.
A positive answer to this question should come in the form of a truth assignment like
(T:p ∨ T:q ∨ T:~r) ∧ (T:p ∨ F:~q ∨ F:r) ∧ ... ∧ (F:~q ∨ T:s ∨ F:u)
For this assignment to qualify as a correct and positive answer, it must be
- true: it assigns the value T to at least one of the disjuncts in each group
- consistent: it assigns the same truth value to every occurrence of a variable x throughout the formula, and the opposite value to ~x
For example, (F:p) ∧ (F:p) is consistent but false, whereas (T:p) ∧ (T:~p) is true but inconsistent. It will be convenient below to allow inconsistent truth-value assignments as valid proposals for a solution to the 3SAT problem.
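The two requirements are easy to check mechanically. A sketch (my own encoding, not Sima'an's): a candidate solution is a list of groups, each group a list of `(assigned_value, negated, variable)` triples, so `(True, True, 'r')` means the literal ~r was assigned T.

```python
def is_true(groups):
    """Every disjunctive group has at least one literal assigned T."""
    return all(any(val for val, _, _ in group) for group in groups)

def is_consistent(groups):
    """Each variable gets one truth value; ~x gets the opposite of x."""
    assigned = {}
    for group in groups:
        for val, negated, var in group:
            implied = (not val) if negated else val  # value of bare var
            if assigned.setdefault(var, implied) != implied:
                return False
    return True

a = [[(False, False, 'p')], [(False, False, 'p')]]  # (F:p) ∧ (F:p)
b = [[(True, False, 'p')], [(True, True, 'p')]]     # (T:p) ∧ (T:~p)
```

Here `a` is consistent but false and `b` is true but inconsistent, matching the examples above.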
Truth-Value Assignments as Parse Trees
Since truth-value assignments can be represented by formulas like (F:p) ∧ (F:~q v T:r), they can be generated by a context-free grammar.
This grammar can be modified so that it eventually outputs the original logical formula with the truth values deleted—in the case of this example, (p) ∧ (~q v r). This can be accomplished by interpreting an assignment like T:~q as the name of a non-terminal which always produces the terminal ~q.
This modified grammar will always produce the same surface string, but that string can be the result of different parse trees. Since the parse tree contains non-terminal nodes with Ts and Fs in them, it can be interpreted as a (possibly inconsistent) truth-value assignment. Sima'an's proof uses a representation of this type to reduce the 3SAT problem to the problem of finding high-probability parse trees.
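A hypothetical rule set of this kind for the formula (p) ∧ (~q v r) can be written down directly (the rule names and the toy generator are my own, not Sima'an's grammar): each literal slot expands through either a T-labelled or an F-labelled non-terminal, both of which emit the bare literal.

```python
import random

rules = {
    "S": [["C1", "∧", "C2"]],
    "C1": [["(", "L_p", ")"]],
    "C2": [["(", "L_nq", "∨", "L_r", ")"]],
    # each slot may carry either truth value...
    "L_p":  [["T:p"], ["F:p"]],
    "L_nq": [["T:~q"], ["F:~q"]],
    "L_r":  [["T:r"], ["F:r"]],
    # ...but both choices yield the same terminal
    "T:p": [["p"]],   "F:p": [["p"]],
    "T:~q": [["~q"]], "F:~q": [["~q"]],
    "T:r": [["r"]],   "F:r": [["r"]],
}

def generate(symbol):
    if symbol not in rules:                    # terminal
        return [symbol]
    expansion = random.choice(rules[symbol])
    return [tok for s in expansion for tok in generate(s)]

sentence = " ".join(generate("S"))
# always "( p ) ∧ ( ~q ∨ r )", whichever truth values were chosen
```

The surface string never varies, but the parse tree records a truth-value assignment in its T/F non-terminals.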
Most-Likely-DOP-Parse-Tree is NP-Complete
Sima'an shows that the following problem is at least as hard as the 3SAT problem:
Given a DOP language model P, a sentence s, and a threshold t,
decide whether s has a probability larger than t under the model P.
Suppose we know how to solve this problem, and let a new instance of the 3SAT problem with m conjuncts and n variables be given.
As shown above, candidate solutions to the 3SAT problem can be represented as parses of the logical formula. A parse is a correct solution when it satisfies both a truth requirement and a consistency requirement. Sima'an's proof constructs a DOP model that generates all the candidate solutions to the given problem, but generates the true and consistent solutions more frequently.
His grammar is constructed such that the formula only has two types of derivations:
- always-true derivations:
  - generate a shallow tree leaving the disjunctive groups unspecified
  - within each disjunctive group, pick one of the three disjuncts and expand it such that it becomes true
  - expand the remaining disjuncts of each group randomly
- probably-consistent derivations:
  - pick one of the n variables and generate a deep tree in which that variable is assigned a consistent truth value throughout the parse tree
  - expand the remaining disjuncts of each group randomly
Parse trees from the first group always make the formula true and are sometimes consistent. Parse trees from the second group are more likely than chance to be consistent and may occasionally be true. All parse trees result in the same surface string, which is equal to the logical formula.
If a parse tree is both true and consistent, it can be derived in at least one way using the first method, and in at least n ways using the second method. The probability of a parse tree that solves the 3SAT problem is therefore higher than the probability of a parse tree that does not. By choosing the probabilities of the various subtrees well, it is possible to pick a threshold such that the existence of a high-probability tree is exactly equivalent to a positive resolution of the 3SAT problem.
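The threshold argument can be illustrated with toy numbers (the numbers are my own assumption, not Sima'an's; the only structural fact used is that a tree's probability is the sum over its derivations, and that a solution tree has both kinds of derivations).

```python
n = 4                 # number of variables in the toy instance
a = 1e-6              # assumed probability of one always-true derivation
c = 1e-8              # assumed probability of one probably-consistent derivation

p_solution  = a + n * c   # true AND consistent: derivable both ways
p_true_only = a           # true but inconsistent: first method only
p_cons_only = n * c       # consistent but false: second method only

threshold = a + c         # strictly separates the two regimes
assert p_solution > threshold > max(p_true_only, p_cons_only)
```

With any such choice, asking whether some parse tree exceeds the threshold answers the original 3SAT question.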
The grammar constructed by Sima'an to translate 3SAT to most-likely-parse.
Obviously, I am leaving out some details here, but this is the idea of the proof. I should note that Sima'an's own proof has what might be considered a minor error, in that he reversed the roles of the truth values and the literals in the parse tree, so that the surface sentence is a string of truth values. This makes no sense because it assumes a solution to the 3SAT problem is given as part of the input, but fortunately, it is an easy error to fix.