The name "CYK" stands for "Cocke, Younger, Kasami." You might picture the team of Cocke, Younger, and Kasami working together in feverish excitement as they perfected their eponymous algorithm. In fact, Cocke's 1970 paper about the algorithm was predated by Younger's 1967 paper, which was predated by Kasami's 1965 paper.credit
The CYK algorithm is a simple application of dynamic programming to parsing. Dynamic programming [terminology] may be thought of as static memoization. A memoized function looks up its argument in a table, computing the result and adding it to the table if the argument isn't found. Rather than fill the table in this demand-driven way, you can do a preanalysis to figure out which entries are going to be needed in what order, and build the table in that order.
For example, if you define the Fibonacci function recursively, as

    fib(n) = if n ≤ 1 then 1 else fib(n-1) + fib(n-2)

a preanalysis would tell you that you're going to need all the values from 0 to n-1, which recovers the iterative algorithm from the recursive one.
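Here is a minimal Scala sketch of that recovered iterative algorithm (the table-filling style is the point; the name fib and the use of BigInt are my choices):

    // Bottom-up Fibonacci: the preanalysis says entries 0 through n-1
    // will be needed, so fill the table in index order instead of
    // recursing.
    def fib(n: Int): BigInt = {
      val table = Array.fill[BigInt](n + 1)(1)   // fib(0) = fib(1) = 1
      for (i <- 2 to n)
        table(i) = table(i - 1) + table(i - 2)
      table(n)
    }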
In the case of parsing, you can guess that to find all the ways of parsing a string of \(n\) words you might need all the ways of parsing each substring. Of course, these might turn out to be a superset of what you actually need, which is the sort of collateral damage that is routine in dynamic-programming applications.
We'll represent the table of partial parses using a hashtable, nowadays known as a hashmap or a hash, depending on what programming language you're using. We'll call it a HashMap in what follows.
We're given a string of n words, numbered starting at 0 (which might be an artificial Root "word"). We'll assume our grammar is in Chomsky Normal Form, which means that all rules are either of the form \(N \rightarrow N_1\,N_2\) or \(N \rightarrow \textit{Word}\). (Relaxing this assumption would not be difficult.)
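The pseudocode below leaves the grammar representation abstract. For concreteness, here is one possible encoding of CNF rules in Scala; the names Word, Nonterminal, Rule, BinaryRule, and LexicalRule are assumptions of mine, not anything fixed by the algorithm:

    type Word = String
    case class Nonterminal(name: String)

    sealed trait Rule { def lhs: Nonterminal }
    // N --> N1 N2
    case class BinaryRule(lhs: Nonterminal, rhs1: Nonterminal, rhs2: Nonterminal) extends Rule
    // N --> Word
    case class LexicalRule(lhs: Nonterminal, word: Word) extends Rule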
Using a sort of functional style, with symbols' types declared by putting a colon between the symbol and its type, and with type parameters in brackets, here's the algorithm:
def CYK(buffer: Vector[Word]): HashMap[(Int, Int), Set[Nonterminal]] = {
  // n is the number of words
  val n = buffer.length;
  // table((s,t)) = all nonterminals for which words s, s+1, ..., t
  //                are an example of that nonterminal
  val table = new mutable.HashMap[(Int, Int),       // key type
                                  Set[Nonterminal]  // value type
                                 ];
  // Initialize table:
  for (s <- 0 to n-1) {
    val word = buffer(s);
    table((s,s)) = set of all nonterminals N such that N --> word
  }
  // -- We've now filled the "main diagonal" of the "matrix"
  //    stored in the HashMap table.
  //    Main nested loops of the algorithm: the outer loop goes
  //    through the diagonals in increasing order of substring
  //    length.  The last diagonal consists of the single entry
  //    for (0, n-1). --
  for (length <- 2 to n)
    for (s <- 0 to (n - length)) {
      val t = s + length - 1;
      // We're going to fill table((s,t)) with all the nonterminals N
      // such that the substring from s to t is an N
      table((s,t)) = empty set;
      for (q  <- s to (t - 1);
           n1 <- table((s,q));
           n2 <- table((q+1, t));
           r  <- rules in the grammar whose right-hand side is n1 n2) {
        table((s,t)) = table((s,t)) + r.lhs
      }
    }
  return table
}
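The English fragments in that pseudocode ("set of all nonterminals...", "rules in the grammar...") depend on how the grammar is stored. Under the assumption above that a grammar is just a Set[Rule], one concrete rendering might look like this; treat it as a sketch, not the definitive implementation:

    import scala.collection.mutable

    def cyk(buffer: Vector[Word], grammar: Set[Rule]): mutable.HashMap[(Int, Int), Set[Nonterminal]] = {
      val n = buffer.length
      val table = new mutable.HashMap[(Int, Int), Set[Nonterminal]]
      // Main diagonal: one cell per word.
      for (s <- 0 until n)
        table((s, s)) = grammar.collect {
          case LexicalRule(lhs, w) if w == buffer(s) => lhs
        }
      // Later diagonals, in increasing order of substring length.
      for (length <- 2 to n; s <- 0 to n - length) {
        val t = s + length - 1
        table((s, t)) =
          (for (q <- s until t) yield
             grammar.collect {
               case BinaryRule(lhs, n1, n2)
                   if table((s, q)).contains(n1) && table((q + 1, t)).contains(n2) => lhs
             }).flatten.toSet
      }
      table
    }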
The table can be thought of as an upper-triangular matrix, filled by starting with the main diagonal and moving northeast. When cell (0, n-1) is filled, you are done. If it contains the symbol S, the string can be generated starting from the S symbol, and so is a sentence.
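Using the cyk sketch above, that test is a single lookup (assuming the start symbol is named "S"):

    val table = cyk(words, grammar)
    val isSentence = table((0, words.length - 1)).contains(Nonterminal("S"))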
Simple example: We'll take a somewhat simple-minded grammar, [change] so that some short sentences will belong to the language —
S --> V NP V --> "hit" Prep --> "with" S --> S PP V --> "tag" Prep --> "on" NP --> NP PP NP --> "tag" NP --> "telescopes" PP --> Prep NP NP --> "men"
Here's an example of the algorithm in action:
Sentence: "Tag men with telescopes"
The matrix is 4x4, but only the northeast half is used. (Hence the use of a HashMap to keep track of it rather than an array, for which we'd have to play distracting games with subscripts.)
Speaking of distracting games with subscripts, doing 1-based indexing is just painful, so I will assume words are numbered 0 to n-1. Hence our hash table has keys from (0,0) to (3,3). In general a key is of the form (s,t), which denotes the string from s to t inclusive (so t ≥ s).
The initialization loop populates cell (s,s) with nonterminals N such that the word W at position s is the RHS of a rule N --> W.
After initialization, the matrix looks like this:
      0     1     2     3
 0  NP,V
 1        NP
 2              Prep
 3                    NP
(If a word corresponds to no such rule, an exception must be thrown.)
Now we enter the nested loop that is the main part of the algorithm. The outer loop fills the entries in the table for longer and longer substrings. The inner loop looks at all the decompositions of the string (s, s + length - 1). On each outer iteration there is one less substring of the current length to worry about, but each such substring can be decomposed in more ways.
On the first iteration we find phrases covering strings of length 2, filling cells (0,1), (1,2), and (2,3). Each of these keys is of the form (s, s+1), and can be decomposed in just one way, into (s,s) and (s+1, s+1). For instance, cell (0,1) (corresponding to "tag men") can be filled with X if there is a rule X --> NP NP or X --> V NP. The latter pattern indeed exists, with X = S. Cell (1,2) (the phrase "men with") would require a rule X --> NP Prep, but there isn't one, so it stays empty. Continuing in this way, we fill the diagonal thus:
      0     1     2     3
 0  NP,V    S
 1        NP     -
 2              Prep   PP
 3                    NP
On the next outer-loop iteration, we fill cells (0,2) ("tag men with") and (1,3) ("men with telescopes"). There are two ways to decompose each of these. For instance, (1,3) is (1,1)+(2,3) and (1,2)+(3,3). The latter is useless, but the former yields NP. (0,2) cannot be analyzed.
      0     1     2     3
 0  NP,V    S     -
 1        NP     -     NP
 2              Prep   PP
 3                    NP
On the last outer iteration, we fill cell (0,3), the entire string. There are three different decompositions, (0,0)+(1,3), (0,1)+(2,3), and (0,2)+(3,3). The first and second both yield S, in two different ways. So we end up with this table:
      0     1     2     3
 0  NP,V    S     -     S
 1        NP     -     NP
 2              Prep   PP
 3                    NP
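As a sanity check, running the hypothetical cyk sketch on this example (with the grammar encoded as above, and the words lower-cased) should reproduce the table:

    val words = Vector("tag", "men", "with", "telescopes")
    val table = cyk(words, grammar)
    assert(table((0, 3)) == Set(Nonterminal("S")))
    assert(table((1, 3)) == Set(Nonterminal("NP")))
    assert(table((0, 2)).isEmpty)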
If you want to reconstruct the parse trees validating an S or other phrase found by the algorithm (a possibly large set; see below), then you've got to save a bit more information. We introduce a class
class Parse(val rule: Rule, val endFirst: Int) {}

An object of this class records a parse based on the given rule, whose first constituent ends at position endFirst.
Change the declaration of table to
val table = new mutable.HashMap[(Int, Int),  // key type
                                Set[Parse]   // value type
                               ];
Change the initialization to
for (s <- 0 to n-1) {
  val word = buffer(s);
  table((s,s)) = for (r <- rules of the form (N --> word))
                   yield new Parse(r, s)
}
And change the line table((s,t)) = table((s,t)) + r.lhs to

table((s,t)) = table((s,t)) + new Parse(r, q)
If table((s,t)) contains a Parse object whose rule is S --> N1 N2 and whose endFirst is q, then you can reconstruct the parse tree by looking in table((s,q)) for a Parse whose rule has the form N1 --> ..., and in table((q+1, t)) for one of the form N2 --> ..., and so on recursively until you reach a Parse object covering a span of length 1.
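Here is a sketch of that recursion, using the hypothetical Rule encoding from earlier plus a small Tree type of my own; treat it as one possible rendering:

    sealed trait Tree
    case class Leaf(cat: Nonterminal, word: Word) extends Tree
    case class Node(cat: Nonterminal, left: Tree, right: Tree) extends Tree

    // All parse trees for the span (s,t) rooted at nonterminal n.
    def trees(s: Int, t: Int, n: Nonterminal,
              table: scala.collection.mutable.HashMap[(Int, Int), Set[Parse]]): Set[Tree] =
      table((s, t)).filter(_.rule.lhs == n).flatMap { p =>
        p.rule match {
          case LexicalRule(lhs, word) =>
            Set(Leaf(lhs, word): Tree)      // span of length 1: recursion stops
          case BinaryRule(lhs, n1, n2) =>
            for (left  <- trees(s, p.endFirst, n1, table);
                 right <- trees(p.endFirst + 1, t, n2, table))
            yield Node(lhs, left, right): Tree
        }
      }

Beware, though, that this enumerates every tree for the span, and that set can be exponentially large.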
But, like underspecified representations, it is far from obvious what to do with the set of parse trees implied by the table. There are exponentially many of them. For example, Manning and Schütze (2000, Exercise 12.2) report that the sentence
"The agency sees widespread use of the codes as a way of handling the rapidly growing mail volume and controlling labor costs."
with a plausible little grammar, has 83 different parses. Most of them don't seem to make any sense, but for each construct in the grammar there are perfectly good sentences whose grammaticality depends on that construct being there.
In addition to the many different Ss at the top level, there can be many phrases that aren't part of any S. For example, with a comprehensive grammar of English, the CYK algorithm applied to this sentence:
"Desperate to do their evil overlords in oppressed villages will often rebel."idiom
finds "phrases" such as
which are not part of any parse of a grammatical S (that I can think of).
References:

Christopher Manning and Hinrich Schütze, 2000. Foundations of Statistical Natural Language Processing. MIT Press.
Note © Drew McDermott 2014
Note credit
A similar example is the Hindley-Milner algorithm. Milner
published a paper about this type-inference algorithm in 1978 in a
forum known to the programming-language community, which later became
aware of a 1969 paper in which the logician Hindley devised essentially
the same algorithm. These instances illustrate a variation on Stigler's Law, which states
that a scientific result is never named after its first discoverer.
In the cases of CYK and Hindley-Milner, an algorithm is named after
its first, second, and possibly third discoverer.
Note terminology
"Programming" is here used in the same sense as in "linear
programming"; it's operations-research jargon for "planning." The
word "dynamic" means "changing over time," and refers to the fact that
we're planning for a staged process: unknowns come in groups
corresponding to stages, and the unknowns at stage t are a
function of unknowns at stage t-1.
Note change
I've replaced the category VP used in class
with S for greater consistency.
Note idiom
For non-native English speakers: "doing someone in" means to
kill them, usually used semi-humorously when referring to fiction.