The name "CYK" stands for "Cocke, Younger, Kasami." You might picture the team of Cocke, Younger, and Kasami working together in feverish excitement as they perfected their eponymous algorithm. In fact, Cocke's 1970 paper about the algorithm was predated by Younger's 1967 paper, which was predated by Kasami's 1965 paper.credit
The CYK algorithm is a simple application of dynamic programming to parsing. Dynamic programming [terminology] may be thought of as static memoization. A memoized function looks up its argument in a table, computing the result and adding it to the table if the argument isn't found. Rather than fill the table in this demand-driven way, you can do a preanalysis to figure out which entries are going to be needed in what order, and build the table in that order.
For example, if you define the Fibonacci function recursively, as

    fib(n) = if n ≤ 1 then 1 else fib(n-1) + fib(n-2)

a preanalysis would tell you that you're going to need all the values from 0 to n-1, which recovers the iterative algorithm from the recursive one.
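Here is a minimal Scala sketch of that recovered iterative algorithm (the table-filling style is the point; the name fib and the use of BigInt are my choices):

    // Bottom-up Fibonacci: the preanalysis says entries 0 through n-1
    // will be needed, so fill the table in index order instead of
    // recursing.
    def fib(n: Int): BigInt = {
      val table = Array.fill[BigInt](n + 1)(1)   // fib(0) = fib(1) = 1
      for (i <- 2 to n)
        table(i) = table(i - 1) + table(i - 2)
      table(n)
    }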
In the case of parsing, you can guess that to find all the ways of parsing a string of \(n\) words you might need all the ways of parsing each substring. Of course, these might turn out to be a superset of what you actually need, which is the sort of collateral damage that is routine in dynamic-programming applications.
We'll represent the table of partial parses using a hashtable, nowadays known as a hashmap or a hash, depending on what programming language you're using. We'll call it a HashMap in what follows.
We're given a string of n words, numbered starting at 0 (which might be an artificial Root "word"). We'll assume our grammar is in Chomsky Normal Form, which means that all rules are either of the form \(N \rightarrow N_1\,N_2\) or \(N \rightarrow \textit{Word}\). (Relaxing this assumption would not be difficult.)
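The pseudocode below leaves the grammar representation abstract. For concreteness, here is one possible encoding of CNF rules in Scala; the names Word, Nonterminal, Rule, BinaryRule, and LexicalRule are assumptions of mine, not anything fixed by the algorithm:

    type Word = String
    case class Nonterminal(name: String)

    sealed trait Rule { def lhs: Nonterminal }
    // N --> N1 N2
    case class BinaryRule(lhs: Nonterminal, rhs1: Nonterminal, rhs2: Nonterminal) extends Rule
    // N --> Word
    case class LexicalRule(lhs: Nonterminal, word: Word) extends Rule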
Using a sort of functional style, with symbols' types declared by putting a colon between the symbol and its type, and with type parameters in brackets, here's the algorithm:
def CYK(buffer: Vector[Word]): HashMap[(Int, Int), Set[Nonterminal]] = {
  // n is the number of words
  val n = buffer.length;
  // table((s,t)) = all nonterminals for which words s, s+1, ..., t
  //                are an example of that nonterminal
  val table = new mutable.HashMap[(Int, Int),       // key type
                                  Set[Nonterminal]  // value type
                                 ];
  // Initialize table:
  for (s <- 0 to n-1) {
    val word = buffer(s);
    table((s,s)) = set of all nonterminals N such that N --> word
  }
  // -- We've now filled the "main diagonal" of the "matrix"
  //    stored in the HashMap table.
  //    Main nested loops of the algorithm: the outer loop goes
  //    through the diagonals in increasing order of substring
  //    length.  The last diagonal consists of the single entry
  //    for (0, n-1). --
  for (length <- 2 to n)
    for (s <- 0 to (n - length)) {
      val t = s + length - 1;
      // We're going to fill table((s,t)) with all the nonterminals N
      // such that the substring from s to t is an N
      table((s,t)) = empty set;
      for (q  <- s to (t - 1);
           n1 <- table((s,q));
           n2 <- table((q+1, t));
           r  <- rules in the grammar whose right-hand side is n1 n2) {
        table((s,t)) = table((s,t)) + r.lhs
      }
    }
  return table
}
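The English fragments in that pseudocode ("set of all nonterminals...", "rules in the grammar...") depend on how the grammar is stored. Under the assumption above that a grammar is just a Set[Rule], one concrete rendering might look like this; treat it as a sketch, not the definitive implementation:

    import scala.collection.mutable

    def cyk(buffer: Vector[Word], grammar: Set[Rule]): mutable.HashMap[(Int, Int), Set[Nonterminal]] = {
      val n = buffer.length
      val table = new mutable.HashMap[(Int, Int), Set[Nonterminal]]
      // Main diagonal: one cell per word.
      for (s <- 0 until n)
        table((s, s)) = grammar.collect {
          case LexicalRule(lhs, w) if w == buffer(s) => lhs
        }
      // Later diagonals, in increasing order of substring length.
      for (length <- 2 to n; s <- 0 to n - length) {
        val t = s + length - 1
        table((s, t)) =
          (for (q <- s until t) yield
             grammar.collect {
               case BinaryRule(lhs, n1, n2)
                   if table((s, q)).contains(n1) && table((q + 1, t)).contains(n2) => lhs
             }).flatten.toSet
      }
      table
    }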
The table can be thought of as an upper-triangular matrix, filled by starting with the main diagonal and moving northeast. When cell (0, n-1) is filled, you are done. If it contains the symbol S, the string can be generated starting from the S symbol, and so is a sentence.
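Using the cyk sketch above, that test is a single lookup (assuming the start symbol is named "S"):

    val table = cyk(words, grammar)
    val isSentence = table((0, words.length - 1)).contains(Nonterminal("S"))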
Simple example: We'll take a somewhat simple-minded grammar, [change] so that some short sentences will belong to the language —
S --> V NP V --> "hit" Prep --> "with" S --> S PP V --> "tag" Prep --> "on" NP --> NP PP NP --> "tag" NP --> "telescopes" PP --> Prep NP NP --> "men"
Here's an example of the algorithm in action:
Sentence: "Tag men with telescopes"
The matrix is 4x4, but only the northeast half is used. (Hence the use of a HashMap to keep track of it rather than an array, for which we'd have to play distracting games with subscripts.)
Speaking of distracting games with subscripts, doing 1-based indexing is just painful, so I will assume words are numbered 0 to n-1. Hence our hash table has keys from (0,0) to (3,3). In general a key is of the form (s,t), which denotes the string from s to t inclusive (so t ≥ s).
The initialization loop populates cell (s,s) with nonterminals N such that the word W at position s is the RHS of a rule N --> W.
After initialization, the matrix looks like this:
      0     1     2     3
 0  NP,V
 1        NP
 2              Prep
 3                    NP
(If a word corresponds to no such rule, an exception must be thrown.)
Now we enter the nested loop that is the main part of the algorithm. The outer loop fills the entries in the table for longer and longer substrings. The inner loop looks at all the decompositions of the string (s, s + length - 1). On each outer iteration there is one less substring of the current length to worry about, but each such substring can be decomposed in more ways.
On the first iteration we find phrases covering strings of length 2, filling cells (0,1), (1,2), and (2,3). Each of these keys is of the form (s, s+1), and can be decomposed in just one way, into (s,s) and (s+1, s+1). For instance, cell (0,1) (corresponding to "tag men") can be filled with X if there is a rule X --> NP NP or X --> V NP. The latter pattern indeed exists, with X = S. Cell (1,2) (the phrase "men with") would require a rule X --> NP Prep, but there isn't one, so it stays empty. Continuing in this way, we fill the diagonal thus:
      0     1     2     3
 0  NP,V    S
 1        NP     -
 2              Prep   PP
 3                    NP
On the next outer-loop iteration, we fill cells (0,2) ("tag men with") and (1,3) ("men with telescopes"). There are two ways to decompose each of these. For instance, (1,3) is (1,1)+(2,3) and (1,2)+(3,3). The latter is useless, but the former yields NP. (0,2) cannot be analyzed.
      0     1     2     3
 0  NP,V    S     -
 1        NP     -     NP
 2              Prep   PP
 3                    NP
On the last outer iteration, we fill cell (0,3), the entire string. There are three different decompositions, (0,0)+(1,3), (0,1)+(2,3), and (0,2)+(3,3). The first and second both yield S, in two different ways. So we end up with this table:
      0     1     2     3
 0  NP,V    S     -     S
 1        NP     -     NP
 2              Prep   PP
 3                    NP
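As a sanity check, running the hypothetical cyk sketch on this example (with the grammar encoded as above, and the words lower-cased) should reproduce the table:

    val words = Vector("tag", "men", "with", "telescopes")
    val table = cyk(words, grammar)
    assert(table((0, 3)) == Set(Nonterminal("S")))
    assert(table((1, 3)) == Set(Nonterminal("NP")))
    assert(table((0, 2)).isEmpty)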
If you want to reconstruct the parse trees validating an S or other phrase found by the algorithm (a possibly large set; see below), then you've got to save a bit more information. We introduce a class
class Parse(val rule: Rule, val endFirst: Int) {}

An object of this class records a parse based on the given rule, whose first constituent ends at position endFirst.
Change the declaration of table to
val table = new mutable.HashMap[(Int, Int),  // key type
                                Set[Parse]   // value type
                               ];
Change the initialization to
for (s <- 0 to n-1) {
  val word = buffer(s);
  table((s,s)) = for (r <- rules of the form (N --> word))
                   yield new Parse(r, s)
}
And change the line table((s,t)) = table((s,t)) + r.lhs to

table((s,t)) = table((s,t)) + new Parse(r, q)
If table((s,t)) contains a Parse object whose rule is S --> N1 N2 and whose endFirst is q, then you can reconstruct the parse tree by looking in table((s,q)) for a Parse whose rule has the form N1 --> ..., and in table((q+1, t)) for one of the form N2 --> ..., and so on recursively until you reach a Parse object covering a span of length 1.
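Here is a sketch of that recursion, using the hypothetical Rule encoding from earlier plus a small Tree type of my own; treat it as one possible rendering:

    sealed trait Tree
    case class Leaf(cat: Nonterminal, word: Word) extends Tree
    case class Node(cat: Nonterminal, left: Tree, right: Tree) extends Tree

    // All parse trees for the span (s,t) rooted at nonterminal n.
    def trees(s: Int, t: Int, n: Nonterminal,
              table: scala.collection.mutable.HashMap[(Int, Int), Set[Parse]]): Set[Tree] =
      table((s, t)).filter(_.rule.lhs == n).flatMap { p =>
        p.rule match {
          case LexicalRule(lhs, word) =>
            Set(Leaf(lhs, word): Tree)      // span of length 1: recursion stops
          case BinaryRule(lhs, n1, n2) =>
            for (left  <- trees(s, p.endFirst, n1, table);
                 right <- trees(p.endFirst + 1, t, n2, table))
            yield Node(lhs, left, right): Tree
        }
      }

Beware, though, that this enumerates every tree for the span, and that set can be exponentially large.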
But, like underspecified representations, it is far from obvious what to do with the set of parse trees implied by the table. There are exponentially many of them. For example, Manning and Schütze (2000, Exercise 12.2) report that the sentence
"The agency sees widespread use of the codes as a way of handling the rapidly growing mail volume and controlling labor costs."
with a plausible little grammar, has 83 different parses. Most of them don't seem to make any sense, but for each construct in the grammar there are perfectly good sentences whose grammaticality depends on that construct being there.
In addition to the many different Ss at the top level, there can be many phrases that aren't part of any S. For example, with a comprehensive grammar of English, the CYK algorithm applied to this sentence:
"Desperate to do their evil overlords in oppressed villages will often rebel."idiom
finds "phrases" such as
which are not part of any parse of a grammatical S (that I can think of).
References:

Christopher Manning and Hinrich Schütze, 2000. Foundations of Statistical Natural Language Processing. MIT Press.
Note © Drew McDermott 2014
Note credit
A similar example is the Hindley-Milner algorithm. Milner
published a paper about this type-inference algorithm in 1978 in a
forum known to the programming-language community, which later became
aware of a 1969 paper in which the logician Hindley devised essentially
the same algorithm. These instances illustrate a variation on Stigler's Law, which states
that a scientific result is never named after its first discoverer.
In the cases of CYK and Hindley-Milner, an algorithm is named after
its first, second, and possibly third discoverer.
Note terminology
"Programming" is here used in the same sense as in "linear
programming"; it's operations-research jargon for "planning." The
word "dynamic" means "changing over time," and refers to the fact that
we're planning for a staged process: unknowns come in groups
corresponding to stages, and the unknowns at stage t are a
function of unknowns at stage t-1.
Note change
I've replaced the category VP used in class
with S for greater consistency.
Note idiom
For non-native English speakers: "doing someone in" means to
kill them, usually used semi-humorously when referring to fiction.