2014-11-21 Lec 35 Eisner's Algorithm

Drew McDermott© (Last revised 2014-12-15.)

Eisner's algorithm is a clever adaptation of the CYK/dynamic-programming approach to dependency parsing that avoids the extra dimension created by having to keep track of the head of a phrase covering the interval \(s,t\), a head that might lie anywhere in the interval. Eisner's idea was to eliminate that dimension by focusing attention on intervals \(s,t\) in which the head is either at \(s\) or at \(t\). The former we'll tag with the letter L (left-headed), the latter with the letter R (right-headed).

Like the other dynamic-programming algorithms, Eisner's keeps track of the best "parse" covering each interval. I put "parse" in quotes because the word usually implies analysis according to some sort of grammar, and all we're assuming is some penalty/reward value \(\lambda_{w,r,w'}\) assessed when \(w'\) occurs as a dependent of \(w\) labeled with relationship \(r\). Well, that's not quite all; we're also assuming that the only analyses we need to consider are projective. So by a parse at \(h\) over \(s,t\) we mean a list of positions in the interval \(s,t\) that constitute the children of the word at \(h\), together with the parses of each child. The best parse is the one whose \(\lambda\)s add up to the highest score. [Note Highest]
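To make the scoring concrete, here is a tiny sketch of my own (the words, relations, and numbers are all invented for illustration): a \(\lambda\) table represented as a Python dict, and the score of a candidate parse computed as the sum of the \(\lambda\)s on its arcs.

    # lam maps (head, relation, dependent) triples to a penalty/reward.
    lam = {
        ("bark", "nsubj", "dogs"): 5.0,
        ("ROOT", "root", "bark"): 8.0,
    }

    def tree_score(arcs, lam):
        """Score of a candidate parse, given its arcs as (head, rel, dependent) triples."""
        # Arcs not listed in lam are scored 0.0 here; that default is my
        # choice for this sketch, not part of the algorithm.
        return sum(lam.get(arc, 0.0) for arc in arcs)

    # The parse in which "bark" heads "dogs" and an artificial ROOT heads "bark":
    print(tree_score([("bark", "nsubj", "dogs"), ("ROOT", "root", "bark")], lam))  # 13.0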

Having made all that clear, we'll do the usual maneuver and focus on the scores, as if they're the primary goal, not a means to an end. So the algorithm is calculating the values in a hash table \(E\) whose keys are pairs \((s,t)\) representing intervals. The values can't just be numbers, but have to be records that keep track of which kind of interval we're considering, an L or an R. And, of course, that turns out to be too simple. There are four cases we need to consider:

  1. L: Head at \(s\):
     - LL: every word in the interval is a descendant of \(s\).
     - LR: every word in the interval is a descendant of \(s\), and the word at \(t\) is a child of \(s\).
  2. R: Head at \(t\):
     - RR: every word in the interval is a descendant of \(t\).
     - RL: every word in the interval is a descendant of \(t\), and the word at \(s\) is a child of \(t\).

It is a consequence of these definitions that the rightmost child of an LR has no dependents to the right, and the leftmost child of an RL has no dependents to the left. The astute reader will have noticed that there is overlap among these four categories. An LL parse in which the rightmost child of \(s\) has no dependents to the right [Note To right] is also (by definition) an LR parse. [Note LR terminology] However, we think of building every LL in two stages: First adjoin a new rightmost child by putting an LL and an RR together to make an LR (and add the \(\lambda_{w_s,w_t}\) "reward" at this point). [Note Arc label] Then take that child's right dependents into account by putting the new LR together with an old LL. (The word "then" is misleading, because that operation will take place when the dynamic-programming protocol says it's time, not necessarily right after the construction of the LR.)

The parse object must have four fields: .LL, .LR, .RR, .RL. Kübler et al. use a numerical scheme instead of thinking of table entries as records. And, of course, what I am calling a "hash table" they think of as an array with weird subscripts. Their data structure is then a four-dimensional array whose last two subscripts are 0 or 1, translated into my notation thus:

[1][0] → .LL, [0][0] → .RR, [1][1] → .LR, [0][1] → .RL

The first subscript may be decoded by letting 1=L and 0=R; but the second subscript is 0 if the two letters are the same, 1 if they are different. (Kübler et al. use triangles to diagram the case where that second subscript is 0, trapezoids for the case where it's 1.)
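For concreteness, here is one way (a sketch of mine, not code from Kübler et al.) to write the record and the subscript correspondence in Python:

    from dataclasses import dataclass

    @dataclass
    class Entry:
        """The record stored under key (s, t): four scores for the interval."""
        LL: float = 0.0   # Kübler et al.'s [1][0] ("triangle", head at s)
        LR: float = 0.0   # [1][1] ("trapezoid", head at s)
        RR: float = 0.0   # [0][0] ("triangle", head at t)
        RL: float = 0.0   # [0][1] ("trapezoid", head at t)

    # The same correspondence as a mapping from their (first, second) subscripts
    # to my field names:
    SUBSCRIPT_TO_FIELD = {(1, 0): "LL", (0, 0): "RR", (1, 1): "LR", (0, 1): "RL"}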

The initialization loop sets E((s,s)) (for s in the range 0 to \(n-1\), where \(n\) is the number of words) to a record with LL = LR = RR = RL = 0.0. The main loop then iterates like the CYK algorithm, taking length from 2 to \(n\), then taking s from 0 to \(n - \textit{length}\) and letting t = s + length - 1. In the inner loop E is updated as described by Kübler et al., but in the record notation the code looks like this:

\[
\begin{array}{rrcl}
7 & \texttt{E((s,t)).RL} &=& \max_{\texttt{s} \leq \texttt{q} < \texttt{t}} \bigl(\texttt{E((s,q)).LL} + \texttt{E((q+1,t)).RR} + \lambda_{w_t,w_s}\bigr) \\
8 & \texttt{E((s,t)).LR} &=& \max_{\texttt{s} \leq \texttt{q} < \texttt{t}} \bigl(\texttt{E((s,q)).LL} + \texttt{E((q+1,t)).RR} + \lambda_{w_s,w_t}\bigr) \\
9 & \texttt{E((s,t)).RR} &=& \max_{\texttt{s} \leq \texttt{q} < \texttt{t}} \bigl(\texttt{E((s,q)).RR} + \texttt{E((q,t)).RL}\bigr) \\
10 & \texttt{E((s,t)).LL} &=& \max_{\texttt{s} < \texttt{q} \leq \texttt{t}} \bigl(\texttt{E((s,q)).LR} + \texttt{E((q,t)).LL}\bigr) \\
\end{array}
\]
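Here is how the whole table-filling procedure might look in Python, combining the initialization described above with lines 7-10. This is a sketch of my own, not Kübler et al.'s code: E is a dict keyed by (s, t) pairs, and \(\lambda\) is assumed to be a dict keyed by (head position, dependent position) pairs, with the relation labels already folded in (see the note on arc labels) and missing arcs scored 0.0.

    from dataclasses import dataclass

    @dataclass
    class Entry:
        # Same record as in the earlier sketch, re-declared so this stands alone.
        LL: float = 0.0
        LR: float = 0.0
        RR: float = 0.0
        RL: float = 0.0

    def eisner_scores(n, lam):
        """Fill the table E for a sentence with positions 0 .. n-1.

        lam maps (head position, dependent position) to a score; arcs not
        listed get 0.0 (an arbitrary choice for this sketch)."""
        def arc(head, dep):
            return lam.get((head, dep), 0.0)

        # Initialization: every one-word interval scores 0.0 in all four fields.
        E = {(s, s): Entry() for s in range(n)}

        for length in range(2, n + 1):                 # length = 2 .. n
            for s in range(0, n - length + 1):         # s = 0 .. n - length
                t = s + length - 1
                e = E[(s, t)] = Entry()
                # Lines 7 and 8: the "trapezoids"; the new arc's lambda is added here.
                e.RL = max(E[(s, q)].LL + E[(q + 1, t)].RR + arc(t, s)
                           for q in range(s, t))
                e.LR = max(E[(s, q)].LL + E[(q + 1, t)].RR + arc(s, t)
                           for q in range(s, t))
                # Lines 9 and 10: the "triangles"; word q is shared by both pieces.
                e.RR = max(E[(s, q)].RR + E[(q, t)].RL
                           for q in range(s, t))          # s <= q < t
                e.LL = max(E[(s, q)].LR + E[(q, t)].LL
                           for q in range(s + 1, t + 1))  # s < q <= t
        return E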

In lines 7 and 8 the "trapezoids" are constructed; in lines 9 and 10, the "triangles." The only difference between 7 and 8 is which end is the head: in line 7 word \(t\) must be the head, and the score is augmented by \(\lambda_{w_t, w_s}\); in line 8 word \(s\) gets the honor and the \(\lambda\) for the opposite arc is awarded. Of course, the sums inside the max must be completely recomputed as a result, and a different \(q\) may come out ahead, changing the analysis completely. (The two \(\lambda\)s probably have nothing to do with each other.)

Note that in lines 9 and 10, word q is shared between the RR and RL (or LR and LL) being tied together, reflecting the fact that one of the pair brings the right dependents of word q, the other the left dependents. That's why the variable \(q\) can range all the way down to \(s\) in line 9 and all the way up to \(t\) in line 10: an RR (or LL) may have a left (or right) boundary that coincides with that of the RL (or LR) it was built out of, if its leftmost (or rightmost) child has no dependents to the left (or right).
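A tiny made-up example may help; all indices and scores below are invented. With three positions, position 0 acting as an artificial root, and \(\lambda\) rewarding the arcs 2→1 and 0→2, the best tree's score shows up in E((0,2)).LL when the eisner_scores sketch above is run. In this run the winning split for that final LL is in fact \(q = t\): the LL is just the LR over the same interval, because word 2, the rightmost child of position 0, has no dependents to the right.

    # Toy run of the eisner_scores sketch above (numbers invented).
    lam = {
        (2, 1): 5.0,   # word 2 heads word 1
        (0, 2): 8.0,   # position 0 (the artificial root) heads word 2
    }
    E = eisner_scores(3, lam)
    print(E[(0, 2)].LL)   # 13.0: best projective tree headed at position 0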

Notes

Note ©
2008, 2014 Drew McDermott

Note Highest
Although if you think of \(\lambda_{w,r,w'}\) as the negative log of a probability, as I tend to, then we're looking for the lowest total score.

Note LR terminology
My use of LR and RL in the notes for this lecture has nothing to do with LR and RL parsing in the usual sense.

Note To right
Remember that "no dependents to right" means "no such dependents in this interval." The algorithm might find some more to add when it considers wider intervals. [Back]

Note Arc label
What happened to the arc label? As explained by Kübler et al., because we're finding the maximum- or minimum-scoring projective tree, all we need to keep track of is the labeled arc between two words with the maximum or minimum value. Not surprisingly, this is called the unlabeled arc.
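In code, that collapsing step might look like the following sketch (names and scores are hypothetical): for each ordered pair of words, keep only the best-scoring label, and remember which label it was so it can be put back on the arcs of the winning tree.

    # lam_labeled maps (head, relation, dependent) triples to a score.
    lam_labeled = {
        ("bark", "nsubj", "dogs"): 5.0,
        ("bark", "obj", "dogs"): 2.0,
    }

    best_score = {}   # (head, dependent) -> best score over all labels
    best_label = {}   # (head, dependent) -> the label that achieved it
    for (head, rel, dep), score in lam_labeled.items():
        if (head, dep) not in best_score or score > best_score[(head, dep)]:
            best_score[(head, dep)] = score
            best_label[(head, dep)] = rel
    # The parser then runs on best_score; best_label restores the labels afterwards.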