CS470/570 Problem Set 4 (Take-Home Final):
Earley Parser and Statistical Parsing

Revision History
Dec 18: Added SyntaxTreeUtil.scala to uploaded-file list.
Dec 27: Made a major improvement in the definition of BackPointerTree.

Due: December 21, 2016


All the files mentioned below may be obtained from Files > Problem-Set Software on Canvas. (Even if you have downloaded them before, do so again, because they have changed substantially.) To be on the safe side, delete or rename the bin subdirectory you've been using[bin] and create a new one.

Before using any function from one of these files, check the package declaration at the beginning. All of them now live in one package or another. (Once you start using packages at all, the anonymous package becomes inaccessible.) All packages have names of the form natlang.P, where P is util, earley, and the like. Note: To invoke an App named A in package natlang.P from the command line, you must type

sca natlang.P.A

or, if you prefer,

scala -cp bin natlang.P.A

Three packages in particular are relevant here:

Package   Description            Files
util      Utilities              Utilities.scala, LazyArray.scala
cfg       Context-free grammar   CFG.scala, SyntaxTreeUtil.scala, RecursiveDescentGeneral.scala
earley    Earley parser          Earley.scala, your answer to Q1

The current version of the parser in Earley.scala returns a value of type

    (List[(LeafOrRule, Set[BackPointer])], 
     StateSetArray)

where LeafOrRule is an abbreviation for Either[CFGtreeLeaf, CFGrule] and StateSetArray is an abbreviation for LazyArray1[StateSet]. The second element of the pair returned by the parser is the array of all state sets used by the dynamic-programming algorithm. The first element is a list of pairs (r, bps), where r is (almost certainly) a CFGrule whose lhs is S, and bps is the set of backpointers constituting a "proof" that the words from 0 to L are an S according to this rule, where L is the length of the sentence (and the size of the StateSetArray minus 1).[not quite certain]

A backpointer for a rule \(A \rightarrow B_1 B_2 \ldots B_K\) and the interval \([s,e]\) is a list \(\langle m_1, \ldots, m_{K-1}\rangle\), where \(s < m_1 < \ldots < m_{K-1} < e\). It records the parser's discovery that there is at least one \(B_1\) over the interval \([s,m_1]\) (maybe more than one), at least one \(B_2\) over \([m_1,m_2]\), \(\ldots\), and at least one \(B_K\) over \([m_{K-1}, e]\); therefore the interval \([s,e]\) contains an \(A\).[Intervals]
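Concretely, a backpointer is just the list of internal split points of the interval. As an illustrative sketch (in Python for brevity, though the assignment code is Scala; the function name is made up), the \(K\) subintervals covered by the right-hand-side symbols can be recovered like this:

```python
def backpointer_subintervals(s, e, bptr):
    """Given interval endpoints s, e and a backpointer [m1, ..., m_{K-1}],
    return the K subintervals [(s, m1), (m1, m2), ..., (m_{K-1}, e)]
    covered by the K right-hand-side symbols of the rule."""
    bounds = [s] + list(bptr) + [e]
    return list(zip(bounds, bounds[1:]))

# For a rule A -> B1 B2 B3 over [0, 7] with backpointer [2, 5]:
# B1 covers (0, 2), B2 covers (2, 5), B3 covers (5, 7).
```

An empty backpointer (a one-symbol right-hand side) yields the single subinterval \([s,e]\) itself.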

1 Implement backPtrSetToSyntaxTrees, which transforms a set of backpointers into a stream of syntax trees. Its signature is

  def backPtrSetToSyntaxTrees(bptrSet: Set[BackPointer],
                              start: Int, end: Int, 
                              rule: LeafOrRule,
                              states: StateSetArray):
      Stream[CFGsyntaxTree] 

You might find it easier to transform the backpointer sets into backpointer trees first, but it's not necessary. Don't overlook the following subroutines and utilities:

Useful Utility Functions: In each case the function lives in the object with the same name as the file.
Name File Description
streamCartProd Utilities.scala A function that transforms a List[Stream[A]] into a Stream[List[A]] by finding all combinations of elements from each stream in the list.
streamsInterleave Utilities.scala A function that takes a stream of streams \(S\) and produces a stream that includes all the elements of each stream in \(S\) (the union, in some sense, of all the streams in \(S\)).
BackPointerTree Earley.scala An abstract class representing all the syntax trees that can be generated for a given terminal or nonterminal symbol \(S\) over a given interval. It has four case-class subclasses:
  1. BackPointerCrossProd: For the usual case of a CFGrule covering \(n\) subintervals, \(v_0, \ldots, v_{n-1}\). Every way of producing each \(v_i\) must be combined with every way of producing each \(v_j, j \neq i\) because the grammar is context-free.
  2. BackPointerNonterm: When the same rule can be applied to the overall interval by dividing it into different subintervals, we can just take the union of all the trees derived from these different divisions.
  3. BackPointerTreeSet: To capture an occasion when there are multiple trees covering the same interval, all sharing the same topmost symbol. For example, there might be 3 VPs over the interval [3,11]. This corresponds to a BackPointerTreeSet(3, 11, ntVP, subs), where ntVP has as its value the nonterminal symbol VP, and subs is a list of 3 subtrees, all having in common only that they describe some sort of verb phrase.
  4. BackPointerLeaf: The base case, where a terminal symbol covers a single word of the input.
backPtrSetToBackPtrTree Earley.scala Given a set of backpointers, the interval they belong to, the LeafOrRule that covers that interval, and the StateSetArray produced by the dynamic-programming algorithm, produces a tree-structured representation of the set of syntax trees (type BackPointerTree). (The arguments to this function are the same as for the function you must write, backPtrSetToSyntaxTrees, so it's worth studying for that reason alone.)
ParserState Earley.scala The class representing a dotted rule. The last argument of backPtrSetToSyntaxTrees is a StateSetArray, a LazyArray1 of sets of ParserState.
statesCovering Earley.scala Finds all ParserStates in a StateSetArray covering a given interval for given terminal or nonterminal symbol.
syntaxTreeToHierString SyntaxTreeUtil.scala Produces a string that is a "pretty-printed" representation of a CFGsyntaxTree, with one tree node per line, indented to show the tree structure.
grammarFromFile CFGlexicon.scala Reads a grammar and lexicon in an S-expression format:
((terminals -names-)
 (nonterminals -names-)
 (startsym symname)
 (lexicon -entries-)
 (rules -sexps-))
        
See the comment before the definition for details.
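To make the contracts of streamCartProd and streamsInterleave concrete, here are rough Python analogues (illustrative only; the real Scala versions operate on lazy Streams, and a fully lazy cartesian product over infinite streams requires dovetailing, which this finite sketch ignores):

```python
from itertools import product

def stream_cart_prod(streams):
    """Rough analogue of streamCartProd (Utilities.scala): turn a list of
    streams into a stream of lists, one per combination of elements."""
    for combo in product(*streams):
        yield list(combo)

def streams_interleave(streams):
    """Rough analogue of streamsInterleave: yield every element of every
    stream, round-robin, so no single stream starves the others."""
    iters = [iter(s) for s in streams]
    while iters:
        alive = []
        for it in iters:
            try:
                yield next(it)
                alive.append(it)     # keep streams that still have elements
            except StopIteration:
                pass                 # drop exhausted streams
        iters = alive
```

For instance, stream_cart_prod([[1, 2], [3, 4]]) yields [1, 3], [1, 4], [2, 3], [2, 4] in some order, and streams_interleave([[1, 2], [3], [4, 5, 6]]) yields all six elements. The fair interleaving matters in Q1: some backpointer sets describe infinitely many trees, and a naive concatenation of sub-streams would never reach the later ones.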

2 The Earley algorithm as documented makes the simplifying assumption that there are no rules of the form N → ε, which would indicate that nonterminal N can be rewritten as the empty string. How would the algorithm have to change if such rules were allowed? Mention at least two places, and be specific about the changes required.

3 In the HMM alignment model discussed by Jurafsky and Martin in ch. 25, sec. 25.5.2, pp. 883ff., several details are left rather vague. This question asks you to fill some of them in.

3a In the overall translation task based on HMM alignments (Jurafsky & Martin, pp. 883–885), what are the observations? What are the hidden states?

3b Jurafsky and Martin appeal to standard practice in HMM-based alignment research, which is to assume that the \(a_j|a_{j-1}\) distribution depends only on the difference between \(a_j\) and \(a_{j-1}\). That is, there is a function \(c: [1-I, I-1] \rightarrow [0,1]\) from the integers between \(1-I\) and \(I-1\), inclusive, such that \(c(\Delta)\) is proportional to the probability that a word gets shifted by \(\Delta\):

$$ P(a_j = a_{j-1} + 1 + \Delta) = \frac{c(\Delta)}{\sum_{d=1-I}^{I-1} c(d)} \mbox{(Eq. 25.27)}$$

The \(c\) function is not given; it must be learned, and your problem is to learn it. The "1" on the left side of the equation is there because word \(j\) in the \(F\) sentence is 1 to the right of word \(j-1\), so we expect the alignment to send it to \(a_{j-1} + 1\) on the \(E\) side. And, in the nature of things, we expect the \(c\) function to be roughly symmetric around, and peak at, the origin. (Words are most likely not to be shifted by much; and for every word shifted left, another must be shifted right.[Symmetry]) The \(c\) function is a sort of histogram; what matters is not the absolute values but the height of each column as a fraction of the sum of the heights of all the columns.

There are some border effects that must be dealt with. Equation 25.27 as written assumes that the minimum shift is \(1-I\) to the left, the maximum shift is \(I-1\) to the right, and every shift in between is legal. (\(I\) is the sentence length in language \(E\), i.e., 20 in the data below.) But at the beginning the first legal transition is \(0\rightarrow 0\), and at the end the last legal transition is \(23\rightarrow 19\). A jump of \(I-1 = 19\) is the largest possible, but can happen only at the beginning; a jump of \(1-I = -19\), only at the end; neither is very likely.

The usual resolution of this problem is to zero out the probabilities of illegal transitions; the remaining ones are renormalized to make sure that they add up to one.
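As a minimal sketch of that renormalization (Python for illustration; the histogram values and the function name are hypothetical, not from the assignment):

```python
def transition_probs(c, delta_lo, delta_hi):
    """Zero out illegal jumps and renormalize the rest.  c maps each jump
    Delta to its (unnormalized) histogram height; [delta_lo, delta_hi] is
    the range of jumps that are actually legal from the current position."""
    legal = {d: c.get(d, 0.0) for d in range(delta_lo, delta_hi + 1)}
    total = sum(legal.values())
    return {d: v / total for d, v in legal.items()}

# Hypothetical histogram: a jump of 0 is common, +/-1 less so.
c = {-1: 2.0, 0: 6.0, 1: 2.0}
# From a position where only jumps in [0, 1] are legal:
# transition_probs(c, 0, 1) -> {0: 0.75, 1: 0.25}
```

Note that the jump of \(-1\) is simply excluded from the sum, so the two legal jumps share all the probability mass.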

Fortunately, we are in possession of some high-quality annotated data: 10 runs, agreed upon by three certified translators, from a language \(F\) that shall go unidentified into English (language \(E\)). Each run concerns a 24-word \(F\) sentence, and they all come out to be 20 words long in English. The following shows \(e_j\) for \(j\) from 0 to 23. For example, in run 1, \(F\) word 0 winds up in position 1. (The data use zero-based indexing for the word positions.) The numbers before the slashes are run numbers (starting at 1); they are not part of the data.[Certified]

 1/ 1  3  4  5  5  9 12 12 16 15 13 14 12 11 14 16 15 18 17 19 19 17 16 17 

 2/ 3  2  3  4  0  1  2  2  2  1  3  3  3  3  4  2  3  5  9  8 12 17 19 18 

 3/ 2  4  6  7 10 12 14 15 15 13 15 17 19 19 16 16 18 17 18 18 19 18 16 19 

 4/ 0  1  4  5  8  8 10 10 11 13 19 19 19 18 18 16 17 19 18 19 18 17 17 16 

 5/ 0  1  1  3  5  8 10 10  7  8 10 14 13 13 13 11 12 11 12 11 12 14 14 14 

 6/ 3  2  1  1  0  1  1  0  1  3  2  6 10 11 10 10  7  6  6  3  4  1  0  3 

 7/ 1  2  4  7  8  8  9 13 17 18 19 19 19 18 17 16 15 17 17 18 17 17 19 18 

 8/ 1  3  3  5  5  5  8 10  9 10 13 16 15 16 17 15 14 18 19 18 18 16 19 18 

 9/ 2  4  6  5  7  9 14 15 16 17 18 18 19 19 19 18 15 18 16 18 19 18 17 17 

10/ 1  4  7  6  6  6  8 10  9  9 12 15 15 19 19 19 17 17 16 17 19 16 18 17 

Because we know the values for \(e_j\), we can use an equation like equation (6.26) on page 188 of Jurafsky & Martin. We don't have to mess with the Baum-Welch algorithm or expectation maximization. (Hint: Equation 6.26 involves counting state transitions; what would we count in the data above to fill in the slot of the \(c\) array holding the probability of jumping by \(\Delta\)?)
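To make the hint concrete, here is what "counting state transitions" looks like on a toy run (Python for illustration; the toy alignment values are made up, and doing this over the ten real runs is left to you):

```python
from collections import Counter

def jump_counts(alignments):
    """Count Delta = a_j - (a_{j-1} + 1) over one run of alignment values:
    how far each word landed from the 'no shift' position one to the right
    of its predecessor's target (cf. Eq. 25.27)."""
    return Counter(a_j - a_prev - 1
                   for a_prev, a_j in zip(alignments, alignments[1:]))

# Toy run (not the assignment data): alignments 0, 1, 1, 3 give
# jumps 1-0-1 = 0, 1-1-1 = -1, and 3-1-1 = 1.
```

Pooling such counts over all runs and normalizing them, in the manner of Equation 6.26, gives the estimate of \(c\).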

Put your estimate of \(c\) in a file PS4Q3.data. If you used any programs to help perform the relevant counts, be sure to attach them. (You can use programming languages other than Scala, and I'll even count Python as a programming language, though it goes against my principles.)

Endnotes

Note bin
If you've tried to get along using your main code directory as your class-file repository, then delete all the class files and the natlang subdirectory if there is one. Then start using a bin subdirectory to avoid all this nonsense. Using the scc and sca scripts will make using the bin directory effortless, but if you're using a programming environment, set the CLASSPATH to include bin. [Back]

Note not quite certain
It is possible that the first element is a CFGtreeLeaf, meaning that L=1 and in this grammar the startsym (perhaps not actually S) is a terminal symbol. The parser should handle this case. By the way, to be rigorous I should have said that the first element of the pair is either Left(l), where l is a leaf, or Right(r), where r is a rule. The Either type is not the union of two types. It's Right(r) that's almost certain, at the top level. [Back]

Note Intervals
The notation \([i,j]\) means the words between positions \(i\) and \(j\), where a position is given as the number of words to its left. I.e., the interval covers words \(i+1\) through \(j\), assuming words are numbered starting with 1. (If they are numbered starting with 0, as is more common when writing programs where the numbers are in an indexed structure such as an array, then \([i,j]\) covers words \(i\) through \(j-1\).) [Back]

Note Symmetry
This formulation ignores length effects. If the expected ratio of the length of a sentence in language \(F\) to the length of its translation into \(E\) is \(\neq 1.0\), then there is a tug in one direction or the other. But we'll neglect this fact. [Back]

Note Certified
The data look somewhat weird; just how good are our "certified translators"? Not good at all; the numbers were actually produced by running a Markov model. It preserves locality pretty well — perhaps too well. There are a lot of runs of the same number, and too many \(E\) words are never accounted for. In practice, some of these problems are overcome by the fact that alternative hypotheses are competing, and other factors can choose among alternatives that the HMM finds about equally likely. [Back]