Revision History | |
---|---|
Dec 18 | Added SyntaxTreeUtil.scala to uploaded-file list. |
Dec 27 | Made a major improvement in the definition of BackPointerTree. |
Due: December 21, 2016
All the files mentioned below may be obtained from Files > Problem-Set Software on Canvas. (Even if you have downloaded them before, do so again, because they have changed substantially. To be on the safe side, delete or rename the bin subdirectory you've been using [Note: bin] and create a new one.) Before using any function from one of these files, check the package declaration at the beginning. All of them now live in one package or another. (Once you start using packages at all, the anonymous package becomes inaccessible.) All packages have names of the form natlang.P, where P is util, earley, and the like. Note: To invoke an App named A in package natlang.P from the command line, you must type
sca natlang.P.A
or, if you prefer,
scala -cp bin natlang.P.A
Three packages in particular are relevant here:
Package | Description | Files |
---|---|---|
util | Utilities | Utilities.scala, LazyArray.scala |
cfg | Context-free grammar | CFG.scala, SyntaxTreeUtil.scala, RecursiveDescentGeneral.scala |
earley | Earley Parser | Earley.scala, Your answer to Q1 |
The current version of the parser in Earley.scala returns a value of type
(List[(LeafOrRule, Set[BackPointer])], StateSetArray)
(where LeafOrRule is an abbreviation for Either[CFGtreeLeaf, CFGrule] and StateSetArray is an abbreviation for LazyArray1[StateSet]). The second element of the pair returned by the parser is the array of all state sets used by the dynamic-programming algorithm. The first element is a list of pairs (r, bps), where r is (almost certainly) a CFGrule whose lhs is S, and bps is the set of backpointers constituting a "proof" that the words from 0 to L are an S according to this rule, where L is the length of the sentence (and the size of the StateSetArray minus 1). [Note: not quite certain]
A backpointer for a rule \(A \rightarrow B_1 B_2 \ldots B_K\) and the interval \([s,e]\) is a list \(\langle m_1, \ldots, m_{K-1}\rangle\), where \(s < m_1 < \ldots < m_{K-1} < e\), that records the parser's discovery that there is at least one \(B_1\) over the interval \([s,m_1]\) (maybe more than one), at least one \(B_2\) over \([m_1,m_2]\), \(\ldots\), and at least one \(B_K\) over \([m_{K-1}, e]\). Therefore the interval \([s,e]\) contains an \(A\). [Note: Intervals] For example, for the rule \(A \rightarrow B_1 B_2 B_3\) and the interval \([2,7]\), the backpointer \(\langle 4,5\rangle\) records that there is a \(B_1\) over \([2,4]\), a \(B_2\) over \([4,5]\), and a \(B_3\) over \([5,7]\).
Implement backPtrSetToSyntaxTrees, which transforms a set of backpointers into a stream of syntax trees. Its signature is
def backPtrSetToSyntaxTrees(bptrSet: Set[BackPointer], start: Int, end: Int, rule: LeafOrRule, states: StateSetArray): Stream[CFGsyntaxTree]
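For orientation, here is a rough sketch of how your function might eventually be driven from the parser's output. The parser entry-point name (parse) and the exact shape of the call are assumptions made for illustration only; check Earley.scala for the real names and argument lists.

```scala
import natlang.earley._
import natlang.cfg._

// A minimal driver sketch.  `parse` is a HYPOTHETICAL name for the parsing
// function in Earley.scala, assumed to return
// (List[(LeafOrRule, Set[BackPointer])], StateSetArray).
// (Loading the grammar, e.g. with grammarFromFile, is omitted here.)
object PS4Q1Demo extends App {
  val words = List("the", "dog", "barks")        // toy sentence
  val (results, states) = parse(words)           // assumed entry point

  for ((r, bps) <- results) {
    // Unfold each backpointer set into a lazily generated stream of trees
    val trees: Stream[CFGsyntaxTree] =
      backPtrSetToSyntaxTrees(bps, 0, words.length, r, states)
    // Pretty-print the first few, one tree node per line
    trees.take(3).foreach(t => println(syntaxTreeToHierString(t)))
  }
}
```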
You might find it easier to transform the backpointer sets into backpointer trees first, but it's not necessary. Don't overlook the following subroutines and utilities:
Name | File | Description |
---|---|---|
streamCartProd | Utilities.scala | A function that transforms a List[Stream[A]] into a Stream[List[A]] by finding all combinations of elements from each stream in the list. (A usage sketch of this function and streamsInterleave follows the table.) |
streamsInterleave | Utilities.scala | A function that takes a stream of streams \(S\) and produces a stream that includes all the elements of each stream in \(S\) (the union, in some sense, of all the streams in \(S\)). |
BackPointerTree | Earley.scala | An abstract class representing all the syntax trees that can be generated for a given terminal or nonterminal symbol \(S\) over a given interval. It has four case-class subclasses; see their definitions in Earley.scala. |
backPtrSetToBackPtrTree | Earley.scala | Given a set of backpointers, the interval they belong to, the LeafOrRule that covers that interval, and the StateSetArray produced by the dynamic-programming algorithm, produces a tree-structured representation of the set of syntax trees (type BackPointerTree). (The arguments to this function are the same as for the function you must write, backPtrSetToSyntaxTrees, so it's worth studying for that reason alone.) |
ParserState | Earley.scala | The class representing a dotted rule. The last argument of backPtrSetToSyntaxTrees is a StateSetArray, a LazyArray1 of sets of ParserStates. |
statesCovering | Earley.scala | Finds all ParserStates in a StateSetArray covering a given interval for a given terminal or nonterminal symbol. |
syntaxTreeToHierString | SyntaxTreeUtil.scala | Produces a string that is a "pretty-printed" representation of a CFGsyntaxTree, with one tree node per line, indented to show the tree structure. |
grammarFromFile | CFGlexicon.scala | Reads a grammar and lexicon in an S-expression format: ((terminals -names-) (nonterminals -names-) (startsym symname) (lexicon -entries-) (rules -sexps-)). See the comment before the definition for details. |
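To get a feel for the first two utilities, here is a small illustration of the behavior described above. The element values are made up, and whether the functions live in an object named Utilities (and their exact argument order) are assumptions to be checked against Utilities.scala.

```scala
import natlang.util.Utilities._   // assumed: both functions live here

object StreamUtilDemo extends App {
  // One stream of alternative subtrees per child of a rule (toy stand-ins):
  val perChild: List[Stream[String]] = List(Stream("NP1", "NP2"), Stream("VP1"))

  // streamCartProd: every way of choosing one element from each stream,
  // e.g. List(NP1, VP1), List(NP2, VP1)
  println(streamCartProd(perChild).toList)

  // streamsInterleave: merge a stream of streams into one stream containing
  // all their elements (useful even when some member streams are infinite)
  val merged = streamsInterleave(Stream(Stream(1, 2, 3), Stream(10, 20)))
  println(merged.take(5).toList)
}
```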
The Earley algorithm as documented makes the simplifying assumption that there are no rules of the form N → ε, which would indicate that nonterminal N can be rewritten as the empty string. How would the algorithm have to change if such rules were allowed? Mention at least two places, and be specific about the changes required.
In the HMM alignment model discussed by Jurafsky and Martin in ch. 25, sec. 25.5.2, pp. 883ff., several details are left rather vague. This question asks you to fill some of them in.
3a In the overall translation task based on HMM alignments (Jurafsky & Martin, pp. 883–885), what are the observations? What are the hidden states?
3b Jurafsky and Martin appeal to standard practice in HMM-based alignment research, which is to assume that the \(P(a_j \mid a_{j-1})\) distribution depends only on the jump width \(a_j - a_{j-1}\); their Equation 25.27 writes \(P(a_j \mid a_{j-1}, I)\) in terms of a function \(c\) of that jump width, normalized over the possible landing positions.
The \(c\) function is not given; it must be learned, and your problem is to learn it. We expect the alignment to send word \(j\) of the \(F\) sentence to around \(a_{j-1} + 1\) on the \(E\) side, since word \(j\) is 1 to the right of word \(j-1\). And, in the nature of things, we expect the \(c\) function to be roughly symmetric around, and peak at, the origin. (Words are most likely not to be shifted by much; and for every word shifted left another must be shifted right. [Note: Symmetry]) The \(c\) function is a sort of histogram; what matters is not the absolute values but the height of each column as a fraction of the sum of the heights of all the columns. There are some border effects that must be dealt with. Equation 25.27 as written assumes that the minimum shift is \(1-I\) to the left, the maximum shift is \(I-1\) to the right, and every shift in between is legal. (\(I\) is the sentence length in language \(E\), i.e., 20 in the data below.) But at the beginning the first legal transition is \(0 \rightarrow 0\) and at the end the last legal transition is \(23 \rightarrow 19\). A jump of \(I-1 = 19\) is the largest possible, but it can happen only at the beginning; a jump of \(1-I = -19\), only at the end; neither is very likely.
The usual resolution of this problem is to zero out the probabilities of illegal transitions; the remaining ones are renormalized to make sure that they add up to one.
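As a concrete illustration of that fix, here is a small sketch; representing \(c\) as a Map from jump width to probability is an assumption made for illustration, not anything in the provided code.

```scala
// Zero out jumps that would land outside a zero-based E sentence of length
// eLen when starting from position `from`, then renormalize the survivors
// so they sum to one.  (Assumes at least one legal jump has nonzero weight.)
def legalJumpDist(c: Map[Int, Double], from: Int, eLen: Int): Map[Int, Double] = {
  val legal = c.filter { case (jump, _) =>
    val to = from + jump
    to >= 0 && to < eLen
  }
  val total = legal.values.sum
  legal.map { case (jump, p) => jump -> p / total }
}
```

For instance, from position 19 in a 20-word \(E\) sentence only non-positive jumps survive, and their probabilities are rescaled to sum to one.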
Fortunately, we are in possession of some high-quality annotated data: 10 runs, agreed upon by three certified translators, from a language \(F\) that shall go unidentified into English (language \(E\)). Each run concerns a 24-word \(F\) sentence, and they all come out to be 20 words long in English. The following shows \(e_j\) for \(j\) from 0 to 23. For example, in run 1, \(F\) word 0 winds up in position 1. (The data use zero-based indexing for the word positions.) The numbers before the slashes are run numbers (starting at 1); they are not part of the data. [Note: Certified]
1/ 1 3 4 5 5 9 12 12 16 15 13 14 12 11 14 16 15 18 17 19 19 17 16 17
2/ 3 2 3 4 0 1 2 2 2 1 3 3 3 3 4 2 3 5 9 8 12 17 19 18
3/ 2 4 6 7 10 12 14 15 15 13 15 17 19 19 16 16 18 17 18 18 19 18 16 19
4/ 0 1 4 5 8 8 10 10 11 13 19 19 19 18 18 16 17 19 18 19 18 17 17 16
5/ 0 1 1 3 5 8 10 10 7 8 10 14 13 13 13 11 12 11 12 11 12 14 14 14
6/ 3 2 1 1 0 1 1 0 1 3 2 6 10 11 10 10 7 6 6 3 4 1 0 3
7/ 1 2 4 7 8 8 9 13 17 18 19 19 19 18 17 16 15 17 17 18 17 17 19 18
8/ 1 3 3 5 5 5 8 10 9 10 13 16 15 16 17 15 14 18 19 18 18 16 19 18
9/ 2 4 6 5 7 9 14 15 16 17 18 18 19 19 19 18 15 18 16 18 19 18 17 17
10/ 1 4 7 6 6 6 8 10 9 9 12 15 15 19 19 19 17 17 16 17 19 16 18 17
Because we know the values for \(e_j\), we can use an equation like equation (6.26) on page 188 of Jurafsky & Martin. We don't have to mess with the Baum-Welch algorithm or expectation maximization. (Hint: Equation 6.26 involves counting state transitions; what would we count in the data above to fill in the slot of the \(c\) array holding the probability of jumping by \(\Delta\)?)
Put your estimate of \(c\) in a file PS4Q3.data. If you used any programs to help perform the relevant counts, be sure to attach them. (You can use programming languages other than Scala, and I'll even count Python as a programming language, though it goes against my principles.)
Note bin
If you've tried to get along using your main code directory as
your class-file repository, then delete all the class files and the
natlang subdirectory if there is one. Then start using
a bin subdirectory to avoid all this nonsense. Using
the scc
and sca
scripts will make using
the bin directory effortless, but if you're using a
programming environment, set the CLASSPATH to include bin.
[Back]
Note not quite certain
It is possible that the first element is a CFGtreeLeaf,
meaning that L=1 and
in this grammar the startsym (perhaps not
actually S) is a terminal symbol. The parser should handle
this case. By the way, to be rigorous I should have said that the
first element of the pair is either Left(l),
where l is a leaf, or
Right(r), where r is a rule.
The Either type is not the union of two types. It's
Right(r) that's almost certain, at the top level.
[Back]
Note Intervals
The notation \([i,j]\) means the words between positions \(i\)
and \(j\), where a position is given as the number of words to its
left.
I.e., the interval covers words \(i+1\) through \(j\), assuming words are
numbered starting with 1. (If they are numbered starting with 0, as
is more common when writing programs where the numbers are in an
indexed structure such as an array, then \([i,j]\) covers words \(i\)
through \(j-1\).)
[Back]
Note Symmetry
This formulation ignores length effects. If the expected ratio of
the length of a sentence in language \(F\) to the length of its
translation into \(E\) is not 1.0, then there is a tug in one
direction or the other. But we'll neglect this fact.
[Back]
Note Certified
The data look somewhat weird; just how good are our "certified
translators"? Not good at all; the numbers were actually produced by
running a Markov model. It preserves locality pretty well —
perhaps too well. There are a lot of runs of the same
number, and too many \(E\) words are never accounted for. In
practice, some of these problems are overcome by the fact that
alternative hypotheses are competing, and other factors can choose
among alternatives that the HMM finds about equally likely.
[Back]