Project 3: Cooccurrence Matrices
Objectives
- to use a map ADT
- to implement a map ADTs using a hash table
- to implement balanced binary search trees
- to implement algorithms efficiently
Introduction
The relationship between words can be described by relationships between vectors of real numbers used to represent those words. One simple way to represent a word as a vector is to compute how often it co-occurrs with other words – for each other word, how may contexts ("context" could mean sentence, paragraph, or document) that contain it also contain that other word.
For example, suppose we are interested in computing the co-occurrence matrix for the words ball, cold, play, sun, and wet. If our text is
the sun did not shine it was too wet to play so we sat in the house all that cold cold wet day i sat there with sally we sat there we two and i said how i wish we had something to do too wet to go out and too cold to play ball so we sat in the house we did nothing at allThe Cat in the Hat by Dr. Seuss
and by "context" we mean "sentence" (each sentence is on a separate line in the above text) then the resulting matrix is
[1.000000, 1.000000, 1.000000, 0.000000, 1.000000] [0.500000, 1.000000, 0.500000, 0.000000, 1.000000] [0.500000, 0.500000, 1.000000, 0.000000, 1.000000] [0.000000, 0.000000, 0.000000, 1.000000, 0.000000] [0.333333, 0.666667, 0.666667, 0.000000, 1.000000]where each row gives the vector for one word. The last row then corresponds to "wet" and tells us that one third of the sentences that contained "wet" also contained "ball", two-thirds contained "cold", two-thirds contained "play", and none contained "sun" (and, of course, all contained "wet").
Explore further with Vector Representation of Words from TensorFlow.org or "King-Man+Woman=Queen: The Marvelous Mathematics of Computational Linguistics" in MIT Technology Review. And, for a similar idea in a different context, Dissecting Trump's Most Rabid Online Following from FiveThirtyEight.
Assignment
There are three parts to this assignment.
First, complete the implementation of the functions in /c/cs223/hw93/Required/cooccur.h
.
The functions should run in the following time bounds:
cooccur_create
: $O(n^2)$ where $n$ is the number of keywordscooccur_update
: $O(k^2)$ where $k$ is the number of keywords in the contextcooccur_read_context
: $O(m+c)$ where $m$ is the number of words in the context and $c$ is the number of characters readcooccur_get_vector
: $O(n)$ where $n$ is the number of keywords in the matrixcooccur_destroy
: $O(n)$ where $n$ is the number of keywords in the matrix
gmap
(meaning that for the purposes of determining
running times (and only for that purpose), assume that the length of
the keywords is constant), except it is your responsibility to implement
or choose
an appropriate hash function.
Second, write a program called Cooccur
that takes a
non-empty list of distinct keywords as command-line arguments,
reads a text from standard input, and prints to standard output
the coocurrence matrix for the keywords and text.
The keywords will be non-empty and will have
no embedded whitespace.
The text will contain
one context per line and the words in the text
will be separated by one or more whitespace characters. (So, for
both keywords and words in the text, "word" is any sequence of
non-whitespace characters, and so words may include, for example,
punctuation characters; we would probably apply a preprocessing step
to the text to eliminate punctuation if we were using this program
outside this class.)
The program should write to standard output the cooccurrence matrix
with the row for each keyword written in the order the keywords
appeared on the command line and with the columns also in the order
they keywords appeared on the command line. Each row should be output
preceded by the corresponding keyword, a colon, and a single space.
The values in each row should be enclosed in square brackets,
separated by a comma and a single space, and output in the default
format for a double
(that is, using
the "%lf"
format specifier with printf
). If
a keyword does not appear in the text then its row should contain all
zeros.
Finally, complete the implementation of the functions in /c/cs223/hw93/Required/gmap.h
to implement a map from generic keys to generic values so that
the functions run in the following time bounds:
gmap_create
: $O(1)$gmap_size
: $O(1)$ (worst-case)gmap_contains_key
: expected $O(1)$, worst-case or amortized $O(\log n)$gmap_get
: expected $O(1)$, worst-case or amortized $O(\log n)$gmap_put
: expected $O(1)$, amortized $O(\log n)$gmap_for_each
: $O(n)$ (assumingf
runs in $O(1)$ time)gmap_destroy
: $O(n)$
malloc
and free
run in $O(1)$ time. Expected running times are under the
assumption that the distribution of keys
and the hash function are such that the hash values of the keys
are distributed
randomly and uniformly across the range of an int
.
Your program must use only $O(n^2)$ space, where $n$ is the number of keywords.
Error Handling
If any of the input is invalid for any reason, then your program must behave gracefully (it must not crash or hang).
In all cases, your Cooccur
program must execute with no valgrind errors,
and your gmap
and
cooccur
modules must not introduce valgrind errors
into any program that uses those modules according to their specifications.
The behavior of your modules when the preconditions of the functions
are not met and when there are memory allocation errors is left
undefined and will not be tested.
Testing and Approach
There are unit tests for gmap
and cooccur
in /c/cs223/hw93/Required/gmap_unit.c
and
in /c/cs223/hw93/Required/cooccur_unit.c
from which
you must build GmapUnit
and CooccurUnit
executables. These unit tests are not intended to be complete, so
you should write your own tests as well.
Note that there is no bound on the length of keywords, words in the
input streams, or length of lines in the input streams, and that
standard input is not always seekable.
But since cooccur_read_context
checks whether a word is a keyword or not, the code in that
function need not ever store words of arbitrary length,
although it does need to be able to process
complete lines containing them. The sprintf
function
may be helpful here; see
the comp.lang.c FAQ
for the beginnings of one approach to reading input.
You may find it helpful to write validation functions for your data structures so you can detect violations of the invariants as soon as possible.
Example
[jrg94@cobra Cooccur]$ ./Cooccur ball cold play sun wet < cat_in_hat.txt ball: [1.000000, 1.000000, 1.000000, 0.000000, 1.000000] cold: [0.500000, 1.000000, 0.500000, 0.000000, 1.000000] play: [0.500000, 0.500000, 1.000000, 0.000000, 1.000000] sun: [0.000000, 0.000000, 0.000000, 1.000000, 0.000000] wet: [0.333333, 0.666667, 0.666667, 0.000000, 1.000000]where
cat_in_hat.txt
contains the text given above.
Submissions
Submit your makefile, your log file, and your source code necessary to build the executablesCooccur
, GmapUnit
,
and CooccurUnit
using your makefile
as assignment 93.
Your makefile should have targets for each executable and a default
target that builds all three (to achieve this, you can make the first rule
have target all
, prerequisites GmapUnit
,
CooccurUnit
,
and Cooccur
, and no commands; see Prof. Aspnes's notes for
an example, albeit with only one prerequisite for all
).
Your makefile should assume that the files from
/c/cs223/hw93/Required
and /c/cs223/hw93/Optional
have been copied into
the current directory but not that they are still in their original locations,
and you should not submit those files.