Programming Assignment #4

> > > > Programming Assignments > Assignment #4 - Cooccurrence Matrix

Objectives

to implement a map ADT using hash tables
to use maps
to implement algorithms efficiently

Introduction

The relationship between words can be described by relationships between vectors of real numbers used to represent those words. One simple way to represent a word as a vector is to compute how often it co-occurrs with other words – for each other word, how often it appears in the same context (perhaps sentence, paragraph, or document) as that other word.

For example, suppose we are interested in computing the co-occurrence matrix for the words ball, cold, play, sun, and wet. If our text is

the sun did not shine
it was too wet to play
so we sat in the house all that cold cold wet day
i sat there with sally
we sat there we two
and i said how i wish we had something to do
too wet to go out and too cold to play ball
so we sat in the house
we did nothing at all

The Cat in the Hat by Dr. Seuss
and by "context" we mean "sentence" (each sentence is on a separate line in the above text) then the resulting matrix is

[1.000000, 1.000000, 1.000000, 0.000000, 1.000000]
[0.500000, 1.000000, 0.500000, 0.000000, 1.000000]
[0.500000, 0.500000, 1.000000, 0.000000, 1.000000]
[0.000000, 0.000000, 0.000000, 1.000000, 0.000000]
[0.333333, 0.666667, 0.666667, 0.000000, 1.000000]

where each row gives the vector for one word. The last row then corresponds to "wet" and tells us that one third of the sentences that contained "wet" also contained "ball", two-thirds contained "cold", two-thirds contained "play", and none contained "sun" (and, of course, all contained "wet").

Explore further with Vector Representation of Words from TensorFlow.org or "King-Man+Woman=Queen: The Marvelous Mathematics of Computational Linguistics" in MIT Technology Review. And, for a similar idea in a different context, Dissecting Trump's Most Rabid Online Following from FiveThirtyEight.

Assignment

There are three parts to this assignment.

First, complete the implementation of the functions in smap.h to implement a map from strings to pointers to ints. The functions should run in the following time bounds:

smap_create: $O(1)$
smap_size: $O(1)$ (worst-case)
smap_contains_key: $O(1)$
smap_get: $O(1)$
smap_put: $O(1)$
smap_for_each: $O(n)$ (assuming f runs in $O(1)$ time)
smap_for_each_r: $O(n)$ (assuming f runs in $O(1)$ time)
smap_destroy: $O(n)$

Where, unless otherwise noted, the running times are expected running times under the assumption that there is a bound on the length of the words and that the distribution of keys and the hash function are such that the hash values of the keys are distributed randomly and uniformly across the range of an int (and that malloc, free, etc. run in $O(1)$ time).

Second, complete the implementation of the functions in cooccur.h. The functions should run in the following time bounds:

cooccur_create: $O(n^2)$ where $n$ is the number of keywords
cooccur_update: $O(k^2)$ where $k$ is the number of keywords in the context
cooccur_read_context: $O(m+c)$ where $m$ is the number of words in the context and $c$ is the number of characters read
cooccur_get_vector: $O(n)$ where $n$ is the number of keywords in the matrix
cooccur_destroy: $O(n)$ where $n$ is the number of keywords in the matrix

Where the running times are expected running times under the same assumptions as above for smap, except it is your responsibility to implement an appropriate hash function.

Finally, write a program called Cooccur that takes a non-empty list of distinct keywords as command-line arguments (keywords will be non-empty and will have no embedded whitespace), reads a text from standard input with one context per line and words in the text separated by one or more whitespace characters, and writes to standard output the cooccurrence matrix with the row for each keyword written in the order the keywords appeared on the command line and with the columns also in the order they keywords appeared on the command line. Each row should be output preceded by the corresponding keyword, a colon, and a single space. The values in each row should be enclosed in square brackets, separated by a comma and a single space, and output in the default format for a double (that is, using the "%lf" format specifier with printf). If a keyword does not appear in the text then its row should contain all zeros.

For example,

[jrg94@cobra Cooccur]$ ./Cooccur ball cold play sun wet < cat_in_hat.txt
ball: [1.000000, 1.000000, 1.000000, 0.000000, 1.000000]
cold: [0.500000, 1.000000, 0.500000, 0.000000, 1.000000]
play: [0.500000, 0.500000, 1.000000, 0.000000, 1.000000]
sun: [0.000000, 0.000000, 0.000000, 1.000000, 0.000000]
wet: [0.333333, 0.666667, 0.666667, 0.000000, 1.000000]

where cat_in_hat.txt contains the text given above.

Your program must use only $O(n^2)$ space, where $n$ is the number of keywords.

If the input is not as specified or testable preconditions to the ADT functions are violated then your program must behave gracefully – it must not crash or go into an infinite loop.

Your program and any program that uses your STRING-TO-INT-POINTER or COOCCURRENCE-MATRIX ADTs correctly must not produce any error messages when run with valgrind --tool=memcheck --leak-check=yes -q.

We reserve the right to deduct points from submissions for violations of the following guidelines.

All source code should be easily comprehended, with adequate and consistent spacing, concise but descriptive idenitifiers, comments before long sequences of code, and enumerations and constants favored over literals.
Submissions should be modularized, with collections of related functions divided into separate files.
Each function should be documented with a comment before its declaration. The comment should include 1) a description of the parameters including any limitations on the values allowed (preconditions), and 2) a description of what the function does and/or what it returns (postconditions).
Header files should be written so that they can be #included regardless of what other files have been #included -- if a declaration is required by your header files, then #include the appropriate header file in your header file (or write a forward declaration in your header) rather than relying on your #includers to have already #included those header files before they #include yours.
Things that make sense to be declared as const should be declared as const.
There should be no warnings when your code is compiled with the -Wall and -pedantic options.

Files

Files are available in /home/httpd/html/zoo/classes/cs223/f2017/Assignments/Cooccur:

smap.h (do not change)
smap.c (an implementation of smap using an unsorted fixed-size array; can be used as a starting point for your efficient implementation and for testing cooccurrence_matrix without a completed efficient implementation)
cooccur.h (do not change)
word_count.c (an application that uses smap and so can be used as a non-trivial but non-comprehensive test for smap)