P R E L I M I N A R Y    S P E C I F I C A T I O N


                                          Due 2:00 AM, Friday, 9 December 2016

CS-223   Homework #7    Words - breaking up is hard to do

REMINDER: Do not under any circumstances copy another student's code
or give a copy of your code to another student.  After discussing the
assignment with another student, you may not take any written or
electronic record away.  Moreover, you must engage in a full hour of
mind-numbing activity before you work on it again.  Such discussions
must be noted in your log file.

breakingupishardtodo

We can read the above text, even though it is missing word boundaries.

breaking up is hard to do

Your job in this assignment is to solve the string segmentation problem.

For this assignment, you will create a program:

    Words [-dict filename] [-debug]

that reads standard input and writes to standard output.  It first
loads a word file (or dictionary) into a hash table, as you did in
hw3.  In fact, we will give you the hash table code from hw3 (hash.c,
hash.h, and Makefile). The word file will have one word per line.
When you load the words into the hash table, you will convert them to
lower case and ignore any words containing non-alphabetic characters,
such as digits or punctuation.

The default word file is simply "words" in the current directory.
This should make it simpler to test your code without having to load
400,000 words from the online dictionary every time you run your code.
However, your program should be able to process the online dictionary,
if given as a command line argument.

All error output (usage and "Fatal Error" messages below) should be
printed to standard error. For example,

   fprintf(stderr, "usage ...");
   
Note that not all errors are fatal.  All other output should be
printed to standard output.  (Use normal printf.)

There are many ways to approach this problem.  For starters, you might
try to see if you can break the string into two words.  You simply
divide the string into a left and right substring, and check to see if
both left and right parts are in the dictionary.  We will call this
the two word solution.  This is pretty efficient, but also limited.

Another approach is depth first search.  You find the first prefix
that is a word, and then recursively try to segment the remainder of
the string.  This is a general solution, but not very efficient.  For
long strings, you will like perform the same calculation repeatedly.

That should ring a bell in your head: dynamic programming.  You want
your program to avoid repetitive calculations.  Here is video that
demonstrates one way to do this.

  https://www.youtube.com/watch?v=WepWFGxiwRs

The current solution, Wordsx, implements both the two word and the DP
approaches.  

Below is some sample output.

% cat words
I
a
am
ace
mace
ma
mama
for
ever
forever
car
rot
carrot
% ./Words
mama
Two words: SUCCESS:  ma ma.
DP: SUCCESS:
mama
---
mamama
Two words: SUCCESS:  ma mama. mama ma.
DP: SUCCESS:
ma mama
---
forevercarrot
Two words: SUCCESS:  forever carrot.
DP: SUCCESS:
for ever carrot
---
Iamace
Two words: FAILURE
DP: SUCCESS:
i a mace
---
xxxx
Two words: FAILURE
DP: FAILURE
---
notindictionary
Two words: FAILURE
DP: FAILURE
---

Note that at this point, Words prints all possible two word
segmentations, but not all possible segmentations. 

Here is output with the debug flag engaged, using the example from the
video.

% ./Words -debug
Loading dictionary: words
Word Count: 13
Iamace
Input: Iamace
Two words: FAILURE

0 1 1 1 -1 1 
-1 1 1 2 -1 2 
-1 -1 -1 2 -1 2 
-1 -1 -1 3 -1 3 
-1 -1 -1 -1 -1 -1 
-1 -1 -1 -1 -1 -1 
DP: SUCCESS:
i a mace
---

It prints out the matrix.  Note that the matrix is slightly different
from the video in part because "mace" is not in his dictionary.


Words should:

* Have no memory leaks.  You will need to use dynamic memory
  allocation, e.g., malloc().  You want to make sure that you free up
  memory before termination.  Use valgrind to detect any memory
  problems.  It will also detect other kinds of memory lapses, such as
  reading or writing to unauthorized parts of memory.

* Fail "gracefully" (i.e., neither go into an infinite loop nor cause a memory
  dump) if any of the assumptions above is violated.

Reading: You should review the relevant Aspnes sections on dynamic
programming.

Use the submit command (see below) to turn in the source file(s) for
Words.c and your log file.  Do not submit the hash.c, hash.h, and
Makefile.  Your code should

       #include "/c/cs223/hw7/hash.h"

YOU MUST SUBMIT YOUR FILES (INCLUDING THE LOG FILE) AT THE END OF ANY SESSION
WHERE YOU SPEND AT LEAST ONE-HALF HOUR WRITING OR DEBUGGING CODE, AND AT LEAST
ONCE EVERY HOUR DURING LONGER SESSIONS.  (All submissions are retained.)

Notes
=====
1. When available, the public grading script will be /c/cs223/hw7/Tests/test.Words
   (and my solution will be /c/cs223/hw7/Wordsx).  To run it, type

     % /c/cs223/hw7/Tests/test.Words

   (here % is the shell prompt).  The script uses make to create Words.  To
   run each test it redirects the test file (e.g., /c/cs223/hw7/Tests/t01.c
   for Test #01) to the standard input of Words and redirects the standard
   output to a temporary file.  Then it compares this file with the expected
   output for that input (e.g., /c/cs223/hw7/Tests/t01.cs for Test #01).
   Your program passes the test only if the two files are identical.

   To run your program on the file for Test #01, type

     % ./Words < /c/cs223/hw7/Tests/t01.c

   To compare the output from your program with the expected output, type

     % ./Words < /c/cs223/hw7/Tests/t01.c | cmp - /c/cs223/hw7/Tests/t01.cs

   (cmp outputs the first character where the files differ) or

     % ./Words < /c/cs223/hw7/Tests/t01.c | diff - /c/cs223/hw7/Tests/t01.cs

   (diff outputs the lines where they differ but uses a looser definition for
   "identical") or

     %  /c/cs223/hw7/Tests/test.Words 01

   (you may specify more than one test here).

   If your output looks the same as what is expected, but your program still
   fails the test, there are probably some invisible characters in your output.
   To make all characters visible (except blanks), type

     % ./Words < /c/cs223/hw7/Tests/t01.c | cat -vet

   or

     % ./Words < /c/cs223/hw7/Tests/t01.c | od -bc

2. Keep track of how you spend your time in completing this assignment.  Your
   log file should be of the general form (that below is fictitious):

     ESTIMATE of time to complete assignment: 10 hours

           Time     Time
     Date  Started  Spent Work completed
     ----  -------  ----  --------------
     1/13  10:15pm  0:45  Read assignment and relevant material in K&R
     1/16   4:45pm  1:15  Sketched solution using a finite-state machine with
                            one-character look-ahead
     1/19   9:00am  2:20  Wrote the program and eliminated compile-time errors;
                            code passes eight tests
     1/20   7:05pm  2:00  Discovered and corrected two logical errors; code now
                            passes eleven tests
     1/23  11:00am  1:35  Finished debugging; program passes all public tests
                    ----
                    7:55  TOTAL time spent

     I discussed my solution with: Peter Salovey, Ben Polak, Tamar Gendler,
     and Jonathan Holloway (and watched four episodes of The Simpsons).

     <A brief discussion of the major difficulties encountered>

   but MUST contain

   * your estimate of the time required (made prior to writing any code),

   * the total time you actually spent on the assignment,

   * the names of all others (but not members of the teaching staff) with whom
     you discussed the assignment for more than 10 minutes, and

   * a brief discussion (100 words MINIMUM) of the major conceptual and coding
     difficulties that you encountered in developing and debugging the program
     (and there will always be some).

   This log will generally be worth 5-10% of the total grade.

   N.B.  To facilitate analysis, the log file MUST be the only file submitted
   whose name contains the string "log" and the estimate / total MUST be on the
   only line in that file that contains the string "ESTIMATE" / "TOTAL".

3. The submit program can be invoked in eight different ways:

     % /c/cs223/bin/submit 7  Words.c time.log heap.c

   submits the named source files as your solution to Homework #7;
   Note that even though you will NOT submit btree.h.  You should
   assume that your btree table implementation may be tested
   independently of Words.

     % /c/cs223/bin/check  7

   lists the files that you submitted for Homework #7;

     % /c/cs223/bin/unsubmit  7  error.submit bogus.solution

   deletes the named files that you submitted previously for Homework
   #5 (which is useful if you rename a file or accidentally submit the
   wrong one);

     % /c/cs223/bin/makeit  7  Words

   runs "make" on the files that you submitted previously for Homework #5;

     % /c/cs223/bin/testit  7  Words  [THIS DOES NOT WORK.  SEE ABOVE.]

   runs the public test script for Words using the files that you submitted
   previously for Homework #7;

     % /c/cs223/bin/protect  7  Words.c time.log heap.c

   protects the named files that you submitted previously for Homework #7 (so
   they cannot be deleted accidentally);

     % /c/cs223/bin/unprotect  7  util.c time.log

   unprotects the named files that you submitted previously for Homework #7 (so
   they can be deleted); and

     % /c/cs223/bin/retrieve  8  common.c time.log

   and

     % /c/cs223/bin/retrieve  8  -d"2016/01/21 20:00" btree.c

   retrieve copies of the named files that you submitted previously for
   Homework #8 (in case you accidentally delete your own copies).  The day
   and hour are optional and request the latest submission prior to that time
   (see the -d flag under "man co" for how to specify times).

4. When assignments are style graded, EVERY source file found in the
   submit directory will be reviewed.  Thus prudence suggests using
   unsubmit to remove a file from the directory when you change its
   name or it ceases to be part of your solution.  See
   http://zoo.cs.yale.edu/classes/cs223/doc/Style

   In your spare time, you might think about how to automate the tests
   in the online style sheet.  That would be a pretty good homework
   assignment.

5. Prudence (and a 5-point penalty for code that does not make)
   suggests that you run makeit ("makeit 7 Words") after you have
   submitted the final version of your source files.  


CS-223-11/25/16