CPSC 223: Homework 7. Words - breaking up is hard to do

Due 2:00 AM, Wednesday, 3 May 2017 (extension)

R E V I S E D S P E C I F I C A T I O N

REMINDER: Do not under any circumstances copy another student's code or give a copy of your code to another student. After discussing the assignment with another student, you may not take any written or electronic record away. Moreover, you must engage in a full hour of mind-numbing activity before you work on it again. Such discussions must be noted in your log file.

breakingupishardtodo

We can read the above text, even though it is missing word boundaries.

breaking up is hard to do

Your job in this assignment is to solve the string segmentation problem.

Reading

Program Specification

For this assignment, you will create a program:
    Words [-dict filename] [-debug]
that reads standard input and writes to standard output. It first loads a word file (or dictionary) into a hash table, as you did in hw3. In fact, we will give you the hash table code from hw3 (hash.c, hash.h, and Makefile). The word file will have one word per line. When you load the words into the hash table, you will convert them to lower case and ignore any words containing non-alphabetic characters, such as digits or punctuation.

The default word file is simply "words" in the current directory. This should make it simpler to test your code without having to load 400,000 words from the online dictionary every time you run your code. However, your program should be able to process the online dictionary, if given as a command line argument.

All error output (usage and "Fatal Error" messages below) should be printed to standard error. For example,

   
   fprintf(stderr, "usage ...");
Note that not all errors are fatal. All other output should be printed to standard output. (Use normal printf.)

There are many ways to approach this problem. For starters, you might try to see if you can break the string into two words. You simply divide the string into a left and right substring, and check to see if both left and right parts are in the dictionary. We will call this the two word solution. This is pretty efficient, but also limited.

Another approach is depth first search. You find the first prefix that is a word, and then recursively try to segment the remainder of the string. This is a general solution, but not very efficient. For long strings, you will likely perform the same calculation repeatedly.

That should ring a bell in your head: dynamic programming. You want your program to avoid repetitive calculations.

The current solution, Wordsx, implements both the two word and the DP approaches, using the following DP algorithm:

  1. Initialize an array matrix[strlen(word) + 1] = -1. Let matrix[strlen(word)] = 0. -1 shall be a sentinel value signifying that there is not a valid word split starting at this index onwards.
  2. Let i be the index moving backwards from the end of the word to 0. It represents the start of a word split. For each index i, check if there is a valid word split iterating from word[i, i] to word[i, strlen(word)].
  3. If there is a valid split, let j be the index of the end of that valid split.
  4. If the substring[i, j] is a valid word and matrix[j + 1] != -1 (e.g. there is a sequence of valid word splits from j + 1 to the end of the string. Given that i iterates backwards from the end of the word, if there was a valid split at starting at j + 1, we would have already found it!), let matrix[i] = j + 1, e.g. the index of the start of the next word.
Below is some sample output.
$ cat words 
I
a
am
ace
mace
ma
mama
for
ever
forever
car
rot
carrot
breaking
break
up
is
hard
to
do
$ ./Words
mama
Two words: SUCCESS: ma ma.  
DP: SUCCESS: mama 
---
mamama
Two words: SUCCESS: ma mama.  mama ma.  
DP: SUCCESS: mama ma 
---
forevercarrot
Two words: SUCCESS: forever carrot.  
DP: SUCCESS: forever carrot 
---
Iamace
Two words: FAILURE
DP: SUCCESS: i am ace 
---
xxxx
Two words: FAILURE
DP: FAILURE
---
notindictionary
Two words: FAILURE
DP: FAILURE
---
Note that at this point, Words prints all possible two word segmentations, but not all possible segmentations.

Here is from the algorithm example, including the debug flag.

$ ./Words -debug
Loading dictionary: words
Word count: 20
breakingupishardtodo
Input: 'breakingupishardtodo'
Two words: FAILURE
DP: SUCCESS: breaking up is hard to do 
8 -1 -1 -1 -1 -1 -1 -1 10 -1 12 -1 16 -1 -1 -1 18 -1 20 -1 
---
The debug flag causes the program to print out the array and other information.

We can also feed the test words from a file using redirected input.

$ cat test
mama
mamama
breakingupishardtodo
$ ./Words -debug < test
Loading dictionary: words
Word count: 20
Input: 'mama'
Two words: SUCCESS: ma ma.  
DP: SUCCESS: mama 
4 3 4 4 
---
Input: 'mamama'
Two words: SUCCESS: ma mama.  mama ma.  
DP: SUCCESS: mama ma 
4 3 6 5 6 6 
---
Input: 'breakingupishardtodo'
Two words: FAILURE
DP: SUCCESS: breaking up is hard to do 
8 -1 -1 -1 -1 -1 -1 -1 10 -1 12 -1 16 -1 -1 -1 18 -1 20 -1 
---
Words should:

Use the submit command (see below) to turn in the source file(s) for Words.c and your log file. Do not submit the hash.c, hash.h, and Makefile. Your code should

       #include "/c/cs223/hw7/hash.h"
YOU MUST SUBMIT YOUR FILES (INCLUDING THE LOG FILE) AT THE END OF ANY SESSION WHERE YOU SPEND AT LEAST ONE-HALF HOUR WRITING OR DEBUGGING CODE, AND AT LEAST ONCE EVERY HOUR DURING LONGER SESSIONS. (All submissions are retained.)

Notes

  1. When available, the public grading script will be /c/cs223/hw7/Tests/test.Words (and my solution will be /c/cs223/hw7/Wordsx). To run it, type
    % /c/cs223/hw7/Tests/test.Words
    
    You may invoke a given test as follows:
    %  /c/cs223/hw7/Tests/test.Words 01
    
    (you may specify more than one test here).

    If your output looks the same as what is expected, but your program still fails the test, there are probably some invisible characters in your output.

  2. Keep track of how you spend your time in completing this assignment. Your log file should be of the general form (that below is fictitious):
         ESTIMATE of time to complete assignment: 10 hours
    
               Time     Time
         Date  Started  Spent Work completed
         ----  -------  ----  --------------
         1/13  10:15pm  0:45  Read assignment and relevant material in K&R
         1/16   4:45pm  1:15  Sketched solution using a finite-state machine with
                                one-character look-ahead
         1/19   9:00am  2:20  Wrote the program and eliminated compile-time errors;
                                code passes eight tests
         1/20   7:05pm  2:00  Discovered and corrected two logical errors; code now
                                passes eleven tests
         1/23  11:00am  1:35  Finished debugging; program passes all public tests
                        ----
                        7:55  TOTAL time spent
    
         I discussed my solution with: Peter Salovey, Ben Polak, Tamar Gendler,
         and Jonathan Holloway (and watched four episodes of The Simpsons).
    
         *A brief discussion of the major difficulties encountered*
    
    but MUST contain This log will generally be worth 5-10% of the total grade.

    N.B. To facilitate analysis, the log file MUST be the only file submitted whose name contains the string "log" and the estimate / total MUST be on the only line in that file that contains the string "ESTIMATE" / "TOTAL".

  3. The submit program can be invoked in eight different ways:
    % /c/cs223/bin/submit 7  Makefile Words.c time.log
    
    submits the named source files as your solution to Homework #7
    % /c/cs223/bin/check  7
    
    lists the files that you submitted for Homework #7;
     % /c/cs223/bin/unsubmit 7  error.submit bogus.solution
    
    deletes the named files that you submitted previously for Homework #7 (which is useful if you rename a file or accidentally submit the wrong one);
    % /c/cs223/bin/makeit 7 Words
    
    runs "make" on the files that you submitted previously for Homework #7;
    % /c/cs223/bin/testit 7 Words
    
    runs the public test script for Words using the files that you submitted previously for Homework #7; This might not be working. Use the testing instructions given above.
    % /c/cs223/bin/protect  7  Words.c time.log
    
    protects the named files that you submitted previously for Homework #7 (so they cannot be deleted accidentally);
    % /c/cs223/bin/unprotect  7  util.c time.log
    
    unprotects the named files that you submitted previously for Homework #7 (so they can be deleted); and
    % /c/cs223/bin/retrieve  8  common.c time.log
    
    and
    % /c/cs223/bin/retrieve  8  -d"2017/01/21 20:00" util.c
    
    retrieve copies of the named files that you submitted previously for Homework #8 (in case you accidentally delete your own copies). The day and hour are optional and request the latest submission prior to that time (see the -d flag under "man co" for how to specify times).
  4. When assignments are style graded, EVERY source file found in the submit directory will be reviewed. Thus prudence suggests using unsubmit to remove a file from the directory when you change its name or it ceases to be part of your solution. See http://zoo.cs.yale.edu/classes/cs223/doc/Style

    Prudence (and a 5-point penalty for code that does not make) suggests that you run makeit ("makeit 7 Words") after you have submitted the final version of your source files. Better yet, run testit ("testit 7 Words").

  5. The function exit() allows your program to stop immediately, without having to terminate any surrounding loops or to return to main() from a function. (To use it you must #include the header file <stdlib.h>.)