P R E L I M I N A R Y S P E C I F I C A T I O N Due 2:00 am, Friday, 22 April 2016 CPSC 223b Homework #6 WEPL-Balanced Word-Count Trees REMINDER: Do not under any circumstances copy another student's code or give a copy of your code to another student. After discussing the assignment with another student, you may not take any written or electronic record away. Moreover, you must engage in a full hour of mind-numbing activity before you work on it again. Such discussions must be noted in your log file. (40) Write a program Words16 [TEXTFILE | -d DICTFILE | -print | -dump | -epl | -set VALUE]* that can be used to check for spelling errors (among other things). Words16 maintains a tree of (word,count) pairs and an improvement factor IF (initially 0), and accepts six types of arguments: Type Action Taken ---- ------------ TEXTFILE Read words from TEXTFILE; increment the count for those that are in the tree; insert with a count of 1 those that are not -d DICTFILE Read words from DICTFILE; delete from the tree those that are in the tree -print Write the (word,count) pairs in the tree to the standard output in inorder (that is, in alphabetic order), one word per line, preceded by its count, using the format "%3d %s\n" -dump Write the words (including those in internal nodes; see below) in the tree to the standard output in preorder, one word per line, using the format "%s". -epl Write the sum of the counts in the leaves and the weighted external path length for the tree to the standard output, using the format "%d, %d\n" -set VALUE Set the improvement factor IF to the integer VALUE Each command line argument is processed at the time it is encountered. Thus with the command % Words16 Essay -d SysDict -dump -set 2 Appendix -d UsrDict -print the words in the file Essay will be added to the tree, the words in the file SysDict deleted from the tree, the tree dumped in preorder, IF changed to 2 from the initial value of 0, the words in the file Appendix added to the tree, the words in the file UsrDict deleted from the tree, and the (word,count) pairs in the tree printed in alphabetic order. The tree should be implemented as a WEPL-Balanced Binary Search Tree (AKA a "2016" tree), which has some unusual properties: A) Either the tree is empty or every node has 0 or 2 children. B) The (word,count) pairs are stored in the leaves of the tree; the words in the internal nodes are used only for searching. Thus unless the tree is empty, inserting a (word,count) pair into the tree creates a new internal node. For example, inserting the pair ("FOO",1) into the one-node tree ("BAR",2) gives the tree "BAR" / \ ("BAR",2) ("FOO",1) (by Property C below, the word in the internal node must be the smaller of the two words). Similarly, deleting a (word,count) pair from the tree deletes the parent of that leaf and makes its sibling the child of its grandparent. For example, deleting "FOO" from the tree causes "YUK" "YUK" "FOO" "FOO" / \ / \ / \ / \ "BAR" T2 => T1 T2 OR "BAR" T2 => T1 T2 / \ / \ T1 ("FOO",2) T1 ("FOO",2) where T1 and T2 denote nonempty subtrees. Note that after deletion the word might still exist in an internal node as in the second example above. C) All words in the left subtree of a node (whether in leaves or in internal nodes) are less than or equal to the word in the node; and all words in the right subtree are greater than it. D) The weight W(N) of a node N is the sum of the counts in the leaf nodes of the subtree rooted at N and is stored in N. The weight W(T) of a subtree is the weight of its root. E) The weighted external path length WEPL(T) of a subtree T is the weighted sum of the lengths of the paths from its root to each of its leaves, with the weights being the values of count. WEPL(T) is stored in the root of T. F) To improve the average search time the standard algorithms for search, insert, and delete are modified as follows: Search/Insert: Assume that the word W was found (and its count incremented) or inserted in the subtree T rooted at N. If the leaf containing W is in the left [right] subtree of T, then rotate N to the right [left] whenever this rotation would reduce WEPL(T) by more than IF; that is, letting T' denote the subtree after the rotation, whenever WEPL(T') < WEPL(T) - IF. Delete: Assume that the word W was deleted from the subtree T rooted at N. If the node that contained W is the left [right] subtree of T, then rotate N to the left [right] whenever this rotation would reduce WEPL(T) by more than IF; that is, letting T' denote the subtree after the rotation, whenever WEPL(T') < WEPL(T) - IF. Note that these rotations cannot make a leaf internal, and that the modifications above are applied as the recursive search/insert/delete unwinds. When IF is 0, these changes are intended to reduce the weighted external path length (and thus the average successful search time) whenever possible. (Whether these intentions are met is unknown.) Examples: % Words16 sectionOne sectionTwo sectionThree -d /usr/share/dict/words -print % Words16 chapter -d /c/cs223/Doc/words.big -print -d myDictionary -print % Words16 wordList -print -epl % Words16 -set 3 novella -d /usr/share/dict/linux.words -dump Use the submit command to turn in the source files for Words16, a Makefile, and your log file (see Homework #1). YOU MUST SUBMIT YOUR FILES (INCLUDING THE LOG FILE) AT THE END OF ANY SESSION WHERE YOU SPEND AT LEAST ONE-HALF HOUR WRITING OR DEBUGGING CODE, AND AT LEAST ONCE EVERY HOUR DURING LONGER SESSIONS. (All submissions are retained.) Notes ~~~~~ 1. In reading words from lines, Words16 should - treat all non-alphanumeric characters as whitespace; - convert all letters to lower case; - treat every maximal, contiguous sequence of alphanumeric characters as a word; - use strcmp() to define the order on words. 2. A dictionary can be ANY file (that is, there is no requirement that the words be in alphabetical order or that they appear one per line). 3. Words16 should detect illegal command-line arguments and nonexistent or unreadable files, printing one-line error messages to stderr and exiting. Errors may be reported when they are detected (vs. at the beginning). 4. Words16 may use the getLine(FILE*) function implemented in Hwk3/getLine.h and Hwk3/getLine.o to read lines from a file given an existing FILE pointer. If it does, your source file(s) should #include /c/cs223/Hwk3/getLine.h rather than the name of a local copy. Similarly your Makefile should specify Hwk3/getLine.o rather than a local copy. 5. The 2016 tree should be written as an abstract data type (that is, with a .h file to define the interface and the implementation in a separate .c file) so that the code can be used in other applications. The implementation may NOT use parent pointers (that is, there can only be two node pointers per node) or arrays (other than to hold copies of words). Hwk6/protoTree16.h defines an interface to an ADT for a standard binary search tree of which Hwk6/protoTree16.c is a nearly complete implementation. These are based on the class handout (cf. Hwk6/webTree). You may use any or all of this code without attribution (but you are responsible for its style!). 6. Search, insert, and delete must run in O(height(tree)) time. This means that the values W(T) and WEPL(T) stored in each node of the tree must be updated in O(1) time per node visited. 7. Your implementation of Words16 may not use any non-static global variables. should free malloc()-ed storage before overwriting the last pointer to it, and should close each FILE pointer after reading the file. However, you need not free all storage before exiting. 8. The purpose of the -epl option is to allow one to compute the average search time. The values printed are * WN = the sum of the counts in the leaves of the tree * WEPL = the weighted external path length, which is the sum over all of the leaves of the length of the path from the root to that leaf Thus WEPL is the total time to search for each leaf a number of times equal to its count, and WEPL / WN is the average time for each of these searches. 9. You can use the -dump option in Words16 to deduce the shape of the tree: The first word output is the word in the root; the words following (if any) that are less than or equal to that word are a dump of the left subtree; and the remaining words are a dump of the right subtree. For example, if % Hwk6/Words16 file -dump b a a b c c d d e e f then the tree must be . b . / \ . a c . / \ / \ . a b c d . / \ . d e . / \ . e f Using -print gives the weight of each leaf, and -epl gives WN and WEPL. A. You may find it easier to get Words16 working with standard binary search trees first and then modify the tree module to support the modifications to searchInsert() and delete() needed for 2016 trees. B. The use of 2016 trees rather than standard binary search trees will be tested in at most 20 tests. C. Point to Ponder: Could double rotations improve the performance? When should they be used? D. Reading: Van Wyk, Chapter 7 (binary trees) Van Wyk, Chapter 9 (AVL trees & B-trees) Van Wyk, Chapter 10 (heaps) CS-223-04/05/16