P R E L I M I N A R Y S P E C I F I C A T I O N Due 2:00 AM, Friday, 13 November 2020 CPSC 323 Homework #4 An Ounce of Compression REMINDER: Do not under any circumstances copy someone else's code for this assignment, give your code to someone else, or make it publicly available. After discussing any aspect of the assignment with anyone other than a member of the teaching staff (such discussions must be noted in your log file), do not keep any written or electronic record and engage in some mind-numbing activity before you work on the assignment again. Sharing ANY related document (e.g., code or test cases), is a violation of this policy. Since code reuse is an important part of programming, you may study published code (e.g., from textbooks or the Net) and/or incorporate it in a program, provided that you give proper attribution in your log file and in your source files (see the syllabus for details) and that the bulk of the code submitted is your own. Note: Removing/rewriting comments, renaming functions/variables, or reformatting statements does not convey ownership. (60 points) Write file compression and decompression filters: % encode [-m MAXBITS] [-p] % decode encode reads a stream of characters from the standard input, compresses it using the Lempel-Ziv-Welch algorithm, and writes the stream of codes to the standard output as a stream of bits packed into 8-bit bytes. decode reads from the standard input a byte stream written by encode, decompresses the stream of codes, and writes the stream of characters to the standard output. encode writes each code using the smallest number of bits required to specify valid codes at the time (e.g., 9 bits when there are 512 valid codes, but 10 bits once the next code is assigned), up to a maximum of 12 bits (or MAXBITS if the -m MAXBITS option is specified). This places a limit on the number of codes that can be assigned. encode begins with a string table that assigns a code to every one-character string and assigns new codes to other strings that it finds during the greedy parse. Normally it stops assigning codes when the table is full (i.e., there are no unassigned codes). But with the -p (pruning) option, it instead creates a new string table that contains: * every one-character string; and * every string that is a prefix of another code in the old table. which has the effect of removing codes that were never used. Note that the codes for some strings may change since new codes are assigned sequentially as strings are inserted in the new table. encode and decode are a single C program (i.e., they are hard links to the same inode) whose behavior is determined by the name under which it is invoked (i.e., the 0-th command-line argument). The -m and -p options may appear in any order and any number of times, with the last occurrence of each option superseding all earlier ones. Use the submit command to turn in your log file (see Homework #1) and the source files for encode/decode (including a Makefile) as assignment 4. YOU MUST SUBMIT YOUR FILES (INCLUDING YOUR LOG FILE) AT THE END OF ANY SESSION WHERE YOU WRITE OR DEBUG CODE, AND AT LEAST ONCE EVERY HOUR DURING LONGER SESSIONS. (All submissions are retained.) Notes ~~~~~ A. encode writes a one-line message to the standard error and exits when an invalid option is specified. decode writes a message and exits if there is a command-line option or if it detects a file that encode could not have written (and that would cause a valgrind error). [optional] encode treats as an error a MAXBITS that is not a sequence of decimal digits or is not representable using a long int. B. If MAXBITS <= 8 (= CHAR_BIT in ) or MAXBITS > 20 (= 2*CHAR_BIT+4), then encode replaces MAXBITS by 12 (= CHAR_BIT+4). C. decode does not accept the -m and -p command-line options since this is more convenient for the user. (What would happen if you forgot the value of MAXBITS for a particular compressed file?) Thus this information must be represented in the output from encode. For example, to represent MAXBITS: a. The first byte could contain the maximum code size. b. A special code (e.g., 0 or 111...111) could signal that the number of bits per code is changing. The latter has the added benefit of keeping encode and decode synchronized with respect to the number of bits sent per code. Or encode and decode could use both approaches. D. The string table consists of (CODE, PREFIX, CHAR) triples. Decode must find PREFIX and CHAR given a (nonempty) CODE, which is implemented most easily using an array of structs indexed by CODE. But encode must find CODE given the pair (PREFIX, CHAR). Since linear search is prohibitively expensive for large arrays, encode uses a faster search algorithm: hashing with chaining. A possible hash function HASH(PREF,CHAR) is (((unsigned long)(PREF) << CHAR_BIT) | (unsigned)(CHAR)) % SIZE where SIZE is the number of chains and should be odd. E. encode and decode should be economical in their use of memory. Thus the string table should have room for 512 entries initially (256 if there are no special codes) and double in size when it is full (without assigning new codes to strings), up to the limit of 2^MAXBITS. Similarly, the number NCHAINS of chains in the hash table should be small initially and roughly double as the string table doubles to keep the load average low (e.g., NCHAINS = 2^(NBITS-3) - 1, where NBITS is the number of bits needed to send all existing codes). Here are some hints on how to achieve a small footprint: 1) If you allocate one array for all of the nodes in the hash table other than the heads of the chains, how many bytes do you need to specify a node? 2) Do you need separate arrays of structs for the (PREF,CHAR) pairs and the hash nodes? 3) What are bit fields? 4) Implement encode and decode without worrying about how much storage you are using and reduce it later when you have working code. F. Normally the standard input and the standard output in a C program are buffered. The Standard I/O Library calls malloc() to allocate the buffers (whose length is 4096 or 8192 bytes) when each stream is first accessed (e.g., by a call to getchar(), scanf(), putchar(), or printf()). The tests for economical use of storage count _all_ storage allocated, including that for buffers. This could push your program over the limit imposed by one or more of these tests. To reduce the size of the buffers and allocate them statically (i.e., not on the heap or stack), insert the following lines at the beginning of main(): setvbuf (stdin, bin, _IONBF, 0); setvbuf (stdout, bout, _IONBF, 0); G. Your solution should incorporate the following files (see also Hint #A): Hwk4/code.h Header file defining putBits(), flushBits(), and getBits(), functions that write/read a stream of bits packed in bytes Hwk4/code.c Source file for putBits(), flushBits(), and getBits() Hwk4/code.o Object file for putBits(), flushBits(), and getBits() However, your Makefile and sources must reference the originals; e.g., #include "/c/cs323/Hwk4/code.h" rather than #include "code.h" When the environment variable STAGE exists and is equal to 1, Hwk4/code.o writes/reads codes using the format "%d\n". When STAGE is equal to 2, it writes/reads the number of bits and the code using the format "%d:%d\n". You may find this useful when debugging. (See Hwk4/code.h for details.) A bash command like % STAGE=2 /c/cs323/Hwk4/encode < ... will direct Hwk4/encode (or Hwk4/decode) to use the representation of the code sequence specified above. Moreover, if the environment variable DBG is set, it will dump the final table in a human-readable form to the file ./DBG.encode or ./DBG.decode, respectively. H. encode and decode should be relatively bombproof, but may assume that malloc(), calloc(), and realloc() never fail. While heap storage need not be free()-ed before exiting, it must all be reachable at that time. I. encode should handle both text and binary files (i.e., stdin may contain any of the 256 possible byte values, including the NUL character '\0'). J. Do NOT use the pow() or log() functions; use the shift operator << or a loop instead. K. The assignment is called lzw, so the test script will be invoked as % /c/cs323/Hwk4/test.lzw or as % /c/cs323/bin/testit 4 lzw L. The degree of LZW compression (i.e., the length of the output from encode) depends on the file; the value of MAXBITS (and how this value is represented in the output from encode); and the number of special codes (e.g., EMPTY, GROW, or PRUNE). Thus all tests of size will be in comparison with that given by Hwk4/encode (which the scripts assume is a correct implementation of LZW) and will be relatively loose (at least 1%). M. The following capabilities will be worth at most the number of points shown below (these values include tests that require two or more capabilities): * (16 points) increasing the code size from 9 to MAXBITS bits (vs. always using MAXBITS bits) * (20 points) implementing the -m option * (12 points) implementing the -p option * (12 points) handling binary files * (12 points) keeping the storage requirements low (see Note E) + The number of bytes should <= 9*SIZE, where SIZE = 2^NBITS is the number of slots in the string table when sending NBITS bits. + The penalty for using more than 9*SIZE bytes is 3 points. + The penalty for using more than 13*SIZE bytes is 3 points. + The penalty for using more than 17*SIZE bytes is 3 points. + The penalty for using more than 21*SIZE bytes is 3 points. N. The reference solution is ~170 lines. Hints on Encode/Decode ~~~~~~~~~~~~~~~~~~~~~~ A. It is MUCH easier to write the core of encode/decode in three stages: Stage 1: encode writes codes as ASCII integers (e.g., 66), one per line, and decode reads codes in the same format. That is, run with STAGE=1; see Hwk4/code.h. Stage 2: encode writes codes in the form NBITS:CODE, where NBITS and CODE are ASCII integers (e.g., 10:66), and decode reads codes in the same format, checking that the number of bits expected agrees with NBITS. That is, run with STAGE=2. Stage 3: encode writes codes as a stream of bits, packed into chars, and decode reads codes in the same format. That is, run with STAGE undefined or having some value other than 1 or 2. Since the "compressed" data is human-readable in Stages 1 and 2, code is easier to debug. Moreover, at most half of the tests in the final script (those labelled "Compresses?") require being in Stage 3. Warning: When implementing a new capability (e.g., increasing code size or the -p option), revert to STAGE=1 or STAGE=2 even if the rest of your code works at a higher level. You could also start with Stage 0, where MAXBITS is fixed at 9 and linear search (which is easy to write yet not prohibitively expensive for at most 512 items) is used to search for CODE given (PREF,CHAR). Or you could start with Stage -1, which is Stage 0 using an alphabet with only three letters. B. While the pseudo-code for decode uses a stack to write the string of characters corresponding to a given code, you may find it easier to use a recursive function instead. The amount of code is roughly the same. if you do use a stack, you may find it easier to implement it using an array rather than a linked list. C. Use int's or short's rather than char's to store character values, at least initially. The increase in storage is minor, but experience suggests that the savings in debugging time can be major. D. When encode writes a code, the number of bits depends on the number of codes assigned, not the value of the code. decode knows this number, so it can deduce how many bits encode sent. E. You may find it easier to get everything working with NBITS fixed at MAXBITS so that the size of the codes sent/received never changes. As noted in the specification, the penalty for doing so will be at most 16 points, but the rest of the code should be much easier to write. F. The notes above give two schemes for keeping encode and decode synchronized with respect to the number of bits per code (= NBITS). The second adds a few extra codes to the output (e.g., one for each change to NBITS), but makes it easier for decode to know when to change NBITS. Then again you could use both schemes. G. The LZW handout gives two ways to handle the KwKwK problem. Assume that the encode clock starts at 0 and ticks every time that encode writes a code; and the decode clock starts at 0 and ticks every time that decode reads a code. * PATCH: When a given string is assigned a code, the encode and decode clocks show the same time. The only difference is that encode knows both PREF and CHAR, but decode knows only PREF and thus will not discover the CHAR field until after it receives the next code. (Moreover, if it is using the array + hash table data structure, it cannot add the pair to the hash table until it does so.) The KwKwK problem arises when the code received is the last code, whose CHAR field is not known. * D-DELAY: decode delays assigning a code to a string by one tick, when the next code reveals the CHAR field. Thus decode's string table is one tick behind decode's. The KwKwK problem arises when decode receives the code that is the next one to be assigned. There is a third way to handle the KwKwK problem: * E-DELAY: encode delays assigning a code to a string by one tick so that decode can assign the code at the same time on their clocks. However, both must avoid assigning more than one code to a string. For example, encode sends code(Kw) which leads to adding KwK; and then sends code(Kw) again (since KwK is not yet in the table), which could lead to decode adding KwK again. The KwKwK problem cannot ocur. H. When the -p WINDOW option creates a new table, it might not assign the same code to a string; that is, the old code may no longer be valid either for sending or as a prefix. I. Since compression followed by decompression should reproduce the original, the acid test for encode/decode is that % ./encode < FILE | ./decode | cmp - FILE does not produce output for any file FILE. In this command * encode reads bytes from FILE and compresses them into a sequence of codes that are written to its standard output; * decode reads this sequence of codes and decompresses them into a sequence of bytes that are written to its standard output; and * cmp reads this sequence of bytes and compares it to the contents of FILE, printing a message only if it finds a discrepancy (shorter, longer, or at least one character different). "od -bc" and "cat -v -e -t" are other useful commands for looking at files. J. It may be helpful to write scaffolding code to: * Verify that the string table is consistent; e.g., for each (PREF, CHAR) pair, PREF is a valid code or some other special value, CHAR is a valid character, and the pair can be found where it is stored. * After inserting something in the hash table, search for it immediately to verify that it can be found where it was inserted. (When pruning, these searches should take place AFTER pruning is complete.) * Dump to a file the contents of the string table, with each line containing a CODE, the corresponding (PREF,CHAR) pair, and the string to which it corresponds (with all character values written as integers, not chars, so that they are visible). You can use this function to compare the tables produced by encode and decode, or (by omitting CODE, PREF, and CHAR and then sorting) to compare which strings are in the table before and after pruning. K. Common problems: Q: What to do if LZW works for text files but not binary files: A: The primary difference is that the latter usually contain NUL characters (i.e., '\0'). Thus you know where to start looking. Q: What to do if LZW works in Stage 2 but fails in Stage 3: A: If the output from decode is too short but is otherwise correct, verify that decode received the last code from encode (i.e., verify that encode calls flushBits() before exiting). If the output from decode is completely garbled, verify that decode received the first code from encode correctly. Q: What to do if the output from decode is garbled for some input to encode: A: Dump the string tables from encode and from decode in a human-readable format. Are they the same, except possibly for the last code assigned? If not, shorten the file (using "head -c N" and binary search) until the tables would be the same if one more character were removed; and trace the flow of execution to see where encode and decode differ. Q: What to do if the output from encode with pruning is too long: A: Verify that encode is writing the correct number of bits after pruning (i.e., the number of bits reflects the size of the pruned table). Verify that every string in the pruned table can be found (otherwise it may be added a second time, which could lead to less compression and more pruning). L. In bash you can set STAGE by preceding the command by a local variable assignment like STAGE=2 % STAGE=2 ./encode < FILE | STAGE=2 ./decode | cmp - FILE or by using the built-in export command % export STAGE=2 A more permanent alternative is to set the value in main() by inserting the statement setenv ("STAGE", "2", 1); before the first call to putBits() or getBits(); this also requires #include at the start of the file. But do not forget to change the second string and recompile when you want to change the value of STAGE, or to remove the statement when you reach STAGE=3. CS-323-10/22/20 The Lempel-Ziv-Welch Algorithm ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Notation ~~~~~~~~ Each string W in the table is represented by a pair (C,K), where K is the last character in W and C is the location in the table of the prefix of W that contains all but the last character. By convention, the null string is at location EMPTY; and the components of the pair in location I are denoted by PREF(I) and CHAR(I). Compression ~~~~~~~~~~~ Initialize the string table to contain the pair (EMPTY,K) for each char K C = EMPTY While ((K = getchar()) != EOF) If the pair (C,K) is in the table Then Set C = code associated with the pair (C,K) in the table Else Output code C Insert the pair (C,K) into the table Set C = code associated with the pair (EMPTY,K) in the table Output code C (if C != EMPTY) Decompression ~~~~~~~~~~~~~ Initialize the string table to contain the pair (EMPTY,K) for each char K // Version where strings are inserted | // Version where strings are inserted // in table with UNKNOWN CHAR fields | // only after CHAR field is known | While ((newC = C = getcode()) != EOF) | oldC = EMPTY If CHAR(C) == UNKNOWN | While ((newC = C = getcode()) != EOF) Then | If C is an unknown code Push finalK onto Kstack | Then C = PREF(C) | Push finalK onto Kstack | C = oldC While PREF(C) != EMPTY | Push CHAR(C) onto Kstack | While PREF(C) != EMPTY C = PREF(C) | Push CHAR(C) onto Kstack | C = PREF(C) finalK = CHAR(C) | putchar(finalK) | finalK = CHAR(C) | putchar(finalK) While Kstack is nonempty | Pop K off Kstack | While Kstack is nonempty putchar(K) | Pop K off Kstack | putchar(K) If the CHAR field of the last code | assigned is UNKNOWN | If oldC != EMPTY Replace it by finalK | Insert the pair (oldC,finalK) Insert the pair (newC,UNKNOWN) | into the table into the table | oldC = newC Example (with EMPTY = 0) ~~~~~~~~~~~~~~~~~~~~~~~~ Message: a b a b c b a b a b a a a a a a a ... Encoding: 1 2 4 3 5 8 1 10 11 ... String Table: C PREF CHAR String C PREF CHAR String C PREF CHAR String ... 1 0 a a 5 2 a ba 9 8 a baba 2 0 b b 6 4 c abc 10 1 a aa 3 0 c c 7 3 b cb 11 10 a aaa 4 1 b ab 8 5 b bab 12 11 a aaaa CS-323-08/13/19