========================================
Notes for Lecture 25 - April 26, 2007
========================================

* Huffman codes and data compression

** Theory of Huffman codes

*** Intuition: use short code words for frequently occurring letters

*** Two approaches to building Huffman code:

**** Based on probabilities
Assumes clear text file is randomly selected according to a known
probability distribution on symbols.
Code depends only on probabilities.
Works well "on average" if files really come from the assumed
distribution.

**** Based on actual letter frequency
Code is based on actual letter frequencies in a particular clear text
file.
A new code is computed for each clear text file.
Is best prefix code for that file.
Downside: The code tree must be stored in the encoded file,
increasing its length and negating some of the savings from having an
optimal code.

** Huffman algorithm

*** Read file twice
First pass, compute letter frequencies.
Build the Huffman tree from those frequencies.
Second pass, encode the file.
Code defined only for bytes that actually occur in the clear text file.

*** How to build tree
Start with a leaf node for each byte that occurs.
Weight of each leaf node is the number of times the symbol occurs.
Put all leaves into a priority queue (e.g., heap).
This allows smallest (least frequent) node to be extracted quickly.
Repeatedly remove two smallest nodes.
Make them the two sons of a new internal node.
Set weight of new node to sum of weights of two sons.
Reinsert the new node into the priority queue.
Stop when all nodes have been combined into a single tree.

** Serialization

*** Problem
A Huffman-compressed file contains a description of the Huffman code
followed by the sequence of encodings of the bytes of the original
file.  The Huffman code is represented by a tree.  We need a way to
describe the tree by a linear sequence of symbols.  The process of
finding such a linearization is called serialization.

*** Why doesn't it work to simply dump memory?
Pointers refer to memory addresses in current address space.
When the structure is recreated later, the nodes will in general be
in different memory locations, so the pointers will be different.
Serialization is any method for defining the structure in a way that
will let an "equivalent" structure be recreated later, by another
program.

*** Recursive definition of serialization of Huffman tree
Let ser(n) be the serialization (bitstring form) of tree rooted at n.
If n is a leaf:
    ser(n) = 1A, where A is 8-bit ASCII representation of the symbol
    labeling n (value of n).
If n is an internal node:
    ser(n) = 0 ser(n->son[0]) ser(n->son[1]).

** Realization
This is the reverse of serialization: create structured data from the
serialization.

*** Recursive method for realizing Huffman tree
Read first bit.
If it's a 1, read 8 more bits and build a leaf.
If it's a 0, recursively read two trees a and b, and build a new tree
with a and b as the two sons.

** Special cases for Huffman encoding

*** Empty clear text file.

**** Problem
Frequency of all letters is 0.
Huffman tree (if one is built) has no nodes.
Serialization of Huffman tree (by above definition) would be the
empty bit string.
Encoding of file would be the empty bit string.
However, the algorithm for realizing a tree does not work since there
is no way for it to know not to read any bits.

**** Solution
Simply define the encoding of the empty clear text file to be the
empty bit string and dispense with the code tree altogether.

*** Only one symbol occurs in the clear text file, possibly many times.

**** Problem
Huffman tree (if one is built) has only one node, so codeword of that
node is the empty bit sequence, and the encoding of the file is the
empty bit sequence, no matter how many times the symbol occurs in the
file.  This is not an acceptable representation method since the
clear text file cannot be uniquely reconstructed.

**** Solution
Put a dummy node of frequency 0 into the Huffman tree construction.
This will force the Huffman tree to have a root and two sons --
son[0] will be the dummy node; son[1] the real node, so the encoding
of the one symbol that occurs in the file will be the bit string 1.

** Complete Huffman file-encoding algorithm
Read clear text file twice.
On first pass, build frequency table.
Build Huffman code tree.
Serialize it to the output file.
Build a code table from it.
Rewind input file (using rewind()).
On second pass, encode using code table as with any prefix code.

** File decoding algorithm
Realize code tree, handling special cases as appropriate.
Decode remainder of encoded input file as with any prefix code.

* LZW data compression

** Algorithm
In preparation for both encoding and decoding, create a code table
(symtab) where the keys are byte strings and the values are numbers
of a certain length (e.g., 12-bit).  Initialize the table by
assigning code words 0...255 to the 1-byte strings.  The remaining
code words are unassigned.

*** Encoding algorithm
1.  Read input file a byte at a time.  Let w be the longest input
    prefix such that w and all of its prefixes are in the code table.
2.  Look up w in the code table and output its 12-bit code word.
3.  Let a be the next input byte following w.  Assign the next unused
    code word to "wa" and put "wa" in the code table.  If table is
    full (i.e., all 12-bit code words have been assigned), then skip
    this step but continue encoding.
4.  Repeat steps 1-3 until the entire file is encoded.

*** Decoding algorithm
1.  Read the encoded file 12 bits at a time.  Let c be the next code
    word, and let w be the byte string associated with c.
2.  Write w to the output file.
3a. Look ahead: let c' be the next code word.  There are two cases
    depending on whether c' has already been assigned a word in the
    code table.
3b. Case i) c' is in the table.  Let x be the byte string associated
    with c'.  Let a be the first byte of x.
3c. Case ii) c' is not in the table.  Let a be the first byte of w.
3d. Assign the next unused code word to "wa" and put "wa" in the code
    table.  If table is full, then skip this step but continue
    decoding.
4.  Repeat steps 1-3 until the entire file is decoded.

*** Why it works
The decoding algorithm tracks the encoding algorithm.  Even though
the code table changes after each stage, the code table after stage k
of decoding is identical to what it was after stage k of encoding.

The tricky case is step 3 of the decoding algorithm.  During
encoding, the word "wa" was put in the code table, where "a" is the
next (not-yet-processed) input byte.  During decoding, "a" has not
yet been decoded.  In order to update the code table properly, we
must somehow determine what the next byte of input is before we have
decoded it.  While this sounds paradoxical, we can break the problem
into two cases, both of which can be handled.

If the next 12-bit code word is already in the table, then we can
look ahead in the encoded file, get the next word, and find it in the
code table.  The first byte of its associated byte string is the
desired byte "a".

If the next 12-bit code word is not yet in the table, we know it will
be there at the completion of this stage, since during encoding it
must have been in the table at the next stage.  Therefore, the byte
string that will be associated with it is the string "wa", where "a"
is the byte that we are trying to determine.  But if "wa" is the
decoding of the next code word, then its first byte is the first byte
of "w"!  This is the reason why step (3c) above works.

** Implementation issues

*** Byte order
Internally, each 12-bit code word is simply represented by a signed
or unsigned integer of sufficient length (e.g., an int or short int).
While these integers could be written to the output file as ordinary
decimal numbers, that would be quite wasteful of space and would
negate the hoped-for savings in file size.  Instead, we want to
represent each pair of code words by 3 bytes of the encoded file.
But which three bytes?
Two natural orders are possible.

Example:  Suppose the two code words to be written out are (in hex):
    0x123 0xabc

**** Big endian
In big endian order, integers are written out high-order bits first.
The 3 code bytes would be: 0x12 0x3a 0xbc.

**** Little endian
In little endian order, integers are written out low-order bits first.
The 3 code bytes would be: 0x23 0xc1 0xab.

Either method is acceptable, but clearly the encoder and decoder must
agree.

*** Big-endian and little-endian machines
Computers are big-endian or little-endian.  This is noticeable only
if one looks at the bytes of a number, not through shifting and
masking (which always gives the same result no matter what the flavor
of the computer), but through type casting or unions.

Example:
    union view {
        long unsigned int lo;
        unsigned char b[4];
    } x;

The 4 bytes of x (on a 32-bit machine) can be viewed in two ways:
x.lo, which is a long unsigned int, and x.b[0], ..., x.b[3], which
are the four bytes that comprise it.  If x.lo in hex is 0x12345678,
then what is observed in b[] depends on the endianness of the
machine.

Big-endian machine:
    b[0] = 0x12, b[1] = 0x34, b[2] = 0x56, b[3] = 0x78.
Little-endian machine:
    b[0] = 0x78, b[1] = 0x56, b[2] = 0x34, b[3] = 0x12.

The demo program uses shifting to pack and unpack the bits, so it is
correct on either kind of machine.

*** Storage management
Because a new string must be stored at each stage of encoding or
decoding, there is a big performance advantage to using a string
store rather than malloc() for allocating the storage for the byte
strings.

*** Efficiency -- big cost to maintaining sorted table
Several different algorithms are possible for maintaining the code
table.  One method that gives good performance is a hash table since
it provides fast lookups and also allows new entries to be inserted
quickly.

A simple-to-implement method stores the code table in an array,
sorted on the byte strings.  This can be searched quickly using
binary search.
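
A minimal sketch of such a sorted-array code table, to make the
lookup and insertion steps concrete.  The names struct entry, struct
table, table_find, table_insert and TABLE_MAX are made up for this
illustration; they are assumptions, not taken from the demo programs.
The sketch also assumes the table does not own the byte strings it
points to.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TABLE_MAX 4096            /* assumed: one slot per 12-bit code word */

/* One code-table entry: a byte string and its code word. */
struct entry {
    const unsigned char *bytes;   /* not owned by the table (assumption) */
    size_t len;
    int code;
};

struct table {
    struct entry e[TABLE_MAX];    /* kept sorted on the byte strings */
    size_t n;                     /* number of entries in use */
};

/* Order byte strings lexicographically; a proper prefix sorts first. */
static int cmp_bytes(const unsigned char *a, size_t alen,
                     const unsigned char *b, size_t blen)
{
    size_t m = alen < blen ? alen : blen;
    int c = memcmp(a, b, m);
    if (c != 0) return c;
    return (alen > blen) - (alen < blen);
}

/* Binary search.  Returns 1 and sets *pos to the entry's index if the
   string is present; otherwise returns 0 and sets *pos to the index
   at which the string would have to be inserted. */
static int table_find(const struct table *t, const unsigned char *s,
                      size_t len, size_t *pos)
{
    size_t lo = 0, hi = t->n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int c = cmp_bytes(s, len, t->e[mid].bytes, t->e[mid].len);
        if (c == 0) { *pos = mid; return 1; }
        if (c < 0) hi = mid; else lo = mid + 1;
    }
    *pos = lo;
    return 0;
}

/* Insert a string that is not yet present.  Every entry at or after
   the insertion point is shifted up one slot with memmove(). */
static void table_insert(struct table *t, const unsigned char *s,
                         size_t len, int code)
{
    size_t pos;
    int found = table_find(t, s, len, &pos);
    assert(!found && t->n < TABLE_MAX);
    memmove(&t->e[pos + 1], &t->e[pos], (t->n - pos) * sizeof t->e[0]);
    t->e[pos].bytes = s;
    t->e[pos].len = len;
    t->e[pos].code = code;
    t->n++;
}
```
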
However, inserting a new entry requires moving existing elements in
the array to make room for the new element.  The demo program
"compress-bin" uses this method and is considerably slower than the
program "compress-hash" that uses a hash table.

Another data structure that we've studied that could be used is a
binary search tree.  Here, searches are fast if the tree can be kept
balanced but are slow if the tree gets badly out of balance.  Using
this for LZW compression is likely to cause unpredictable performance
since particular data patterns might give rise to especially
unbalanced trees (such as a file of all 0's, where the byte strings
to be inserted in the code table would be "0", "00", "000", ... in
that order and would result in a maximally unbalanced tree).
Balanced binary search tree algorithms could be used but tend to be
more complicated to code and have somewhat greater overhead in both
their time and space requirements.

** Possible improvements
One big weakness of LZW encoding as I've presented it is that the
table doesn't change once it's full.  This is fine if the input file
is uniform throughout, so that the strings that occur in the first
part of the file occur throughout with more or less the same
distribution.  But if the file is not uniform in this way, poor
compression might result on the parts of the file later on that do
not look like the beginning part.  I'll sketch out some possibilities
for getting around this limitation, but I don't know how well any of
them work or if they're used in practice.

*** LRU
One possible way around this problem is to not freeze the code table
when it becomes full but instead to replace seldom-used entries by
new ones.  A natural strategy is LRU (least recently used).  Here, we
attach to each code word the "time" (the stage number) when that code
word was last accessed.  When it comes time to replace an entry, the
one with the earliest last-access time is removed.
(Note that we will want to maintain the property that for any word in
the code table, all of its prefixes are also in the table, so we must
arrange things so that we will never remove a word that is the prefix
of another word in the table.)

To implement LRU efficiently is a considerable challenge.  One
strategy is to maintain a priority queue of code table entries
prioritized on their time of last access.  Whenever those times
change, the priority queue must be updated, and doing that cheaply at
every stage is the hard part.

*** Fixed length code, but allow code words to grow in length
Another possibility is to let the code words grow in length.  For
example, they could start out as 12-bit words, but when the table
fills, the algorithm switches to using 13-bit words, and so forth.
Of course, the amount of data compression will decrease as the code
word length grows, so this "solution" is probably not a win.

*** Use Huffman encoding on the LZW codewords
Yet another possibility is to let the LZW code words grow as above
(or be very long to begin with), and then use adaptive Huffman
encoding on those code words.  To make this work, we would keep track
of the number of times each code word is output during the course of
the LZW algorithm.  These frequencies would then be used to drive
Huffman encoding, so the most frequently used words would end up with
the shortest bit strings in a Huffman code.  The Huffman algorithm
would have to be made adaptive so that the Huffman tree could be
recomputed each time the frequency of any word changed.

** Demo programs (see demos/demo_25/25.1_lzw)

# Local Variables:
# mode: outline
# End: