YALE UNIVERSITY
DEPARTMENT OF COMPUTER SCIENCE
| CPSC 223b: Data Structures and Programming Techniques | Handout #16 | |
| Professor M. J. Fischer | April 14, 2008 | |
Problem Set 9
Due before midnight on Tuesday, April 22, 2008.
A set C of words has the prefix property if no word in the set is a proper prefix of another word in the set. Let C* be the set of all words composed of zero or more concatenations of copies of words in C. Then any word in C* can be uniquely parsed as a concatenation of words in C.
Let w0
C*. Here’s how to parse w0 into words of C. If w0 is the empty word, we are done.
Otherwise, find the unique word v1
C that is a prefix of w0. Existence of v1 is because w0
C*.
Uniqueness follows from the fact that C is a prefix code. Let w1 be the rest of w after v1 is removed. Thus,
w0 = v1w1. It follows that w1
C*.
By induction, w1 can be uniquely parsed as w1 = v2…vk for some k, where v2,…,vk
C. Then
w = v1…vk is the desired unique parse of w0.
Example Let C = {00,1,010,011}. No word of C is a proper prefix of any other, so this is a prefix code. Consider the word w = 01001011100011. The unique prefix of w in C is 010. Removing this leaves the word v1 = 01011100011. It too has 010 as a unique prefix. Proceeding in this manner, we parse w as w = 010|010|1|1|1|00|011.
A code is a one-to-one map E from a finite alphabet Σ to a set C of code words over some alphabet Γ.
The code is a prefix code if C has the prefix property. A code E is readily extended to words over Σ by
defining E(a1a2…an) = E(a1)E(a2)…E(an), where n ≥ 0 and a1,…,an
Σ.
Example (cont.)
Continuing the example above, let Σ = {a,b,c,d} and let E be the map that takes a
00, b
1,
c
010, and d
011. Then E(cabbad) = 010001100011.
Because of the unique parsability property of words in C*, the extended function E : Σ*→ C* is also
one-to-one. Hence, there exists an inverse function D that maps words in the range of E to words over Σ.
D has the property that D(E(w)) = w for any word w
Σ*. Note that D is not defined for words in
Γ*- C*.
Prefix codes can be used for file compression. We take Σ to be the set of 256 8-bit bytes, and Γ to be the set {0,1}. We apply the encoding function E to the bytes in the file to obtain the encoding of the file. This encoding, suitably represented as a sequence of bytes and written to a file, is the compressed version of the file.
To decompress the file, we need access to the same code that was used to compress it. Given the code, we read the bits of the encoded file, finding the shortest prefix z in the range of E. Let E(x) = z. Then x = D(z), so we output x. This process is repeated until the encoded file has been entirely read.
Different representations of the code are appropriate for encoding and decoding. For encoding, we use a code table. This is an array of bit strings, where the entry with index x is the code word E(x). Encoding then is simply a matter of indexing into this table for each byte of the file and writing out the corresponding bit strings.
For decoding, it is convenient to represent the code as a code tree. This is a strict binary tree whose leaves are labeled by plain text bytes. and whose edges are labeled by 0 or 1. The sequence of edge labels on the path from the root to a leaf labeled by x are the code word z = E(x). Given the encoded file, we walk walk down the tree, starting from the root, going to the left or right son at each step according to the successive bits of the encoded file. When a leaf is reached, it’s label is the byte corresponding to the prefix of the encoded file that has been read so far. That byte is then written out and the process repeated on the remainder of the file.
The assignment directory /c/cs223/assignments/ps9 contains a complete implementation of an encoder and decoder for a general prefix code. The command pfxenc takes the names of three files as command line arguments: a code file, an input file, and an output file. It reads the input file, compresses it using the code table stored in the code file, and writes the result to the output file. The command pfxdec also takes the names of three files as command line arguments: a code file, an input file, and an output file. It reads the input file and decompresses it using the code table stored in the code file, and writes the result to the output file.
I have placed a sample code file in the assignment directory which you can use to try out the encoder and decoder commands. The file sample-code.pfx is in the format required by pfxenc and pfxdec. The file sample-code.txt gives the sample code in human-readable form.
Code file A code file consists of 256 blocks of bytes, where each block represents a code word. The first byte of the block is interpreted as the integer which is 1 less than the length of the code word (in bits). The remaining bytes of the block contain the bits of the code word, packed 8 bits per byte. If the length of the code word is not a multiple of 8, then the last byte will contain fewer than 8 bits. These bits are left-justified within the byte.
For example, the length-10 code word 0000 1110 11 would be represented as three bytes 0x09, 0x0e, 0xc0. The first byte, 0x09, is one less than the length of 10. The second byte, 0x0e, contains the first 8 bits of the code word. The third byte, 0xc0, contains the last two bits “11”, left-adjusted in the byte.
Compressed (encoded) file A compressed file consists of a single bit string, packed into bytes, with the bits from the last byte left-justified within the byte. A final byte is appended to the file giving the number of unused bits in the next-to-last byte. For example, the 14-bit string 1101 0111 0001 01 would be represented in the compressed file by 3 bytes: 0xd7, 0x14, 0x02. The first byte, 0xd7, contains the first 8 bits 1101 0111. The second byte, 0x14, contains the last 6 bits 0001 01, left-adjusted in the byte. The third byte, 0x02, is the number of unused bits in the second byte.
When decoding the compressed file, it is necessary to look ahead two bytes to know how to interpret the current byte. If the current byte is not the next-to-last byte of the file, then all 8 bits are valid. If it is the next-to-last byte, then the last byte tells one how many bits to discard. (Equivalently, 8 minus the last byte tells how many bits are valid.)
The program consists of a number of files and modules. They are briefly described here. Look in the source code for additional information.
A Huffman code is a prefix code that assigns short code words to frequently occurring symbols of a source file and longer code words to less frequent symbols so as to minimize the total length of the encoded file. A file encoded in this way will in general be shorter than the original file (not counting the space to record the code itself).
A Huffman tree is a strict binary tree built from a file of symbols. Each leaf is labeled with a symbol that occurs in the file. Each node also contains a weight. For a leaf labeled with symbol S, the weight is the number of occurrences of S in the file. For an internal node x, the weight is the total weight of the leaves in the subtree rooted at x. Thus, the weight of the root is the number of symbols in the file. The edges to its left and right sons of an internal node are labeled 0 and 1, respectively. Figure 1 gives an example of a Huffman tree for a file with five letters A, B, C, D, and E with frequencies 15, 9, 25, 14, 37, respectively. Of the many possible trees satisfying this description, the Huffman tree is constructed in a particular way to minimize the expected distance from root to a leaf, where the probability of a particular leaf is proportional to its frequency in the file.
To create a Huffman tree, one first computes a frequency table for the number of occurrences of each symbol in the file. These frequencies become the initial weights of the symbols. Next, a node is created for each symbol that occurs in the file. Its weight is the number of occurrences of the symbol. These nodes will become the leaves of the constructed Huffman tree.
The algorithm for building the tree maintains a set of active nodes stored in a min-heap, that is, a heap in which the element of least weight is stored at the root. The algorithm then proceeds by repeatedly removing the two nodes of smallest weight from the heap (using deleteMinHeap()), combining them together into a new tree, and reinserting the root of that tree back into the heap. The construction stops when only a single node remains in the heap. The tree rooted at this node is the desired Huffman tree.
Here are the rules for combining two trees together into a new tree. Let A and B be trees with roots a and b, respectively. Create a new node c. Make a the left son of c and b the right son of c. Let the weight of c be the sum of the weights of a and b.
An example may make this construction clearer. Suppose one starts with five letters A, B, C, D, and E with frequencies 15, 9, 25, 14, 37, respectively. Each initial tree is a leaf node marked with a weight and a symbol (called its value). Table 1 shows the five initial nodes.
The two smallest nodes are b and d, so they are combined to give a subtree f of weight 23 which is then put back into the active set. At this point, the nodes in the active set are a, c, e, and f of weights 15, 25, 37, and 23, respectively. The two smallest of these are a and f, so they are combined to yield g. Continuing in this way results in the Huffman tree shown in Figure 1.
The sequence of edge labels leading from the root to a leaf labeled by symbol S is the code word corresponding to S. The code table for the tree of Figure 1 is given in Table 2. For example, the path to node a is 00, so the code word for A is 00.
To encode a file using this code table, one simply replaces each symbol by its code word.
For decoding, the tree is more convenient. To decode the next encoded symbol, one reads the encoded file bit by bit, walking down the tree from the root, going left or right at each step according to whether the next bit of the file is a 0 or a 1. When a leaf is encountered, the symbol labeling it is written out. This process is performed repeatedly until the entire file has been decoded.
Your job is to implement Huffman file encoder and decoder commands hufenc and hufdec. These commands are similar to pfxenc and pfxdec, but with three important differences:
The Huffman code should be represented in the compressed file as the serialization of the Huffman code tree rather than using the format for a code file. This is for space efficiency. A full code table always contains 256 blocks of bytes, whereas a Huffman tree only contains leaves for the bytes that actually occur in the file.
Here, the letters A, B, C, D, and E stand for their 8-bit ASCII encodings, so the bit string 01000001 (the ASCII code for ‘A’) should be substituted for each occurrence of ‘A’ in Table 3 and similarly for ‘B’, ‘C’, ‘D’, and ‘E’. The serialization of the whole tree rooted at node i is thus

To realize such a tree (i.e., to read it back in) is equally simple. Read the first bit. If it’s a 0, construct an internal node. Recursively read two subtrees and attach them as the two sons of the new node. If the first bit is a 1, construct a leaf. Read the next 8 bits, pack them together into a byte, and store that byte as the value of the leaf.
A much simpler approach is to keep track of the path to each node you visit during the walk. You can do this by adding a path argument of type Bitstring to the recursive tree walk function walk. The precondition is that whenever walk is called at node x, the path argument contains the bit string that labels the path from the root to x.
At the top level, you will call walk with the root and an empty path. Recursively, to walk an internal node, you will push 0 onto path and walk the left son. Then you will pop off the 0, push on a 1, and walk the right subtree. Remember to pop off the 1 before returning to your caller.
You should submit complete source code for your commands and a Makefile. You should also submit your test results showing what tests you performed on it and what the results of those tests were. As always, your tests should be as inclusive as possible, paying particular attention to the various special cases and edge conditions. Please give any sample Huffman-encoded files the suffix .hen so that they are easily distinguishable from the plain text files.