YALE UNIVERSITY
DEPARTMENT OF COMPUTER SCIENCE

CPSC 223b: Data Structures and Programming Techniques Handout #16
Professor M. J. Fischer April 14, 2008



Problem Set 9

Due before midnight on Tuesday, April 22, 2008.

1 Prefix Codes

A set C of words has the prefix property if no word in the set is a proper prefix of another word in the set. Let C* be the set of all words composed of zero or more concatenations of copies of words in C. Then any word in C* can be uniquely parsed as a concatenation of words in C.

Let w0 ∈ C*. Here’s how to parse w0 into words of C. If w0 is the empty word, we are done. Otherwise, find the unique word v1 ∈ C that is a prefix of w0. Existence of v1 is because w0 ∈ C*. Uniqueness follows from the fact that C is a prefix code. Let w1 be the rest of w after v1 is removed. Thus, w0 = v1w1. It follows that w1 ∈ C*.

By induction, w1 can be uniquely parsed as w1 = v2vk for some k, where v2,,vk ∈ C. Then w = v1vk is the desired unique parse of w0.

Example Let C = {00,1,010,011}. No word of C is a proper prefix of any other, so this is a prefix code. Consider the word w = 01001011100011. The unique prefix of w in C is 010. Removing this leaves the word v1 = 01011100011. It too has 010 as a unique prefix. Proceeding in this manner, we parse w as w = 010|010|1|1|1|00|011.

A code is a one-to-one map E from a finite alphabet Σ to a set C of code words over some alphabet Γ. The code is a prefix code if C has the prefix property. A code E is readily extended to words over Σ by defining E(a1a2an) = E(a1)E(a2)E(an), where n 0 and a1,,an ∈ Σ.

Example (cont.) Continuing the example above, let Σ = {a,b,c,d} and let E be the map that takes a↦→00, b↦→1, c↦→010, and d↦→011. Then E(cabbad) = 010001100011.

Because of the unique parsability property of words in C*, the extended function E : Σ*C* is also one-to-one. Hence, there exists an inverse function D that maps words in the range of E to words over Σ. D has the property that D(E(w)) = w for any word w ∈ Σ*. Note that D is not defined for words in Γ*- C*.

2 File Compression Using Prefix Codes

Prefix codes can be used for file compression. We take Σ to be the set of 256 8-bit bytes, and Γ to be the set {0,1}. We apply the encoding function E to the bytes in the file to obtain the encoding of the file. This encoding, suitably represented as a sequence of bytes and written to a file, is the compressed version of the file.

To decompress the file, we need access to the same code that was used to compress it. Given the code, we read the bits of the encoded file, finding the shortest prefix z in the range of E. Let E(x) = z. Then x = D(z), so we output x. This process is repeated until the encoded file has been entirely read.

2.1 Representations of prefix codes

Different representations of the code are appropriate for encoding and decoding. For encoding, we use a code table. This is an array of bit strings, where the entry with index x is the code word E(x). Encoding then is simply a matter of indexing into this table for each byte of the file and writing out the corresponding bit strings.

For decoding, it is convenient to represent the code as a code tree. This is a strict binary tree whose leaves are labeled by plain text bytes. and whose edges are labeled by 0 or 1. The sequence of edge labels on the path from the root to a leaf labeled by x are the code word z = E(x). Given the encoded file, we walk walk down the tree, starting from the root, going to the left or right son at each step according to the successive bits of the encoded file. When a leaf is reached, it’s label is the byte corresponding to the prefix of the encoded file that has been read so far. That byte is then written out and the process repeated on the remainder of the file.

2.2 A sample implementation

The assignment directory /c/cs223/assignments/ps9 contains a complete implementation of an encoder and decoder for a general prefix code. The command pfxenc takes the names of three files as command line arguments: a code file, an input file, and an output file. It reads the input file, compresses it using the code table stored in the code file, and writes the result to the output file. The command pfxdec also takes the names of three files as command line arguments: a code file, an input file, and an output file. It reads the input file and decompresses it using the code table stored in the code file, and writes the result to the output file.

I have placed a sample code file in the assignment directory which you can use to try out the encoder and decoder commands. The file sample-code.pfx is in the format required by pfxenc and pfxdec. The file sample-code.txt gives the sample code in human-readable form.

2.3 File formats

Code file A code file consists of 256 blocks of bytes, where each block represents a code word. The first byte of the block is interpreted as the integer which is 1 less than the length of the code word (in bits). The remaining bytes of the block contain the bits of the code word, packed 8 bits per byte. If the length of the code word is not a multiple of 8, then the last byte will contain fewer than 8 bits. These bits are left-justified within the byte.

For example, the length-10 code word 0000 1110 11 would be represented as three bytes 0x09, 0x0e, 0xc0. The first byte, 0x09, is one less than the length of 10. The second byte, 0x0e, contains the first 8 bits of the code word. The third byte, 0xc0, contains the last two bits “11”, left-adjusted in the byte.

Compressed (encoded) file A compressed file consists of a single bit string, packed into bytes, with the bits from the last byte left-justified within the byte. A final byte is appended to the file giving the number of unused bits in the next-to-last byte. For example, the 14-bit string 1101 0111 0001 01 would be represented in the compressed file by 3 bytes: 0xd7, 0x14, 0x02. The first byte, 0xd7, contains the first 8 bits 1101 0111. The second byte, 0x14, contains the last 6 bits 0001 01, left-adjusted in the byte. The third byte, 0x02, is the number of unused bits in the second byte.

When decoding the compressed file, it is necessary to look ahead two bytes to know how to interpret the current byte. If the current byte is not the next-to-last byte of the file, then all 8 bits are valid. If it is the next-to-last byte, then the last byte tells one how many bits to discard. (Equivalently, 8 minus the last byte tells how many bits are valid.)

2.4 Program structure

The program consists of a number of files and modules. They are briefly described here. Look in the source code for additional information.

pfxenc.c
This is the main program for the pfxenc command. It reads the code file into a code table and uses the code table to encode the input.
pfxdec.c
This is the main program for the pfxdec command. It reads the code file into a code tree and uses the code tree to decode the input.
codetable
This module is used to represent and manipulate code tables. It also contains the function fwriteCodetable() which is the routine that writes a code table to a code file. (It’s not used by pfxenc and pfxdec, but it’s needed to generate code files.)
codetree
This module is used to represent and manipulate code trees. It contains functions to read a code tree from a file and to convert a code table to a code tree. It also contains functions to create Walker objects, which are tokens that can be used to walk down a tree.
bitstring
This module is used to represent and manipulate bit strings. A Bitstring object allows bits to be pushed and popped from the right end, and it contains support for reading and writing code words from the code file. It also contains iterators that allow a bit string to be scanned from left to right.
ibitstream
This module allows an open byte stream (of type FILE⋆) to be viewed as a padded bit stream. This is what is used by pfxdec to read the encoded file one bit at a time. To use it, first the file must be opened for reading with fopen(). Then an Ibitstream is created for the open file. The function getIbitstream() returns the next bit of input, or EOF if no further bits remain. To close an Ibitstream, first the stream should be closed with closeIbitstream(), and then the underlying file should be closed with fclose().
obitstream
This module allows an open output byte stream (of type FILE⋆) to be viewed as a padded bit stream. This is what is used by pfxenc to write the encoded file one bit at a time. To use it, first the file must be opened for writing with fopen(). Then an Obitstream is created for the open file. The function putbObitstream() writes a single bit to the stream. The function putsObitstream() writes an entire bit string to the stream. To close an Obitstream, first the stream should be closed with closeObitstream(), and then the underlying file should be closed with fclose(). Note that if the stream is not properly closed, some of the bits may get lost.
util
This is the familiar utility module containing fatal(), safe_malloc(), and safe_realloc().

3 Huffman Codes

A Huffman code is a prefix code that assigns short code words to frequently occurring symbols of a source file and longer code words to less frequent symbols so as to minimize the total length of the encoded file. A file encoded in this way will in general be shorter than the original file (not counting the space to record the code itself).

A Huffman tree is a strict binary tree built from a file of symbols. Each leaf is labeled with a symbol that occurs in the file. Each node also contains a weight. For a leaf labeled with symbol S, the weight is the number of occurrences of S in the file. For an internal node x, the weight is the total weight of the leaves in the subtree rooted at x. Thus, the weight of the root is the number of symbols in the file. The edges to its left and right sons of an internal node are labeled 0 and 1, respectively. Figure 1 gives an example of a Huffman tree for a file with five letters A, B, C, D, and E with frequencies 15, 9, 25, 14, 37, respectively. Of the many possible trees satisfying this description, the Huffman tree is constructed in a particular way to minimize the expected distance from root to a leaf, where the probability of a particular leaf is proportional to its frequency in the file.


PIC

Figure 1: A Huffman tree.


3.1 Constructing a Huffman tree

To create a Huffman tree, one first computes a frequency table for the number of occurrences of each symbol in the file. These frequencies become the initial weights of the symbols. Next, a node is created for each symbol that occurs in the file. Its weight is the number of occurrences of the symbol. These nodes will become the leaves of the constructed Huffman tree.

The algorithm for building the tree maintains a set of active nodes stored in a min-heap, that is, a heap in which the element of least weight is stored at the root. The algorithm then proceeds by repeatedly removing the two nodes of smallest weight from the heap (using deleteMinHeap()), combining them together into a new tree, and reinserting the root of that tree back into the heap. The construction stops when only a single node remains in the heap. The tree rooted at this node is the desired Huffman tree.

Here are the rules for combining two trees together into a new tree. Let A and B be trees with roots a and b, respectively. Create a new node c. Make a the left son of c and b the right son of c. Let the weight of c be the sum of the weights of a and b.

An example may make this construction clearer. Suppose one starts with five letters A, B, C, D, and E with frequencies 15, 9, 25, 14, 37, respectively. Each initial tree is a leaf node marked with a weight and a symbol (called its value). Table 1 shows the five initial nodes.


Table 1: Initial nodes.
 Tree  Weight V alue
--a-----15------A----

  b      9      B
  c     25      C
  d     14      D
  e     37      E


The two smallest nodes are b and d, so they are combined to give a subtree f of weight 23 which is then put back into the active set. At this point, the nodes in the active set are a, c, e, and f of weights 15, 25, 37, and 23, respectively. The two smallest of these are a and f, so they are combined to yield g. Continuing in this way results in the Huffman tree shown in Figure 1.

3.2 Extracting the prefix code

The sequence of edge labels leading from the root to a leaf labeled by symbol S is the code word corresponding to S. The code table for the tree of Figure 1 is given in Table 2. For example, the path to node a is 00, so the code word for A is 00.


Table 2: Code table.
 Sym bol  Code word
----A-------00------

    B       010
    C       10
    D       011
    E       11


To encode a file using this code table, one simply replaces each symbol by its code word.

For decoding, the tree is more convenient. To decode the next encoded symbol, one reads the encoded file bit by bit, walking down the tree from the root, going left or right at each step according to whether the next bit of the file is a 0 or a 1. When a leaf is encountered, the symbol labeling it is written out. This process is performed repeatedly until the entire file has been decoded.

4 Assignment

Your job is to implement Huffman file encoder and decoder commands hufenc and hufdec. These commands are similar to pfxenc and pfxdec, but with three important differences:

  1. The encoder hufenc generates a Huffman code from the file to be encoded and uses that code to encode the file rather than reading the code from a code file. It takes only two file names as command line arguments: the input and the output files.
  2. The encoder writes a serialization of the Huffman code tree to the output file before writing the encoded input file. Thus, a Huffman-encoded file consists of two parts: A representation of the Huffman code followed by the encoded input file.
  3. The decoder hufdec reads the code from the input file rather than from a separate code file. It takes only two file names as command line arguments: the input and the output files.

The Huffman code should be represented in the compressed file as the serialization of the Huffman code tree rather than using the format for a code file. This is for space efficiency. A full code table always contains 256 blocks of bytes, whereas a Huffman tree only contains leaves for the bytes that actually occur in the file.

5 Implementation Notes

  1. The encoder will read the clear text file twice: once to compute the number of times each symbol occurs and again to perform the actual encoding. The second pass is similar to what pfxenc does.
  2. You should use the rewind() command to go back to the beginning of the input file for the second pass rather than closing and reopening it. (This avoids possible problems of the file being modified or deleted between passes. It is also more efficient.)
  3. An encoded file consists of a sequence of bytes that represent a single bit stream with a padding count byte at the end as described in Section 2.3. The bit stream consists of two parts. The first part is a serialization of the Huffman tree (without weights, which are not needed once the tree is constructed). The second part is the sequence of code words that represent the original file. The second part follows immediately after the first part in the bit stream and does not necessarily begin on a byte boundary.
  4. The serialization of a tree is a string that represents the tree. The serialization of a Huffman tree that we use can be described recursively. A tree consisting of only a leaf marked with the symbol σ is represented by a 1 followed by the 8 bits of the symbol that is the value of the leaf—9 bits in all. A tree rooted at internal node c with two sons a and b is represented by a 0, followed by the representation of the tree rooted at a, followed by the representation of the tree rooted at b. Applying these rules to the subtrees rooted at the nodes of Figure 1 gives the encodings shown in Table 3.



    Table 3: Tree representations
     Node  B it string representation
-----------------------------
  a      1A
   b     1B
   c     1C
  d      1D
   e     1E
  f      01B1D
   g     01A01B1D

  h      01C1E
   i     001A01B1D01C1E


    Here, the letters A, B, C, D, and E stand for their 8-bit ASCII encodings, so the bit string 01000001 (the ASCII code for ‘A’) should be substituted for each occurrence of ‘A’ in Table 3 and similarly for ‘B’, ‘C’, ‘D’, and ‘E’. The serialization of the whole tree rooted at node i is thus

    0010100000101010000101010001000101000011101000101

  5. Huffman trees can be both read and written by very simple recursive routines. To serialize such a tree (i.e., to write it out as a bitstream), there are two cases depending what kind of a node the root is. If it’s an internal node, then write a 0, followed by a recursive write of the left subtree and then a recursive write of the right subtree. If it’s a leaf, then write a 1 followed by the 8-bits of the symbol stored at the leaf.

    To realize such a tree (i.e., to read it back in) is equally simple. Read the first bit. If it’s a 0, construct an internal node. Recursively read two subtrees and attach them as the two sons of the new node. If the first bit is a 1, construct a leaf. Read the next 8 bits, pack them together into a byte, and store that byte as the value of the leaf.

  6. The Huffman tree is just what you need for decoding, but it is not very convenient for encoding. After constructing the Huffman tree and writing it to the output file, the encoder should do a recursive tree walk and construct a code table similar to that used by pfxenc. This table consists of an array of 256 code words (suitably represented), indexed by a plain text byte.
  7. The code word corresponding to symbol σ is the sequence of edge labels that lead to the leaf marked by σ, as shown in Table 2. When performing a recursive tree walk, the problem you will face is to figure out, when you encounter a leaf, what the code word corresponding to that leaf is. One way to determine this would be to put a father link in each node. Then when you encountered a leaf, you could walk back up the tree to the root and figure out for each step whether the edge you traversed to get to your father was labeled by 0 or 1.

    A much simpler approach is to keep track of the path to each node you visit during the walk. You can do this by adding a path argument of type Bitstring to the recursive tree walk function walk. The precondition is that whenever walk is called at node x, the path argument contains the bit string that labels the path from the root to x.

    At the top level, you will call walk with the root and an empty path. Recursively, to walk an internal node, you will push 0 onto path and walk the left son. Then you will pop off the 0, push on a 1, and walk the right subtree. Remember to pop off the 1 before returning to your caller.

  8. There are two cases that require special treatment.
    1. An empty file contains no symbols. Hence, your frequency table will be all zeros, and the Huffman tree is degenerate. I suggest that you treat the empty file as a special case and represent it by the empty bit string (which in turn is represented by the 1-byte file 0x00 according to our padding conventions).
    2. A file consisting solely of one or more copies of a single symbol such as “AAAAAAA” will also present a problem. Since only one symbol has non-zero weight, the Huffman tree will consist of a single leaf marked by that symbol. But the path leading from root to leaf is the empty bit string, so any number of copies of that symbol will all be represented by the empty code word, which obviously does not allow the original file to be reconstructed. In this case, you should add a dummy second symbol of frequency 0 to the initial set of nodes before building the Huffman tree. This extra dummy node will cause the resulting tree to have a root and two leaves, which will avoid this degeneracy.
  9. In building the Huffman tree, you should use a heap to locate the node of least weight. I have put a copy of the heap module from demos-20 into the assignment directory for your use.
  10. When combining the two smallest trees during the Huffman tree construction, the code that results depends on the order in which the trees are combined. In order to get consistent results (and to make life easier for the grader), whenever two trees are combined, the trees should be combined in the order in which they were removed from the heap. That is, the first tree removed should become the left son of the new node and the second tree should become the right son. (Cf. Section 3.1.)

6 Deliverables

You should submit complete source code for your commands and a Makefile. You should also submit your test results showing what tests you performed on it and what the results of those tests were. As always, your tests should be as inclusive as possible, paying particular attention to the various special cases and edge conditions. Please give any sample Huffman-encoded files the suffix .hen so that they are easily distinguishable from the plain text files.