========================================
Notes for Lecture 25 - April 24, 2008
========================================

--------------------------------------------------------------------
* Bytes, characters, and integers (review)
** Bytes are units of information storage
** Characters and integers are abstract objects represented by bytes
** C semantics are based on integers, not bytes
To manipulate bytes in C, one must manipulate integers and understand
the correspondence between integers and the bytes that represent them.

** Reading bytes from a file
*** fgetc(instream)
**** Semantics
Returns an integer in the range [0...255] if the byte exists, or EOF
(a negative integer, -1 on most systems) if end of file has been
reached.  The integer returned is the value of the byte, interpreted
as a binary number.

**** Correct usage
    int ch;
    ch = fgetc(instream);
sets ch to the value of the next byte in the stream.

**** Incorrect usage #1
    char ch;
    ch = fgetc(instream);
The problem here has to do with the assignment statement.  fgetc()
returns an integer as described above, but on most machines ch is a
signed 8-bit integer variable (with a misleading type name) whose
range of values is [-128...127].  Since numbers larger than 127 cannot
be stored in ch, the result of the assignment is implementation-defined
in those cases where fgetc() returns a number > 127.  [Although the
standard does not fix the result, what actually happens on typical
machines is that the low-order bits get copied, and the result is a
negative number.  In particular, the byte 255 becomes -1, the usual
value of EOF, so it can no longer be told apart from end of file.]

**** Incorrect usage #2
    unsigned char ch;
    ch = fgetc(instream);
This correctly reads the 256 possible bytes, but when fgetc() returns
EOF (-1), the number stored in ch is 255, the same as if a byte of
value 255 had been read from the file.

**** Reading a byte into a variable of type unsigned char
Suppose we want to end up with the next byte of the file stored in an
unsigned char, but we don't want to lose the end-of-file indication.
What can we do?
***** Method 1
    unsigned char ch;
    int x;
    x = fgetc(instream);
    if (x != EOF) {
        ch = x;
    }
Now, if control reaches the last assignment, the value of x is between
0 and 255.  Since these are all within the range of ch, the assignment
correctly copies the byte into ch.

***** Method 2
    unsigned char ch;
    int n;
    n = fscanf( instream, "%c", &ch );
    if (n != EOF) {
        ... // use ch
    }

--------------------------------------------------------------------
* Flexible arrays in C99
** Incomplete array types
An array type specification is incomplete if it fails to specify the
length of the array.  Incomplete array types have always been allowed
for function parameters.

Example:  A function to sum the first n elements of a double array A
might be declared as
    double sum( int n, double A[] );

** Flexible arrays
A flexible array is a variable-length structure.  It is declared by
giving the last component an incomplete array type.

Example:
    typedef struct flex {
        int size;
        double A[];
    }* Flex;

** Semantics
The compiler lays out the fields of the struct as if it were a regular
structure, up to the incomplete array component (which must be the
last field of the struct).  It includes any padding that must be added
before the last component to satisfy byte alignment requirements for
that type (double in the above example).

Example:  Assuming doubles must be aligned on 8-byte boundaries and
ints have size 4, then 4 padding bytes must be inserted between size
and A[0].  sizeof(struct flex) counts these bytes but not the bytes
that the double array will eventually occupy.  Thus,
sizeof(struct flex)=8.

** Allocating a flexible array
Example:  To create a Flex to contain n doubles, where n is a variable:
    Flex tbl;
    tbl = malloc( sizeof(struct flex) + n*sizeof(double) );
    if (tbl == NULL) {
        ... // error
    }
    tbl->size = n;
Now, doubles can be stored into tbl->A[j] for j in the range [0...n-1].
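
The allocation steps above can be wrapped in one helper.  What follows
is only a minimal sketch: the name flex_new, the zero-initialization,
and the overflow check are assumptions added for illustration and are
not taken from the lecture demos.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct flex {
    int size;
    double A[];              /* flexible array member: must come last */
} *Flex;

/* Hypothetical helper (not from the lecture demos): allocate a Flex
   with room for n doubles, all set to 0.0.  Returns NULL on failure. */
Flex flex_new(int n)
{
    assert(n >= 0);
    /* Refuse sizes whose total byte count would overflow size_t. */
    if ((size_t)n > (SIZE_MAX - sizeof(struct flex)) / sizeof(double))
        return NULL;
    /* One block holds the header (with padding) plus the n doubles. */
    Flex tbl = malloc(sizeof(struct flex) + (size_t)n * sizeof(double));
    if (tbl == NULL)
        return NULL;
    tbl->size = n;
    for (int j = 0; j < n; j++)
        tbl->A[j] = 0.0;
    return tbl;
}
```

Because the header and the array live in a single malloc'ed block, the
caller releases both with one call, free(tbl).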
** Using flexible arrays
Example:  A function to sum all of the elements of a Flex argument:
    double sum( Flex table )
    {
        int n = table->size;
        int k;
        double total = 0.0;
        for (k=0; k<n; k++) total += table->A[k];
        return total;
    }

--------------------------------------------------------------------
* LZW data compression
** Algorithm
In preparation for both encoding and decoding, create a code table
(symtab) where the keys are byte strings and the values are numbers
of a certain length (e.g., 12-bit).  Initialize the table by assigning
code words 0...255 to the 1-byte strings.  The remaining code words
are unassigned.

*** Encoding algorithm
1. Read input file a byte at a time.  Let w be the longest input
   prefix such that w and all of its prefixes are in the code table.
2. Look up w in the code table and output its 12-bit code word.
3. Let a be the next input byte following w.  Assign the next unused
   code word to "wa" and put "wa" in the code table.  If the table is
   full (i.e., all 12-bit code words have been assigned), then skip
   this step but continue encoding.
4. Repeat steps 1-3 until the entire file is encoded.

*** Decoding algorithm
1. Read the encoded file 12 bits at a time.  Let c be the next code
   word, and let w be the byte string associated with c.
2. Write w to the output file.
3a. Look ahead to the next code word c'.  There are two cases
    depending on whether c' has already been assigned a word in the
    code table.
3b. Case i) c' is in the table.  Let x be the byte string associated
    with c'.  Let a be the first byte of x.
3c. Case ii) c' is not in the table.  Let a be the first byte of w.
3d. Assign the next unused code word to "wa" and put "wa" in the code
    table.  If the table is full, then skip this step but continue
    decoding.
4. Repeat steps 1-3 until the entire file is decoded.

*** Why it works
The decoding algorithm tracks the encoding algorithm.  Even though the
code table changes after each stage, the code table after stage k of
decoding is identical to what it was after stage k of encoding.
The tricky case is step 3 of the decoding algorithm.  During encoding,
the word "wa" was put in the code table, where "a" is the next
(not-yet-processed) input byte.  During decoding, "a" has not yet been
decoded.  In order to update the code table properly, we must somehow
determine what the next byte of input is before we have decoded it.
While this sounds paradoxical, we can break the problem into two
cases, both of which can be handled.

If the next 12-bit code word is already in the table, then we can look
ahead in the encoded file, get the next word, and find it in the code
table.  The first byte of its associated byte string is the desired
byte "a".

If the next 12-bit code word is not yet in the table, we know it will
be there at the completion of this stage, since during encoding it
must have been in the table at the next stage.  Therefore, the byte
string that will be associated with it is the string "wa", where "a"
is the byte that we are trying to determine.  But if "wa" is the
decoding of the next code word, then its first byte is the first byte
of "w"!  This is the reason why step (3c) above works.

** Implementation issues
*** Byte order
Internally, each 12-bit code word is simply represented by a signed or
unsigned integer of sufficient length (e.g., an int or short int).
While these integers could be written to the output file as ordinary
decimal numbers, that would be quite wasteful of space and would
negate the hoped-for savings in file size.  Instead, we want to
represent each pair of code words by 3 bytes of the encoded file.  But
which three bytes?  Two natural orders are possible.

Example:  Suppose the two code words to be written out are (in hex):
0x123 0xabc.

**** Big endian
In big endian order, integers are written out high-order bits first.
The 3 code bytes would be: 0x12 0x3a 0xbc.

**** Little endian
In little endian order, integers are written out low-order bits first.
The 3 code bytes would be: 0x23 0xc1 0xab.
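
Both layouts can be produced with shifts and masks alone.  The sketch
below is illustrative only: the function names pack_big, unpack_big,
pack_little, and unpack_little are assumptions and are not taken from
the demo program.

```c
#include <assert.h>

/* Big endian: high-order bits of each code word are written first. */
void pack_big(unsigned c1, unsigned c2, unsigned char out[3])
{
    assert(c1 < 4096 && c2 < 4096);           /* 12-bit code words only */
    out[0] = (unsigned char)(c1 >> 4);        /* high 8 bits of c1 */
    out[1] = (unsigned char)(((c1 & 0xF) << 4) | (c2 >> 8));
                                              /* low 4 of c1, high 4 of c2 */
    out[2] = (unsigned char)(c2 & 0xFF);      /* low 8 bits of c2 */
}

void unpack_big(const unsigned char in[3], unsigned *c1, unsigned *c2)
{
    *c1 = ((unsigned)in[0] << 4) | (in[1] >> 4);
    *c2 = ((unsigned)(in[1] & 0xF) << 8) | in[2];
}

/* Little endian: low-order bits of each code word are written first. */
void pack_little(unsigned c1, unsigned c2, unsigned char out[3])
{
    assert(c1 < 4096 && c2 < 4096);
    out[0] = (unsigned char)(c1 & 0xFF);      /* low 8 bits of c1 */
    out[1] = (unsigned char)(((c2 & 0xF) << 4) | (c1 >> 8));
                                              /* low 4 of c2, high 4 of c1 */
    out[2] = (unsigned char)(c2 >> 4);        /* high 8 bits of c2 */
}

void unpack_little(const unsigned char in[3], unsigned *c1, unsigned *c2)
{
    *c1 = ((unsigned)(in[1] & 0xF) << 8) | in[0];
    *c2 = ((unsigned)in[2] << 4) | (in[1] >> 4);
}
```

With c1 = 0x123 and c2 = 0xabc, pack_big produces 0x12 0x3a 0xbc and
pack_little produces 0x23 0xc1 0xab, matching the example above.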
Either method is acceptable, but clearly the encoder and decoder must
agree.

*** Big-endian and little-endian machines
Computers are big-endian or little-endian.  This is noticeable only if
one looks at the bytes of a number, not through shifting and masking
(which always gives the same result no matter what the flavor of the
computer), but through type casting or unions.

Example:
    union view {
        long unsigned int lo;
        unsigned char b[4];
    } x;

The 4 bytes of x (on a 32-bit machine) can be viewed in two ways:
x.lo, which is a long unsigned int, and x.b[0], ..., x.b[3], which are
the four bytes that comprise it.  If x.lo in hex is 0x12345678, then
what is observed in b[] depends on the endianness of the machine.

Big-endian machine:
    b[0] = 0x12, b[1] = 0x34, b[2] = 0x56, b[3] = 0x78.
Little-endian machine:
    b[0] = 0x78, b[1] = 0x56, b[2] = 0x34, b[3] = 0x12.

The demo program uses shifting to pack and unpack the bits, so it is
correct on either kind of machine.

*** Storage management
Because a new string must be stored at each stage of encoding or
decoding, there is a big performance advantage to using a string store
rather than malloc() for allocating the storage for the byte strings.

*** Efficiency -- big cost to maintaining sorted table
Several different algorithms are possible for maintaining the code
table.  One method that gives good performance is a hash table since
it provides fast lookups and also allows new entries to be inserted
quickly.

A simple-to-implement method stores the code table in an array, sorted
on the byte strings.  This can be searched quickly using binary
search.  However, inserting a new entry requires moving existing
elements in the array to make room for the new element.  The demo
program "compress-bin" uses this method and is considerably slower
than the program "compress-hash" that uses a hash table.

Another data structure that we've studied that could be used is a
binary search tree.
Here, searches are fast if the tree can be kept balanced but are slow
if the tree gets badly out of balance.  Using this for LZW compression
is likely to cause unpredictable performance since particular data
patterns might give rise to especially unbalanced trees (such as a
file of all 0's, where the byte strings to be inserted in the code
table would be "0", "00", "000", ... in that order and would result in
a maximally unbalanced tree).  Balanced binary search tree algorithms
could be used but tend to be more complicated to code and have
somewhat greater overhead in both their time and space requirements.

** Possible improvements
One big weakness of LZW encoding as I've presented it is that the
table doesn't change once it's full.  This is fine if the input file
is uniform throughout, so that the strings that occur in the first
part of the file occur throughout with more or less the same
distribution.  But if the file is not uniform in this way, poor
compression might result on the later parts of the file that do not
look like the beginning.

I'll sketch out some possibilities for getting around this limitation,
but I don't know how well any of them work or if they're used in
practice.

*** LRU
One possible way around this problem is to not freeze the code table
when it becomes full but instead to replace seldom-used entries by new
ones.  A natural strategy is LRU (least recently used).  Here, we
attach to each code word the "time" (the stage number) when that code
word was last accessed.  When it comes time to replace an entry, the
one with the earliest last-access time is removed.  (Note that we will
want to maintain the property that for any word in the code table, all
of its prefixes are also in the table, so we must arrange things so
that we never remove a word that is the prefix of another word in the
table.)

To implement LRU efficiently is a considerable challenge.
One strategy is to maintain a priority queue of code table entries
prioritized on their time of last access.  Whenever those times
change, the priority queue must be updated.  Doing this efficiently is
a challenge.

*** Fixed length code, but allow code words to grow in length
Another possibility is to let the code words grow in length.  For
example, they could start out as 12-bit words, but when the table
fills, the algorithm switches to using 13-bit words, and so forth.  Of
course, the amount of data compression will decrease as the code word
length grows, so this "solution" is probably not a win.

*** Use Huffman encoding on the LZW codewords
Yet another possibility is to let the LZW code words grow as above (or
be very long to begin with), and then use adaptive Huffman encoding on
those code words.  To make this work, we would keep track of the
number of times each code word is output during the course of the LZW
algorithm.  These frequencies would then be used to drive Huffman
encoding, so the most frequently used words would end up with the
shortest bit strings in a Huffman code.  The Huffman algorithm would
have to be made adaptive so that the Huffman tree could be recomputed
each time the frequency of any word changed.

** Demo
See demos-25/1-lzw
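
Separately from the demo programs, the encoding and decoding steps
described under "Algorithm" can be sketched in memory as follows.
This is not the demo code, and it makes several simplifying
assumptions: code words are kept as ints in an array rather than
packed into bytes, each table entry is stored as a (prefix code, last
byte) pair instead of a full byte string, lookups use a slow linear
search, and the decoder trusts its input to be valid.  All names
(Table, lzw_encode, lzw_decode, ...) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAXCODES 4096                 /* 12-bit code words */

/* Entry c >= 256 stands for the string of prefix[c] followed by the
   byte last[c].  Codes 0...255 are the 1-byte strings themselves. */
typedef struct {
    int prefix[MAXCODES];
    unsigned char last[MAXCODES];
    int next;                         /* next unused code word */
} Table;

static int table_find(const Table *t, int w, unsigned char a)
{
    for (int c = 256; c < t->next; c++)        /* linear search: slow */
        if (t->prefix[c] == w && t->last[c] == a) return c;
    return -1;
}

static void table_add(Table *t, int w, unsigned char a)
{
    if (t->next == MAXCODES) return;  /* table full: skip, keep going */
    t->prefix[t->next] = w;
    t->last[t->next] = a;
    t->next++;
}

/* Encode n bytes; codes[] needs room for n ints.  Returns code count. */
size_t lzw_encode(const unsigned char *in, size_t n, int *codes)
{
    Table t;
    size_t ncodes = 0;
    t.next = 256;
    if (n == 0) return 0;
    int w = in[0];                    /* code of the current prefix w */
    for (size_t i = 1; i < n; i++) {
        int c = table_find(&t, w, in[i]);
        if (c >= 0) { w = c; continue; }   /* "wa" known: extend w */
        codes[ncodes++] = w;               /* step 2: output code of w */
        table_add(&t, w, in[i]);           /* step 3: add "wa" */
        w = in[i];
    }
    codes[ncodes++] = w;
    return ncodes;
}

/* Write the byte string for code c at out; return its length. */
static size_t expand(const Table *t, int c, unsigned char *out)
{
    size_t len = 1;
    for (int k = c; k >= 256; k = t->prefix[k]) len++;
    for (size_t i = len; c >= 256; c = t->prefix[c]) out[--i] = t->last[c];
    out[0] = (unsigned char)c;
    return len;
}

/* Decode ncodes code words; out[] must be large enough for the result.
   Returns the number of bytes written. */
size_t lzw_decode(const int *codes, size_t ncodes, unsigned char *out)
{
    Table t;
    size_t n = 0;
    t.next = 256;
    for (size_t j = 0; j < ncodes; j++) {
        int c = codes[j];
        assert(c >= 0 && c < t.next);          /* valid input assumed */
        size_t len = expand(&t, c, out + n);   /* step 2: write w */
        n += len;
        if (j + 1 == ncodes) break;
        int k = codes[j + 1];                  /* step 3a: look ahead to c' */
        unsigned char a;
        if (k < t.next) {                      /* case i: c' is in the table */
            while (k >= 256) k = t.prefix[k];
            a = (unsigned char)k;
        } else {                               /* case ii: first byte of w */
            a = out[n - len];
        }
        table_add(&t, c, a);                   /* step 3d: add "wa" */
    }
    return n;
}
```

Encoding the seven bytes "aaaaaaa" yields the four code words 97, 256,
257, 97.  When these are decoded, codes 256 and 257 each arrive before
the decoder has added them to its table, so both go through case ii.
Replacing table_find with a hash table lookup, as discussed under
"Efficiency", would remove the linear search.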