YALE UNIVERSITY
DEPARTMENT OF COMPUTER SCIENCE
| CPSC 223b: Data Structures and Programming Techniques | Handout #6 (v. 4) | |
| Professor M. J. Fischer | January 29, 2008 | |
Problem Set 3
Due before midnight on Tuesday, February 5, 2008.
A file is a sequence of bytes. What those bytes represent and how they are to be interpreted is application dependent.
The command od is like a microscope for examining a file and looking directly at the bytes that comprise it. It is a useful tool for finding out what is “really” there as opposed to what the application that created the file is willing to let you see. Files often contain hidden data, which a tool like od can expose.
The name od is for “octal dump” and derives from the day when characters were 6 bits long and conventionally represented in octal (base 8). However, the od command now has many options and can dump a file in hex (base 16), which is convenient for representing 8-bit bytes. (See handout 5 on Bits and Bytes.)
The command od -v -Ad -tx1 file reads the specified file a byte at a time and prints out each byte as a hex number (without the leading “0x”). For example, if the file “hello.txt” contains the single line “hello!” with its terminating new line character, then od -v -Ad -tx1 hello.txt gives
0000000 68 65 6c 6c 6f 21 0a
0000007 |
The first number on each line is a decimal byte position. The remaining numbers are byte values in hex, each representing a single byte of the file. The byte position is the position in the file corresponding to the first byte value on the line. Bytes are numbered starting with zero, so the first line of output always has byte position 0. The last line contains a byte position but no byte values. The byte position is the position of the next byte of the file if there were one. Equivalently it’s the length of the file.
In the above example, the first byte, 0x68, is the code for the character ‘h’. The last byte, 0x0a, is the code for the new line character. The file itself contains 7 characters in positions 0 through 6, so the first “unused” position in the file is 7, which is the same as the file’s length.
The -Ad switch to od causes the byte position to be represented in decimal. The -v switch disables line suppression (which makes the output simpler, though less concise). The od command has many options. Type man od or info od for detailed information on what they do.
If we change -tx1 to -tx2 in the above example, then the file is read two bytes at a time. Each pair of bytes is interpreted as a short unsigned int and printed out as a 4-digit hex number. Thus, od -v -Ad -tx2 hello.txt gives
0000000 6568 6c6c 216f 000a
0000007 |
Notice that an extra zero byte was placed at the end of the file. However, the file length in the last line is 7, not 8, so we know that the apparent 8th byte of the file should be ignored. Notice also the different order that the bytes appear. This is because this example was produced on a little-endian machine (like the Zoo machines). The first byte 0x68 becomes the lower-order byte of the 2-byte integer and the second byte 0x65 becomes the high-order byte. Thus, we have
| 0x65 | 0x68 |
| 01100101 | 01101000 |
Your job is to write a program unodx1 to perform the inverse of od -v -Ad -tx1. Suppose file foo.dump1 is the output of od, produced from foo by
od -v -Ad -tx1 foo > foo.dump1
|
Then
unodx1 < foo.dump1 > foo_COPY
|
should produce a new file foo_COPY that is identical to the original file foo. Thus, unodx1 undoes what od -tx1 … does. You can use the cmp (compare) command to test if two files are identical or not. Your program should read from standard input and write to standard output. The shell input and output redirection characters, ‘<’ and ‘>’ respectively, can be used to redirect standard input and standard output to files as illustrated above.
Your program should check its input file for validity and reject invalid files with an appropriate error comment. Defining what constitutes an invalid file is not so easy. We say a file is strictly correct if it could be the output of od -v -Ad -tx1 on the Zoo machines. A file is valid if it is strictly correct or is a minor variant of a strictly correct file, as defined below. All other files are invalid.
A strictly correct file has three kinds of lines. A full line contains a decimal byte position followed by 16 2-digit hex numbers. A partial line contains a decimal byte position followed by at least one but fewer than 16 2-digit hex numbers. A terminal line contains only a decimal number which is the total file length. A strictly correct file consists of zero or more full lines, followed by zero or one partial line, followed by a terminal line. The byte position of the first line is 0, and for every full or partial line, the byte position of the following line is equal to the byte position of the current line plus the number of hex numbers on the current line.
A valid file is one that is similar in form to a strictly correct file except for the following:
Your program should accept all strictly correct files and work correctly on them. It should reject all invalid files. Valid files that are not strictly correct may be accepted by your program or rejected with an error comment as you see fit, but they should never cause your program to go into an infinite loop. If your program accepts a valid but not strictly correct file, it should produce “reasonable” output. (For example, if a hex byte code is more than 8 bits long, then its reasonable to output only the 8 low-order bits.)
You may use the function fatal() defined in util.h to report errors. The arguments to fatal() are the same as for printf(): a format string followed by data values corresponding to any data-conversion specifications in the format. fatal() prints its error comment to stderr followed by a new line and then terminates the program.
To use fatal() in your program, you will need to #include util.h in your source file. You will also need to modify your Makefile to compile util.c and to link in util.o with your object code. The files util.c and util.h are located on the Zoo in /c/cs223/assignments/ps3.
For extra credit, write another program unodx2 to perform the inverse of od -v -Ad -tx2. Do not try this until you have unodx1 finished. It’s not completely straightforward and requires use of union or pointer casts to handle the byte order problem, and it requires that you implement some sort of read-ahead scheme for the input or buffering of the output to prevent the last byte of each line from being written out until you know for sure that it belongs in the file and isn’t just a filler byte.
You can use scanf() with format specifier %hhx to read hex number into an unsigned char, and %hx to read a hex number into a short unsigned int.
To do line-at-a-time processing, you can read a line into a buffer using scanf() with format specifier %num[^\n], where num is a decimal number that is one less than the size of the buffer. Thus, if the line buffer is size 100, then the format would be %99[^\n]. This reads until a new line character is encountered or until a total of 99 characters have been read. It then puts a null character '\0' into the buffer.
Once the line is in a buffer, you can read it and convert numbers from it by using sscanf(). sscanf() takes a string as its first argument but otherwise works just like scanf(). When using multiple calls on sscanf() to process a single input line, you need to know how much of the string each call has read. This can be obtained by putting a %n format specifier at the end of the format. %n doesn’t read any characters but instead stores the number of characters read so far through the last argument, which must be a pointer to int.
I will supply a files mystery.dump1 and mystery.dump2 which, when decoded by unodx1 and unodx2, respectively, should give intelligible output. You should prepare a set of test files for each program to exercise the various special cases that your program might encounter, for example, a file that has no partial line, a file with a partial line, a file of length 0, a file of odd length, a file of even length, and so forth. You should also include in your test suite invalid files to trigger each of your error messages and files that contain as many different kinds of errors as you can think of, e.g., truncated files, files with garbage on the lines, files with too many numbers on a line, files with improper byte positions, and so forth. You should use the suffix .dump1 on the names of test data files intended as input to unodx1. If you do the extra credit part, you should similarly use .dump2 for files intended for unodx2.
You should name your source code file unodx1.c. You should provide a Makefile that will build unodx1 in response to either make or make all and that will delete the .o and executable files in response to make clean. You should submit your source files, Makefile, test input files, and the output produced by your program on your test data.
If you do the extra credit part, your Makefile should build both unodx1 and unodx2 in response to make or make all. Your submission should then of course also contain the necessary source code, test files, and output for the extra credit part.