                    F I N A L   S P E C I F I C A T I O N

                                        Due 2:00 AM, Friday, 23 January 2009

CS-223                           Homework #1

                         A Gentle Introduction to C

(40) Write a filter "Csquash" that copies a C program from the standard input
to the standard output, replacing each comment by a single space character and
condensing whitespace.  As discussed in class (and modified for the reasons
explained in the asides below), Csquash should

* Delete line splices (i.e., when finding a backslash immediately followed by
  a newline, behave as if these characters never appeared).

  Note:  Splices do not nest and are recognized BEFORE any other processing
  takes place.

  Aside:  Since the goal of Csquash is to reduce the length of the input file,
  there is no reason to preserve line splices rather than delete them.

* Replace both /*...*/ and //... comments by a single space character, which
  may then be condensed.

  Note:  /* ... */ comments do not nest, and //... comments do NOT include the
  trailing newline.

* Condense each sequence of spaces (including those that replace comments) by
  a single space and each sequence of newlines by a single newline.

  Note:  Other whitespace is just copied unchanged.

  Aside:  Condensing all whitespace into a single newline if it contains a
  newline and a single space otherwise makes it too difficult to limit the
  penalty for not implementing condensation.

* Not make ANY assumptions as to the maximum length of a line.

* Assume that the input does not contain any trigraphs (which does not require
  any code).

* Assume that the input does not contain any preprocessing directives (which
  does not require any code).

* Handle character constants and strings correctly.

* Handle escaped characters within character constants and strings correctly.

* Fail "gracefully" (i.e., neither go into an infinite loop nor cause a
  memory dump) if the input is not a legitimate C program or any of these
  assumptions is violated.
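
Aside:  The first requirement above (deleting line splices) needs only one
character of look-ahead.  What follows is a minimal sketch under stated
assumptions, not the instructor's solution.  The helper name next_unspliced
and its FILE * parameter are hypothetical and chosen only for illustration; a
program that may read only from stdin would pass stdin.  It relies on the
standard library call ungetc(), which guarantees one character of pushback.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (an assumption, not part of the assignment):
 * return the next character of `in` with every line splice removed,
 * or EOF at end of input.  A splice is a backslash immediately
 * followed by a newline. */
int next_unspliced(FILE *in)
{
    assert(in != NULL);
    for (;;) {
        int c = getc(in);
        if (c != '\\')
            return c;             /* ordinary character, or EOF */

        int next = getc(in);      /* look one character ahead */
        if (next == '\n')
            continue;             /* splice: drop both characters */

        if (next != EOF)
            ungetc(next, in);     /* not a splice: unread the look-ahead */
        return c;                 /* the backslash stands for itself */
    }
}
```

Because a character that follows a deleted splice is read afresh and never
combined with an earlier backslash, splices do not nest.  One caution: when
this helper returns a backslash it may already have used the single guaranteed
pushback, so a caller should not call ungetc() on the same stream before
reading again.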

Moreover, although Csquash is a single program, its processing can be viewed as
taking place in three logically successive phases:

  1. Delete every line splice.

  2. Replace each comment by a single space character.

  3. Replace each sequence of space characters by a single space and each
     sequence of newline characters by a single newline.

Thus if its input is a legitimate C program, its output is as well and has the
same meaning.

Use the submit command (see below) to turn in the source file(s) for Csquash,
a Makefile, and your log file (see below).  This assignment will be
face-to-face graded; further details will be given in class and the
yale.cs.cs223 newsgroup.

YOU MUST SUBMIT YOUR FILES (INCLUDING THE LOG FILE) AT THE END OF ANY SESSION
WHERE YOU SPEND AT LEAST ONE HOUR WRITING OR DEBUGGING CODE, AND AT LEAST ONCE
EVERY FIVE HOURS DURING LONGER SESSIONS.  (ALL SUBMISSIONS ARE RETAINED.)

Notes:

1. See Kernighan and Ritchie, pp. 192-4 and pp. 228-9 for more information
   about comments, escaped characters in char constants and strings,
   trigraphs, and line splices.  Excerpts are appended.

2. The public grading script will be /c/cs223/Hwk1/test.Csquash; my solution
   will be /c/cs223/Hwk1/Csquash.

3. Csquash is easier to write if you can peek at the next character in the
   standard input without reading it.  The macro

     #define ungetchar(c) ungetc(c,stdin)   // Unread char read from stdin

   is one way to provide this function.  It allows your program to "unread" a
   character (but only one) after getchar() has returned one.  The character
   unread will be the next returned by getchar().

   The function exit() allows your program to stop immediately, without having
   to terminate any surrounding loops or to return to main() from a function.

4. Neither strings nor arrays nor structs will be useful when writing Csquash.
   Moreover, you may only read from stdin and write to stdout (i.e., you may
   not use files).

5. The submit program can be invoked in eight different ways:

     /c/cs223/bin/submit  1  Makefile Csquash.c util.c time.log

   submits the named source files as your solution to Homework #1;

     /c/cs223/bin/check  2

   lists the files that you submitted for Homework #2;

     /c/cs223/bin/unsubmit  3  error.submit bogus.solution

   deletes the named files that you submitted previously for Homework #3
   (which is useful if you accidentally submit the wrong file);

     /c/cs223/bin/makeit  4  Csquash

   runs "make" on the files that you submitted previously for Homework #4;

     /c/cs223/bin/testit  5  Csquash

   runs the public test script for Csquash using the files that you submitted
   previously for Homework #5;

     /c/cs223/bin/protect  6  Csquash.c time.log

   protects the named files that you submitted previously for Homework #6 (so
   they cannot be deleted accidentally);

     /c/cs223/bin/unprotect  7  util.c time.log

   unprotects the named files that you submitted previously for Homework #7
   (so they can be deleted); and

     /c/cs223/bin/retrieve  8  util.c time.log

   retrieves copies of the named files that you submitted previously for
   Homework #8 (in case you accidentally delete your own copies).

6. Keep track of how you spend your time in completing this assignment.  Your
   log file should be of the general form:

     ESTIMATE of time to complete assignment: 10 hours

           Start  Time
     Date  Time   Spent  Work completed
     ----  -----  ----   --------------
     1/11  10:15  0:45   Read assignment and description of C comments et al.
                         in Kernighan and Ritchie
     1/14  15:45  1:15   Specification is complete; sketched solution using a
                         finite-state machine with look-ahead
     1/17  09:00  2:20   Wrote the program and eliminated compile-time errors;
                         code passes six tests
     1/19  19:05  2:00   Discovered and corrected two logical errors; code now
                         passes ten tests
     1/20  20:22  1:35   Finished debugging; program passes public test script
                  ----
                  7:55   TOTAL time spent

   but MUST contain

   * your estimate of the time required (made prior to writing any code),

   * the total time you actually spent, and

   * a brief discussion of the major difficulties you encountered in
     developing and debugging the program (and there will always be some).

   This log will generally be worth 5-10% of the total grade.

   N.B.  To facilitate analysis, the log file must be the only file submitted
   whose name contains the string "log" and the estimate (total) must be on
   the only line in that file that contains the string "ESTIMATE" ("TOTAL").

7. Prudence (and a 5-point penalty for code that does not make) suggests that
   you run makeit ("makeit 1 Csquash") and/or testit ("testit 1 Csquash")
   after you have submitted the final copies of your source files.

8. Correct handling of line splices and condensing streams of spaces/newlines
   will be worth at most 8 points each.  The purpose of "at most" is to give
   more flexibility in developing test scripts while allowing interactions
   between these features.

9. My solution is 100 lines of code (59 lines ignoring comments, blank lines,
   and brace-only lines).  If your solution looks to be much larger than this,
   you should talk to one of the instructional staff about your approach.

                                                                CS-223-01/16/09
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

             Excerpts from Appendix A of Kernighan and Ritchie

A2 Lexical Conventions

A program consists of one or more translation units stored in files.  It is
translated in several phases, which are described in Section A12.
The first phases do low-level lexical transformations, carry out directives
introduced by lines beginning with the # character, and perform macro
definition and expansion.  When the preprocessing of Section A12 is complete,
the program has been reduced to a sequence of tokens.

A2.2 Comments

The characters /* introduce a comment, which terminates with the characters
*/.  Comments do not nest, and they do not occur within string or character
literals.

A12 Preprocessing

A preprocessor performs macro substitution, conditional compilation, and
inclusion of named files.  Lines beginning with #, perhaps preceded by white
space, communicate with this preprocessor.  The syntax of these lines is
independent of the rest of the language; they may appear anywhere and have
effect that lasts (independent of scope) until the end of the translation
unit.  Line boundaries are significant; each line is analyzed individually
(but see Section A12.2 for how to adjoin lines).  To the preprocessor, a token
is any language token, or a character sequence giving a file name as in the
#include directive (Section A12.4); in addition, any character not otherwise
defined is taken as a token.  However, the effect of white space characters
other than space and horizontal tab is undefined within preprocessor lines.

Preprocessing itself takes place in several logically successive phases that
may, in a particular implementation, be condensed.

1. First, trigraph sequences as described in Section A12.1 are replaced by
   their equivalents.  Should the operating system environment require it,
   newline characters are introduced between the lines of the source file.

2. Each occurrence of a backslash character \ followed by a newline is
   deleted, thus splicing lines (Section A12.2).

3. The program is split into tokens separated by white-space characters;
   comments are replaced by a single space.  Then preprocessing directives are
   obeyed, and macros (Sections A12.3-A12.10) are expanded.

4. Escape sequences in constants and string literals (Sections A2.5.2, A2.6)
   are replaced by their equivalents; then adjacent string literals are
   concatenated.

                                                                CS-223-01/08/09
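
Illustration (not part of the excerpt):  The rule in Section A2.2 that
comments do not nest means that a /* comment ends at the first */ that follows
it, whatever else appears in between.  The following is a minimal sketch of
that rule, assuming that line splices have already been deleted and that the
opening /* has just been read.  The helper name skip_block_comment, its
FILE * parameter, and its return convention are hypothetical and chosen only
for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (an assumption, not from K&R or the assignment):
 * call it just after the opening slash-star has been consumed.  It
 * discards characters through the first closing star-slash.
 * Returns 1 if the comment was terminated, 0 if EOF came first. */
int skip_block_comment(FILE *in)
{
    assert(in != NULL);
    int prev = 0;                 /* previous character inside the comment */
    int c;
    while ((c = getc(in)) != EOF) {
        if (prev == '*' && c == '/')
            return 1;             /* first star-slash ends the comment */
        prev = c;
    }
    return 0;                     /* unterminated comment: report, do not loop */
}
```

Starting prev at 0 keeps the slash of the opener from pairing with a following
star, so the three characters /*/ do not form a complete comment.  Returning 0
on an unterminated comment lets the caller stop cleanly rather than loop.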