P R E L I M I N A R Y   S P E C I F I C A T I O N

                                        Due 2:00 AM, Friday, 29 January 2016

CS-223   Homework #1   A Gentle Introduction to C

REMINDER: Do not under any circumstances copy another student's code or give
a copy of your code to another student. After discussing the assignment with
another student, you may not take any written or electronic record away.
Moreover, you must engage in a full hour of mind-numbing activity before you
work on it again. Such discussions must be noted in your log file.

(40) Write a filter "Count" that copies a C program from the standard input
to the standard output, replacing each newline that ends a line of code by a
string of the form " //nnn\n", where nnn is the number of lines of code at
that point in the source file. In particular, Count should

* Use the format " //%d\n" to print out the line numbers.

* Count as lines of code only those containing non-whitespace (see
  isspace()). Note: Comments, braces ({ and }) and the keyword "else" are
  treated as whitespace.

* Assume that the input does not contain any trigraphs (which does not
  require any code).

* Ignore the effect of preprocessing directives (e.g., if BEGIN is #define-d
  as {, then a line containing only BEGIN is a code line, whereas one
  containing only { is not). This does not require any code.

* Handle line splices correctly (i.e., when finding a backslash immediately
  followed by a newline in the input stream, copy these characters but behave
  as if they did not appear; note that splices do not nest and are recognized
  before any other processing takes place).

* Handle both C (/*...*/) and C++ (//...) comments correctly (e.g., they do
  not nest, they are the equivalent of a single space character, and a C++
  comment does not include the trailing newline).

* Handle char constants and strings correctly.

* Handle escaped characters within char constants and strings correctly.

* Fail "gracefully" (i.e., neither go into an infinite loop nor cause a
  memory dump) if the input is not a legitimate C program or any of the
  assumptions above is violated.

but need not

* Allow multicharacter char constants.

Moreover, Count should not

* Make ANY assumptions as to the maximum length of a line.

* Use any global variables.

* Use any arrays or pointers.

Use the submit command (see below) to turn in the source file(s) for Count, a
Makefile, and your log file (see below).

YOU MUST SUBMIT YOUR FILES (INCLUDING THE LOG FILE) AT THE END OF ANY SESSION
WHERE YOU SPEND AT LEAST ONE-HALF HOUR WRITING OR DEBUGGING CODE, AND AT LEAST
ONCE EVERY HOUR DURING LONGER SESSIONS. (All submissions are retained.)


Notes
~~~~~
1. See Kernighan and Ritchie, pp. 192-194 and pp. 228-229, for more
   information about comments, escaped characters in char constants and
   strings, trigraphs, and line splices. Excerpts are appended.

2. When available, the public grading script will be
   /c/cs223/Hwk1/test.Count (and my solution will be /c/cs223/Hwk1/Count).
   To run it, type

     % /c/cs223/Hwk1/test.Count

   (here % is the shell prompt). The script uses make to create Count. To
   run each test it redirects the test file (e.g.,
   /c/cs223/Hwk1/Tests/t01.c for Test #01) to the standard input of Count
   and redirects the standard output to a temporary file. Then it compares
   this file with the expected output for that input (e.g.,
   /c/cs223/Hwk1/Tests/t01.cs for Test #01). Your program passes the test
   only if the two files are identical.

   To run your program on the file for Test #01, type

     % ./Count < /c/cs223/Hwk1/Tests/t01.c

   To compare the output from your program with the expected output, type

     % ./Count < /c/cs223/Hwk1/Tests/t01.c | cmp - /c/cs223/Hwk1/Tests/t01.cs

   (cmp outputs the first character where the files differ) or

     % ./Count < /c/cs223/Hwk1/Tests/t01.c | diff - /c/cs223/Hwk1/Tests/t01.cs

   (diff outputs the lines where they differ but uses a looser definition
   for "identical") or

     % /c/cs223/Hwk1/test.Count 01

   (you may specify more than one test here).

   If your output looks the same as what is expected, but your program still
   fails the test, there are probably some invisible characters in your
   output. To make all characters visible (except blanks), type

     % ./Count < /c/cs223/Hwk1/Tests/t01.c | cat -vet

   or

     % ./Count < /c/cs223/Hwk1/Tests/t01.c | od -bc

3. Keep track of how you spend your time in completing this assignment. Your
   log file should be of the general form (that below is fictitious):

     ESTIMATE of time to complete assignment: 10 hours

           Time     Time
     Date  Started  Spent  Work completed
     ----  -------  ----   --------------
     1/13  10:15pm  0:45   Read assignment and relevant material in K&R
     1/16   4:45pm  1:15   Sketched solution using a finite-state machine
                           with one-character look-ahead
     1/19   9:00am  2:20   Wrote the program and eliminated compile-time
                           errors; code passes eight tests
     1/20   7:05pm  2:00   Discovered and corrected two logical errors; code
                           now passes eleven tests
     1/23  11:00am  1:35   Finished debugging; program passes all public
                           tests
                    ----
                    7:55   TOTAL time spent

     I discussed my solution with: Peter Salovey, Ben Polak, Tamar Gendler,
     and Jonathan Holloway (and watched four episodes of The Simpsons).

   but MUST contain

   * your estimate of the time required (made prior to writing any code),

   * the total time you actually spent on the assignment,

   * the names of all others (but not members of the teaching staff) with
     whom you discussed the assignment for more than 10 minutes, and

   * a brief discussion (100 words MINIMUM) of the major conceptual and
     coding difficulties that you encountered in developing and debugging
     the program (and there will always be some).

   This log will generally be worth 5-10% of the total grade.

   N.B. To facilitate analysis, the log file MUST be the only file submitted
   whose name contains the string "log" and the estimate / total MUST be on
   the only line in that file that contains the string "ESTIMATE" / "TOTAL".

4. The submit program can be invoked in eight different ways:

     % /c/cs223/bin/submit 1 Makefile Count.c util.c time.log

   submits the named source files as your solution to Homework #1;

     % /c/cs223/bin/check 2

   lists the files that you submitted for Homework #2;

     % /c/cs223/bin/unsubmit 3 error.submit bogus.solution

   deletes the named files that you submitted previously for Homework #3
   (which is useful if you rename a file or accidentally submit the wrong
   one);

     % /c/cs223/bin/makeit 4 Count

   runs "make" on the files that you submitted previously for Homework #4;

     % /c/cs223/bin/testit 5 Count

   runs the public test script for Count using the files that you submitted
   previously for Homework #5;

     % /c/cs223/bin/protect 6 Count.c time.log

   protects the named files that you submitted previously for Homework #6
   (so they cannot be deleted accidentally);

     % /c/cs223/bin/unprotect 7 util.c time.log

   unprotects the named files that you submitted previously for Homework #7
   (so they can be deleted); and

     % /c/cs223/bin/retrieve 8 common.c time.log

   and

     % /c/cs223/bin/retrieve 8 -d"2016/01/21 20:00" util.c

   retrieve copies of the named files that you submitted previously for
   Homework #8 (in case you accidentally delete your own copies).
   The day and hour are optional and request the latest submission prior to
   that time (see the -d flag under "man co" for how to specify times).

5. When assignments are style graded, EVERY source file found in the submit
   directory will be reviewed. Thus prudence suggests using unsubmit to
   remove a file from the directory when you change its name or it ceases to
   be part of your solution.

6. Prudence (and a 5-point penalty for code that does not make) suggests
   that you run makeit ("makeit 1 Count") after you have submitted the final
   version of your source files. Better yet, run testit ("testit 1 Count").

7. Count is easier to write if you can peek at the next character in the
   standard input without reading it. The macro

     #define ungetchar(c) ungetc(c,stdin)  // Unread char read from stdin

   allows you to push a character back onto the standard input. That is,
   the character C "unread" will be the next character returned by
   getchar(). The value returned by ungetchar() is its argument, or EOF if
   the operation was unsuccessful.

   Note: Every ungetchar() must be preceded by a getchar(), and you can only
   do one ungetchar() between successive getchar()'s.

   You may find this macro useful in writing Count since it allows you to
   read the next character and then decide that you should not have.
   Example:

     while ((c = getchar()) != EOF) {
         if (c == 'C') {
             c = getchar();
             if (c == 'S')
                 printf ("I found a CS in the standard input!\n");
             else
                 ungetchar(c);
         }
     }

8. The function exit() allows your program to stop immediately, without
   having to terminate any surrounding loops or to return to main() from a
   function. (To use it you must #include the header file <stdlib.h>.)

9. Count reads from stdin and writes to stdout but does no other
   input/output.

A. Features of C99 (but not ANSI C) that may be useful:

   * The characters // begin a comment that ends at the end of the line.

   * Variable declarations can appear anywhere within a code block; no
     longer must variables be defined at the top of a code block or outside
     all functions.

   * The header file stdbool.h defines type bool (meaning boolean) and
     symbolic constants true and false. (To use it you must
     #include <stdbool.h>.)

   * Functions must declare a return type; no longer does the type default
     to int if no type is specified.

B. Hints:

   * You do not need any language features (e.g., "variable-length arrays"
     or strings or pointers or structs) not described in K&R, Chapter 1.
     However, you may want to use enum instead of #define to define symbolic
     constants.

   * Handle line splices in ONE place (e.g., BEFORE testing for anything
     else).

C. Correct handling of line splices will be worth at most 10 points.
   Correct handling of "else" will be worth 2 points.

D. Reading:
     Kernighan & Ritchie, Chapter 1 (introduction to C)
       Summit: https://www.eskimo.com/~scs/cclass/krnotes/sx4.html (K&R 1)
     Kernighan & Pike, Chapter 1 (style), Chapter 5 (debugging)
     Matthew & Stones, pp. 377-387 (makefiles), 429-445 (debugging)
     Matloff: http://heather.cs.ucdavis.edu/~matloff/Debug/Debug.pdf
   Optional:
     Aspnes: http://zoo.cs.yale.edu/classes/cs223/doc/howto.html
     Matthew & Stones, pp. 1-8 (Linux), 17-23 (bash)

E. My solution is 130 lines long (83 lines ignoring comments, blank lines,
   brace-only, and else-only lines, of which 15 were needed to handle else).
   If your solution looks to be much larger than this, you should talk to
   one of the instructional staff about your approach.

                                                             CS-223-01/20/16

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Excerpts from Appendix A of Kernighan and Ritchie

A2 Lexical Conventions

A program consists of one or more translation units stored in files. It is
translated in several phases, which are described in Section A12.
The first phases do low-level lexical transformations, carry out directives
introduced by lines beginning with the # character, and perform macro
definition and expansion. When the preprocessing of Section A12 is complete,
the program has been reduced to a sequence of tokens.

A2.2 Comments

The characters /* introduce a comment, which terminates with the characters
*/. Comments do not nest, and they do not occur within string or character
literals.

A12 Preprocessing

A preprocessor performs macro substitution, conditional compilation, and
inclusion of named files. Lines beginning with #, perhaps preceded by white
space, communicate with this preprocessor. The syntax of these lines is
independent of the rest of the language; they may appear anywhere and have
effect that lasts (independent of scope) until the end of the translation
unit. Line boundaries are significant; each line is analyzed individually
(but see Section A12.2 for how to adjoin lines). To the preprocessor, a
token is any language token, or a character sequence giving a file name as in
the #include directive (Section A12.4); in addition, any character not
otherwise defined is taken as a token. However, the effect of white space
characters other than space and horizontal tab is undefined within
preprocessor lines.

Preprocessing itself takes place in several logically successive phases that
may, in a particular implementation, be condensed.

1. First, trigraph sequences as described in Section A12.1 are replaced by
   their equivalents. Should the operating system environment require it,
   newline characters are introduced between the lines of the source file.

2. Each occurrence of a backslash character \ followed by a newline is
   deleted, thus splicing lines (Section A12.2).

3. The program is split into tokens separated by white-space characters;
   comments are replaced by a single space. Then preprocessing directives
   are obeyed, and macros (Sections A12.3-A12.10) are expanded.

4.
   Escape sequences in constants and string literals (Sections A2.5.2, A2.6)
   are replaced by their equivalents; then adjacent string literals are
   concatenated.

                                                             CS-223-01/08/09
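
Phase 2 of the excerpt (deleting each backslash-newline pair) can be sketched
with the ungetchar() macro from note 7. This is only an illustration under
stated assumptions: the helper name getcharSpliced() is hypothetical, and it
deletes splices the way the preprocessor does, whereas Count must also copy
the spliced characters to the standard output. It uses no arrays, pointers,
or global variables.

```c
#include <stdio.h>

// The macro from note 7: push one character back onto stdin.
#define ungetchar(c) ungetc(c, stdin)

// Hypothetical helper (not part of the assignment): return the next
// character of stdin as it would appear after phase 2, that is, with
// every backslash-newline pair removed.  Returns EOF at end of input.
int getcharSpliced(void)
{
    int c = getchar();

    while (c == '\\') {
        int next = getchar();           // one-character look-ahead

        if (next != '\n') {             // ordinary backslash, not a splice
            if (next != EOF)
                ungetchar(next);        // put the look-ahead back
            return '\\';
        }
        c = getchar();                  // splice deleted; read what follows
    }
    return c;
}
```

A caller would use getcharSpliced() wherever it would otherwise call
getchar(). Because the helper may itself push a character back, a caller
that wants its own one-character look-ahead on top of it would need to keep
that character in a variable rather than call ungetchar() a second time.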