P R E L I M I N A R Y    S P E C I F I C A T I O N


                                          Due 2:00 AM, Friday, 29 January 2016

CS-223   Homework #1    A Gentle Introduction to C

REMINDER:  Do not under any circumstances copy another student's code or give a
copy of your code to another student.  After discussing the assignment with
another student, you may not take any written or electronic record away.
Moreover, you must engage in a full hour of mind-numbing activity before you
work on it again.  Such discussions must be noted in your log file.


(40) Write a filter "Count" that copies a C program from the standard input
to the standard output, replacing each newline that ends a line of code by a
string of the form " //nnn\n", where nnn is the number of lines of code at
that point in the source file.

In particular, Count should

* Use the format " //%d\n" to print out the line numbers.

* Count as lines of code only those containing non-whitespace (see isspace()).
  Note:  Comments, braces ({ and }) and the keyword "else" are treated as
  whitespace.

* Assume that the input does not contain any trigraphs (which does not require
  any code).

* Ignore the effect of preprocessing directives (e.g., if BEGIN is #define-d as
  {, then a line containing only BEGIN is a code line, whereas one containing
  only { is not).  This does not require any code.

* Handle line splices correctly (i.e., when finding a backslash immediately
  followed by a newline in the input stream, copy these characters but behave
  as if they did not appear; note that splices do not nest and are recognized
  before any other processing takes place).
                                                          ).
* Handle both C (/*...*/) and C++ (//...) comments correctly (e.g., they do not
  nest, they are the equivalent of a single space character, and a C++ comment
  does not include the trailing newline).

* Handle char constants and strings correctly.

* Handle escaped characters within char constants and strings correctly.

* Fail "gracefully" (i.e., neither go into an infinite loop nor cause a memory
  dump) if the input is not a legitimate C program or any of the assumptions
  above is violated.

but need not

* Allow multicharacter char constants.

Moreover, Count should not

* Make ANY assumptions as to the maximum length of a line.

* Use any global variables.

* Use any arrays or pointers.

Use the submit command (see below) to turn in the source file(s) for Count, a
Makefile, and your log file (see below).

YOU MUST SUBMIT YOUR FILES (INCLUDING THE LOG FILE) AT THE END OF ANY SESSION
WHERE YOU SPEND AT LEAST ONE-HALF HOUR WRITING OR DEBUGGING CODE, AND AT LEAST
ONCE EVERY HOUR DURING LONGER SESSIONS.  (All submissions are retained.)

Notes
~~~~~
1. See Kernighan and Ritchie, pp. 192-194 and pp. 228-229, for more information
   about comments, escaped characters in char constants and strings, trigraphs,
   and line splices.  Excerpts are appended.

2. When available, the public grading script will be /c/cs223/Hwk1/test.Count
   (and my solution will be /c/cs223/Hwk1/Count).  To run it, type

     % /c/cs223/Hwk1/test.Count

   (here % is the shell prompt).  The script uses make to create Count.  To
   run each test it redirects the test file (e.g., /c/cs223/Hwk1/Tests/t01.c
   for Test #01) to the standard input of Count and redirects the standard
   output to a temporary file.  Then it compares this file with the expected
   output for that input (e.g., /c/cs223/Hwk1/Tests/t01.cs for Test #01).
   Your program passes the test only if the two files are identical.

   To run your program on the file for Test #01, type

     % ./Count < /c/cs223/Hwk1/Tests/t01.c

   To compare the output from your program with the expected output, type

     % ./Count < /c/cs223/Hwk1/Tests/t01.c | cmp - /c/cs223/Hwk1/Tests/t01.cs

   (cmp outputs the first character where the files differ) or

     % ./Count < /c/cs223/Hwk1/Tests/t01.c | diff - /c/cs223/Hwk1/Tests/t01.cs

   (diff outputs the lines where they differ but uses a looser definition for
   "identical") or

     %  /c/cs223/Hwk1/test.Count 01

   (you may specify more than one test here).

   If your output looks the same as what is expected, but your program still
   fails the test, there are probably some invisible characters in your output.
   To make all characters visible (except blanks), type

     % ./Count < /c/cs223/Hwk1/Tests/t01.c | cat -vet

   or

     % ./Count < /c/cs223/Hwk1/Tests/t01.c | od -bc

3. Keep track of how you spend your time in completing this assignment.  Your
   log file should be of the general form (that below is fictitious):

     ESTIMATE of time to complete assignment: 10 hours

           Time     Time
     Date  Started  Spent Work completed
     ----  -------  ----  --------------
     1/13  10:15pm  0:45  Read assignment and relevant material in K&R
     1/16   4:45pm  1:15  Sketched solution using a finite-state machine with
                            one-character look-ahead
     1/19   9:00am  2:20  Wrote the program and eliminated compile-time errors;
                            code passes eight tests
     1/20   7:05pm  2:00  Discovered and corrected two logical errors; code now
                            passes eleven tests
     1/23  11:00am  1:35  Finished debugging; program passes all public tests
                    ----
                    7:55  TOTAL time spent

     I discussed my solution with: Peter Salovey, Ben Polak, Tamar Gendler,
     and Jonathan Holloway (and watched four episodes of The Simpsons).

     <A brief discussion of the major difficulties encountered>

   but MUST contain

   * your estimate of the time required (made prior to writing any code),

   * the total time you actually spent on the assignment,

   * the names of all others (but not members of the teaching staff) with whom
     you discussed the assignment for more than 10 minutes, and

   * a brief discussion (100 words MINIMUM) of the major conceptual and coding
     difficulties that you encountered in developing and debugging the program
     (and there will always be some).

   This log will generally be worth 5-10% of the total grade.

   N.B.  To facilitate analysis, the log file MUST be the only file submitted
   whose name contains the string "log" and the estimate / total MUST be on the
   only line in that file that contains the string "ESTIMATE" / "TOTAL".

4. The submit program can be invoked in eight different ways:

     % /c/cs223/bin/submit  1  Makefile Count.c util.c time.log

   submits the named source files as your solution to Homework #1;

     % /c/cs223/bin/check  2

   lists the files that you submitted for Homework #2;

     % /c/cs223/bin/unsubmit  3  error.submit bogus.solution

   deletes the named files that you submitted previously for Homework #3 (which
   is useful if you rename a file or accidentally submit the wrong one);

     % /c/cs223/bin/makeit  4  Count

   runs "make" on the files that you submitted previously for Homework #4;

     % /c/cs223/bin/testit  5  Count

   runs the public test script for Count using the files that you submitted
   previously for Homework #5;

     % /c/cs223/bin/protect  6  Count.c time.log

   protects the named files that you submitted previously for Homework #6 (so
   they cannot be deleted accidentally);

     % /c/cs223/bin/unprotect  7  util.c time.log

   unprotects the named files that you submitted previously for Homework #7 (so
   they can be deleted); and

     % /c/cs223/bin/retrieve  8  common.c time.log

   and

     % /c/cs223/bin/retrieve  8  -d"2016/01/21 20:00" util.c

   retrieve copies of the named files that you submitted previously for
   Homework #8 (in case you accidentally delete your own copies).  The day
   and hour are optional and request the latest submission prior to that time
   (see the -d flag under "man co" for how to specify times).

5. When assignments are style graded, EVERY source file found in the submit
   directory will be reviewed.  Thus prudence suggests using unsubmit to remove
   a file from the directory when you change its name or it ceases to be part
   of your solution,

6. Prudence (and a 5-point penalty for code that does not make) suggests that
   you run makeit ("makeit 1 Count") after you have submitted the final version
   of your source files.  Better yet, run testit ("testit 1 Count").

7. Count is easier to write if you can peek at the next character in the
   standard input without reading it.  The macro

     #define ungetchar(c) ungetc(c,stdin)   // Unread char read from stdin

   allows you to push a character back onto the standard input.  That is, the
   character C "unread" will be the next character returned by getchar().  The
   value returned by ungetchar() is its argument, or EOF if the operation was
   unsuccessful.  Note: Every ungetchar() must be preceded by a getchar(),
   and you can only do one ungetchar() between successive getchar()'s.

   You may find this macro useful in writing Count since it allows you to read
   the next character and then decide that you should not have.

   Example:
        while ((c = getchar()) != EOF) {
            if (c == 'C') {
                c = getchar();
                if (c == 'S')
                    printf ("I found a CS in the standard input!\n");
                else
                    ungetchar(c);
            }
        }

8. The function exit() allows your program to stop immediately, without having
   to terminate any surrounding loops or to return to main() from a function.
   (To use it you must #include the header file <stdlib.h>.)

9. Count reads from stdin and writes to stdout but does no other input/output.

A. Features of C99 (but not ANSI C) that may be useful:

   * The characters // begin a comment that ends at the end of the line.

   * Variable declarations can appear anywhere within a code block; no longer
     must variables be defined at the top of a code block or outside all
     functions.
   
   * The header file stdbool.h defines type bool (meaning boolean) and symbolic
     constants true and false.  (To use it you must #include <stdlib.h>.)
   
   * Functions must declare a return value; no longer does the type default to
     int if no type is specified.

B. Hints:

   * You do not need any language features (e.g., "variable-length arrays" or
     strings or pointers or structs) not described in K&R, Chapter 1.  However,
     you may want to use enum instead of #define to define symbolic constants.

   * Handle line splices in ONE place (e.g., BEFORE testing for anything else).

C. Correct handling of line splices will be worth at most 10 points.  Correct
   handling of "else" will be worth 2 points.

D. Reading:
     Kernighan & Ritchie, Chapter 1 (introduction to C)
     Summit: https://www.eskimo.com/~scs/cclass/krnotes/sx4.html (K&R 1)
     Kernighan & Pike, Chapter 1 (style), Chapter 5 (debugging)
     Matthew & Stones, pp. 377-387 (makefiles), 429-445 (debugging)
     Matloff: http://heather.cs.ucdavis.edu/~matloff/Debug/Debug.pdf
   Optional:
     Aspnes: http://zoo.cs.yale.edu/classes/cs223/doc/howto.html
     Matthew & Stones, pp. 1-8 (Linux), 17-23 (bash)

E. My solution is 130 lines long (83 lines ignoring comments, blank lines,
   brace-only, and else-only lines, of which 15 were needed to handle else).
   If your solution looks to be much larger than this, you should talk to one
   of the instructional staff about your approach.
                                                                CS-223-01/20/16


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


                Excerpts from Appendix A of Kernighan and Ritchie

A2  Lexical Conventions

    A program consists of one or more translation units stored in files.  It is
  translated in several phases, which are described in Section A12.  The first
  phases do low-level lexical transformations, carry out directives introduced
  by lines beginning with the # character, and perform macro definition and
  expansion.  When the preprocessing of Section A12 is complete, the program
  has been reduced to a sequence of tokens.

A2.2  Comments

    The characters /* introduce a comment, which terminates with the characters
  */.  Comments do not nest, and they do not occur within string or character
  literals.

A12  Preprocessing

    A preprocessor performs macro substitution, conditional compilation, and
  inclusion of named files.  Lines beginning with #, perhaps preceded by white
  space, communicate with this preprocessor.  The syntax of these lines is
  independent of the rest of the language; they may appear anywhere and have
  effect that lasts (independent of scope) until the end of the translation
  unit.  Line boundaries are significant; each line is analyzed individually
  (but see Section A12.2 for how to adjoin lines).  To the preprocessor, a
  token is any language token, or a character sequence giving a file name as
  in the #include directive (Section A12.4); in addition, any character not
  otherwise defined is taken as a token.  However, the effect of white space
  characters other than space and horizontal tab is undefined within
  preprocessor lines.

    Preprocessing itself takes place in several logically successive phases
  that may, in a particular implementation, be condensed.

  1. First, trigraph sequences as described in Section A12.1 are replaced by
     their equivalents.  Should the operating system environment require it,
     newline characters are introduced between the lines of the source file.

  2. Each occurrence of a backslash character \ followed by a newline is
     deleted, thus splicing lines (Section A12.2).

  3. The program is split into tokens separated by white-space characters;
     comments are replaced by a single space.  Then preprocessing directives
     are obeyed, and macros (Sections A12.3-A12.10) are expanded.

  4. Escape sequences in constants and string literals (Sections A2.5.2, A2.6)
     are replaced by their equivalents; then adjacent string literals are
     concatenated.
                                                                CS-223-01/08/09