YALE UNIVERSITY
DEPARTMENT OF COMPUTER SCIENCE
| CPSC 223b: Data Structures and Programming Techniques | Handout #7 | |
| Professor M. J. Fischer | February 5, 2008 | |
Problem Set 4
Due before midnight on Tuesday, February 12, 2008.
The editor of a newspaper wants to improve the writing of her reporters. She is tired of seeing overused words like “like” and long articles with limited vocabularaies and wants a tool to help her analyze writing style. To help her in her quest, you are to write a program that will read a text file containing a news article and produce an alphabetical list of words contained therein, where each word is preceded by a count of the number of times it occurs in the article.
To make the output more useful, the following conventions are to be observed:
Your program should read the file a raw word at a time. It should clean up each raw word and save each interesting word in a flex array (see below). After the file has been completely read, the flex array should be sorted alphabetically. This will bring like words together in the array. A pass over the array counts the number of occurrences of each word and prints out the count and the word.
I have supplied two modules for your use. They are on the Zoo in the PS4 assignment directory /c/cs223/assignments/ps4.
The first, util.c and util.h, are slightly updated versions of the ones for PS3. In addition to fatal(), they contain functions safe_malloc() and safe_realloc(), which you will need.
The other, flex.c and flex.h, implement a flex array data structure. This is a container that stores a list of strings of arbitrary length. It allows you to add strings one at a time and “never” runs out of space (unless the computer runs out of memory).
You are required to use my code to store the list of words since part of the purpose of this assignment is to learn how to interface to other people’s code. You should copy the.c and .h files to your own directory and compile them there, but you should not modify them.
Your own code should be in two modules that you create. The file main.c should contain only the main program main(); all other code goes in wordfreq.c. All that main() should do is to process its command line argument and open the input file. The processing of the contents of the file should be done by functions in wordfreq.c. You will need to write a header file wordfreq.h to be included in main.c that gives the prototype for the function that main() will call to process the file contents.
The function doing the processing should itself not be monolithic but should be split up into a number of smaller functions, each with a well-defined purpose. The processing naturally occurs in two phases: the first phase reads the file and builds the word list; the second phase sorts the word list and produce the output. Your code should reflect this two-phase structure by having separate functions for each phase. Note that the purpose of doing this is to make the code less complex and more managable, not simply to avoid code duplication. In addition, smaller units of code with well-defined purposes should be split out as separate functions. For example, it is good to have a function that takes as its argument a raw word and returns a cleaned-up version of that word.
As usual, you should supply test data to verify that your program correctly implements the requirements spelled out in this assignment.
You should provide a Makefile that will build an executable file called wordfreq which expects one command line argument, the name of a text file to process. You should submit Makefile, main.c, wordfreq.c, and wordfreq.h. You do not need to submit copies of the source and header files for util and flex since we already have them, and your code will be built against our files, not yours. As always, you should also submit your test input files and the output produced by your program on those files.