Programming Assignment 6
An Improved Text Processing System
810:052
Data Structures
Fall Semester 1997
Due: by 10:00 AM on Monday, December 15
Background: Your Text Processing System
For Project 3, you implemented a small system
of classes to do a simple form of text analysis. Scholars use more
advanced forms of such analysis as the starting point for determining
the authenticity of texts. For example, many scholars actively debate
whether the works attributed to William Shakespeare were actually written
by various other people, most notably
Edward de Vere, 17th Earl of Oxford.
Your implementation used two existing classes -- Stradt and
List≤LE> -- and two classes of your own writing --
WordCount and TextAnalyst. Throughout the course of the
project, we identified several potential shortcomings in the existing
classes, particularly their implementations, and several potential
shortcomings in the specification of the new classes. One shortcoming was
simply a matter of data structure selection: Lists cannot provide
efficient enough performance when our program must manipulate large
collections of data. Hamlet and Much Ado About Nothing
qulaify as large documents.
Your Tasks
Your task for Project 6 is to improve your text processing system in
several ways.
- Improving the system in general.
- Be sure that all class declarations are made within
#ifndef...#endif pairs, so that we need not
worry about multiple definitions of classes on #include
directives.
- Within each file you write, #include any interface file
whose class you refer to directly within the file.
- Improving the Stradt class.
- #include <iostream.h> at the top of the interface.
- Add a member function void strip_punctuation(). In
response to this message, a Stradt will change itself
so that any leading or trailing punctuation characters are
removed. For example:
- A Stradt containing "SCENE.-" would change its
state to "SCENE"."
- A Stradt containing "[who" would change its state
to "who".
- A Stradt containing "[first,]" would change its
state to "first".
For the purposes of this assignment, a punctuation character
is any non-alphanumeric character ('a' through 'z', 'A' through
'Z', and '0' through '9').
- Add a member function void convert_to_lowercase(). In
response to this message, a Stradt will change all of
its chars to lowercase. For example:
- A Stradt containing "SCENE" would change its state
to "scene"."
- A Stradt containing "Hello" would change its state
to "hello".
- A Stradt containing "first" would change its
state to "first".
- Improving the WordCount class.
- Make sure that we have two ways to create WordCounts:
one in which we provide a Stradt argument, and one in
which we provide no arguments. In both cases, the count should
be set to 0. In the default constructor, the
WordCount's word should be set to a default
Stradt.
- Add operator== and operator< functions to
the class. The BST<T> uses these operations when
comparing items of type <T> in the tree.
- Improving the BST class.
- Add a void clear() operation that deletes the tree and
all memory allocated to it. You can the clear()
function defined for Roberge's binary search tree class as the
basis for your implementation.
- Improving the TextAnalyst class.
- Use a binary search tree of word counts as the (only) member
data object. This data structure provides more efficient
look-up operations than list does, while still allowing us to
iterate over the whole collection in order.
- Before inserting a word into the tree, strip all leading and
trailing punctuation and then convert it to all lowercase
characters. This will ensure that identical words that differ
only by case and sentence position will be treated as identical
words. Of course, if a word strips down to the empty string, do
not insert it into the tree.
- If necessary, convert your "starts with 'q', 'x', or 'z'"
member function to one that allows any character as an argument.
Your new int starts_with(char) member function should
return true if any word starts with the character passed.
- Improving the driver program.
- Use a procedure to display common code. For example, in my
implementation, I have one chunk of code that displays the
stats for Hamlet, and a nearly identical chunk of
code that displays the stats for Much Ado.
I should have a procedure
void display_text_stats(TextAnalyst) to which I can
pass a TextAnalyst. Then the driver would merely call
this procedure twice.
Source Files
Start with these source files. You should not need to re-copy them, since
you downloaded them for Project 3, but here they are, just in case:
If you would like to start with "clean" files implemented for Project 3,
here are mine. Feel free to use yours or mine, whichever you prefer.
Finally, here are the files for Berman's binary search tree ADT:
Deliverables
As always, your files must begin with a header block that includes the file
name, your name as author, and the creation and modification history of the
file. You may use this file
as a template. And don't forget to follow the programming
style sheet for the course!
Submit to your instructor, by the due time and date, the following:
- seven separate e-mail messages, each of which contains only a
single C++ file.
- your driver program
- the modified interface for TextAnalyst
- the new implementation for TextAnalyst
- the modified interface for WordCount
- the modified implementation for WordCount
- the modified interface for Stradt
- the modified implementation for Stradt
- a print-out of your files, stapled in the order listed above.
Eugene Wallingford ====
wallingf@cs.uni.edu ====
December 12, 1997