
This assignment has two parts--a dictionary ADT and a concordance-production application using the ADT. A Webster's dictionary definition of concordance is: "an alphabetical list of the main words in a work." In addition to the main words, I want you to keep track of all the line numbers where these words occur.

WORD & LINE CONCORDANCE

Since the concordance should only keep track of the "main" words, there will actually be two files of textual information. The first will be a list of stop words--these words will not be included in the concordance even if they do appear in the data file. The stop words are to be stored in a second dictionary which will be accessed to ensure its contents do not appear in the concordance. The second file will be the actual data file. Words in this file are to be extracted, compared with the stop words, and, when appropriate, added to the concordance list along with the line number of the current occurrence of the word. Often, a word will be encountered several times--each line of encounter is to be recorded in the concordance list, but each word is to appear only once. Finally, the words are to be printed out in alphabetical order along with the numbers of the lines in which they appear.
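The flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the names build_concordance and concordance_lines are hypothetical, and Python's built-in dict, set, and list stand in for the hash-table dictionary and the line-number storage that the assignment asks you to implement yourself.

```python
def build_concordance(word_line_pairs, stop_words):
    """Collect line numbers for every word that is not a stop word.

    word_line_pairs: iterable of (word, line_number) tuples, already lowercased.
    stop_words: a container supporting "in" (a stand-in for the stop-word dictionary).
    """
    concordance = {}  # stand-in for the hash-table dictionary ADT
    for word, line_number in word_line_pairs:
        if word in stop_words:
            continue  # stop words never enter the concordance
        # Each word appears once as a key; every occurrence adds a line number.
        concordance.setdefault(word, []).append(line_number)
    return concordance


def concordance_lines(concordance):
    """Return one output line per word, in alphabetical order."""
    return [
        word + ": " + " ".join(str(n) for n in concordance[word])
        for word in sorted(concordance)
    ]


if __name__ == "__main__":
    pairs = [("the", 1), ("adt", 1), ("hash", 2), ("adt", 2)]
    for line in concordance_lines(build_concordance(pairs, {"the"})):
        print(line)
```

Note that the real dictionary ADT only offers store and search, so your design will need its own way to obtain the words in alphabetical order.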

NOTES:

a) Words are defined to be sequences of letters delimited by white space, punctuation, brackets, parentheses, dashes (two hyphens in a row), double quotes, etc., but not by an apostrophe or a single hyphen. For example, "it's" and "end-of-line-characters" should each be considered a single word.

b) There is to be no distinction made between upper and lower case characters, i.e., "ADT" is the same word as "adt".

c) The line numbers are to relate to non-empty lines. Blank lines are not to be counted.

d) The application certainly should not reference the hash table implementation, except through the dictionary operations.

e) It is strongly suggested that the logic for reading words and assigning line numbers to them be developed and tested separately from other aspects of the program. This could be accomplished by reading a sample file and printing out the words recognized and the lines they appeared on with no effort to avoid duplicates or associate words with more than one line.
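As a starting point for note (e), here is a minimal sketch of reading words and assigning line numbers, kept separate from the rest of the program. The regular expression and the name words_with_line_numbers are assumptions, not part of the assignment; the sketch also assumes that a line containing only white space counts as blank.

```python
import re

# Assumed pattern: runs of letters, optionally joined by a single apostrophe
# or a single hyphen. Two hyphens in a row cannot match, so "--" splits words.
WORD_PATTERN = re.compile(r"[a-z]+(?:['-][a-z]+)*")


def words_with_line_numbers(lines):
    """Yield (word, line_number) pairs; blank lines are skipped and not counted."""
    line_number = 0
    for line in lines:
        if not line.strip():
            continue  # note (c): blank lines do not advance the count
        line_number += 1
        # note (b): lowercase the line so "ADT" and "adt" are the same word
        for word in WORD_PATTERN.findall(line.lower()):
            yield word, line_number


if __name__ == "__main__":
    sample = ["The ADT--it's simple.", "", "End-of-line-characters (adt)"]
    for word, number in words_with_line_numbers(sample):
        print(number, word)
```

To run it on a real file, pass an open file object instead of the sample list, since iterating over a file yields its lines.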

DICTIONARY ADT

Since we don't know how many words we will have in our concordance, you should implement the dictionary ADT using a closed-address hash table with external chaining. Your hashing function should try to evenly distribute the words across all slots in the hash table using some method like folding and/or mid-square.
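One possible folding hash is sketched below. The function name folding_hash and the chunk_size parameter are hypothetical, and this is just one of many reasonable ways to fold a string; it is not necessarily the version presented in the text.

```python
def folding_hash(word, table_size, chunk_size=2):
    """Hash a word by folding: cut it into chunks, treat each chunk as an
    integer, add the chunk values together, and reduce modulo the table size."""
    total = 0
    for start in range(0, len(word), chunk_size):
        chunk_value = 0
        for ch in word[start:start + chunk_size]:
            # Pack the characters of the chunk into one integer, base 256.
            chunk_value = chunk_value * 256 + ord(ch)
        total += chunk_value
    return total % table_size


if __name__ == "__main__":
    for word in ["adt", "it's", "end-of-line-characters"]:
        print(word, folding_hash(word, 101))
```

A prime table size (101 here, chosen arbitrarily) tends to spread the folded sums more evenly than a size that shares factors with the chunk values.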

Section 4.3.3.3 of the text describes the necessary dictionary operations. They are:

HashTable(size) - creates a new hash table with "size" slots
store(item, data) - stores a new piece of data in the hash table using the item as the key location. It returns nothing.
search(item) - returns the data value associated with the key item. It returns None if the key is not in the hash table.
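The three operations might look roughly like this. Several things here are assumptions rather than requirements: each chain is a plain Python list standing in for a linked list, the hash is a simple placeholder that you should replace with folding and/or mid-square, and storing an existing key replaces its data.

```python
class HashTable:
    """Sketch of a chained hash table offering store and search."""

    def __init__(self, size):
        # One chain per slot; a Python list stands in for a linked list.
        self._slots = [[] for _ in range(size)]

    def _hash(self, item):
        # Placeholder hash: sum of character codes modulo the table size.
        return sum(ord(ch) for ch in item) % len(self._slots)

    def store(self, item, data):
        chain = self._slots[self._hash(item)]
        for entry in chain:
            if entry[0] == item:
                entry[1] = data  # assumed behaviour: replace existing data
                return
        chain.append([item, data])

    def search(self, item):
        for key, data in self._slots[self._hash(item)]:
            if key == item:
                return data
        return None  # key is not in the table


if __name__ == "__main__":
    table = HashTable(5)
    table.store("adt", [1, 4])
    table.store("tad", [2])      # same letters, so it lands in the same slot
    print(table.search("adt"))   # [1, 4]
    print(table.search("hash"))  # None
```

Testing with keys that collide, such as "adt" and "tad" under this placeholder hash, is a quick way to check the chaining logic before connecting the ADT to the application.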

The application code and the ADT code should be developed and tested separately. (I will not help anyone with a combined program unless each part has been tested separately.)

DATA FILES

The stop words are in the file

http://www.cs.uni.edu/~fienup/cs063s08/homework/stop_words.txt

The textual information to be examined is in the file

http://www.cs.uni.edu/~fienup/cs063s08/homework/hw4data.txt

HINTS:

1) A linked implementation of a queue ADT may be useful to store the lines-of-occurrence for each word.

2) An ordered linked list similar to the one in section 7.2 may be useful to store the chain of items hashed to the same slot of the hash table.

3) You might want to use a regular expression to define a word. See the HOWTO at:

http://docs.python.org/dev/howto/regex.html
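For hint 1, a linked queue for the line numbers could be sketched as below. The class and method names (LinkedQueue, enqueue, dequeue, is_empty) are assumptions and may differ from the queue ADT in the text.

```python
class _Node:
    """A single link holding one value and a reference to the next node."""

    def __init__(self, value):
        self.value = value
        self.next = None


class LinkedQueue:
    """First-in, first-out queue built from linked nodes."""

    def __init__(self):
        self._front = None
        self._rear = None

    def is_empty(self):
        return self._front is None

    def enqueue(self, value):
        node = _Node(value)
        if self._rear is None:
            self._front = node       # first node is both front and rear
        else:
            self._rear.next = node   # link the new node behind the old rear
        self._rear = node

    def dequeue(self):
        if self._front is None:
            raise IndexError("dequeue from empty queue")
        node = self._front
        self._front = node.next
        if self._front is None:
            self._rear = None        # queue became empty
        return node.value

    def __iter__(self):
        # Walk the links front to rear without removing anything.
        node = self._front
        while node is not None:
            yield node.value
            node = node.next


if __name__ == "__main__":
    lines_seen = LinkedQueue()
    for number in (3, 7, 12):
        lines_seen.enqueue(number)
    print(list(lines_seen))  # [3, 7, 12]
```

Because line numbers are enqueued in the order the file is read, walking the queue from front to rear yields them in increasing order with no sorting needed.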

SUBMISSION

You are to electronically submit and turn in hardcopy printouts of:

a one-page overview of the design of your program and directions for running your program (file: design.txt),
all of your program files, and
the output file containing your word/line concordance produced by running your program.
(Turn in a hardcopy of only the first and last pages of the output, but electronically submit the whole file.)