WORD & LINE CONCORDANCE
Since the concordance should only keep track of the "main" words, there will actually be two files of textual information. The first will be a list of stop words--these words will not be included in the concordance even if they do appear in the data file. The stop words are to be stored in a second dictionary which will be accessed to ensure its contents do not appear in the concordance. The second file will be the actual data file. Words in this file are to be extracted, compared with the stop words, and, when appropriate, added to the concordance list along with the line number of the current occurrence of the word. Often, a word will be encountered several times--each line of encounter is to be recorded in the concordance list, but each word is to appear only once. Finally, the words are to be printed out in alphabetical order along with the numbers of the lines in which they appear.
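The overall flow described above can be sketched as follows. This is only an illustration, not the required solution: it uses Python's built-in set and dict as stand-ins for the two dictionaries (the assignment requires your own hash-table dictionary ADT), the names build_concordance and print_concordance are made up for this example, and it assumes a line number is listed once even if the word appears on that line several times.

```python
def build_concordance(occurrences, stop_words):
    """Map each non-stop word to the ordered list of line numbers it appears on."""
    concordance = {}  # stand-in for the hash-table dictionary ADT
    for word, line_number in occurrences:
        if word in stop_words:
            continue  # stop words never enter the concordance
        lines = concordance.setdefault(word, [])
        # Assumption: list a line once even if the word repeats on it.
        if not lines or lines[-1] != line_number:
            lines.append(line_number)
    return concordance


def print_concordance(concordance):
    """Print each word in alphabetical order followed by its line numbers."""
    for word in sorted(concordance):
        print(word, *concordance[word])


# Hypothetical, already-extracted (word, line number) pairs.
stop_words = {"the", "a", "is"}
occurrences = [("the", 1), ("table", 1), ("is", 1), ("a", 1), ("hash", 1),
               ("hash", 2), ("table", 2), ("table", 2)]
print_concordance(build_concordance(occurrences, stop_words))
```

Each word appears once in the result, with its line numbers in the order they were encountered.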
a) Words are defined to be sequences of letters that are delimited by any white space, punctuation, brackets, parentheses, dashes (two hyphens in a row), double quotes, etc., but not by an apostrophe or a single hyphen. For example, "it's" and "end-of-line-characters" should be considered words.
b) There is to be no distinction made between upper and lower case characters, i.e., "ADT" is the same word as "adt".
c) The line numbers are to relate to non-empty lines. Blank lines are not to be counted.
d) The application certainly should not reference the hash table implementation, except through the dictionary operations.
e) It is strongly suggested that the logic for reading words and assigning line numbers to them be developed and tested separately from other aspects of the program. This could be accomplished by reading a sample file and printing out the words recognized and the lines they appeared on with no effort to avoid duplicates or associate words with more than one line.
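One possible way to test the word-reading logic on its own, as suggested in e), is sketched below. The regular expression is an assumption about how to encode rule a): runs of letters that may be joined by a single apostrophe or a single hyphen, so that a double hyphen acts as a delimiter. The names WORD_PATTERN and words_with_line_numbers are invented for this example, and the sample lines are made up.

```python
import re

# Assumed pattern: letters, optionally joined by single apostrophes or hyphens.
WORD_PATTERN = re.compile(r"[a-z]+(?:['-][a-z]+)*")


def words_with_line_numbers(lines):
    """Yield (word, line_number) pairs; only non-blank lines are numbered."""
    line_number = 0
    for line in lines:
        if not line.strip():
            continue  # rule c): blank lines are not counted
        line_number += 1
        # rule b): fold to lower case before matching
        for word in WORD_PATTERN.findall(line.lower()):
            yield word, line_number


sample = [
    "The ADT stores words--it's simple.",
    "",
    "End-of-line-characters separate lines (the adt ignores them).",
]
for word, number in words_with_line_numbers(sample):
    print(word, number)
```

Duplicates are deliberately left in the output here; the point is only to check that words and line numbers are recognized correctly before the dictionary is involved.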
Since we don't know how many words we will have in our concordance, you should implement the dictionary ADT using a closed-address hash table with external chaining. Your hashing function should try to evenly distribute the words across all slots in the hash table using some method like folding and/or mid-square.
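As a rough illustration of folding and chaining, here is a minimal sketch. Everything in it is an assumption made for the example: the names folding_hash, ChainedHashTable, put, and get are hypothetical (use the dictionary operations named in the text), the table size of 101 is arbitrary, and plain Python lists stand in for the linked chains you are expected to build.

```python
def folding_hash(word, table_size, chunk_size=2):
    """Fold the word's character codes in chunks, then reduce modulo table_size."""
    total = 0
    for start in range(0, len(word), chunk_size):
        chunk_value = 0
        for ch in word[start:start + chunk_size]:
            chunk_value = chunk_value * 256 + ord(ch)
        total += chunk_value  # add the folded chunks together
    return total % table_size


class ChainedHashTable:
    """Closed-address table: each slot holds a chain of (key, value) pairs."""

    def __init__(self, size=101):
        self.size = size
        self.slots = [[] for _ in range(size)]  # one empty chain per slot

    def put(self, key, value):
        """Add the pair, or replace the value if the key is already present."""
        chain = self.slots[folding_hash(key, self.size)]
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def get(self, key, default=None):
        """Return the value stored for key, or default if it is absent."""
        for existing_key, value in self.slots[folding_hash(key, self.size)]:
            if existing_key == key:
                return value
        return default


table = ChainedHashTable()
table.put("adt", [1, 4])
print(table.get("adt"), table.get("missing"))
```

Because colliding keys simply share a chain, the table never fills up, which is why chaining suits a concordance whose final size is unknown.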
Section 126.96.36.199 of the text describes the necessary dictionary operations. They are:
The application code and the ADT code should be developed and tested separately. (I will not help anyone with a combined program unless each part has first been tested separately.)
The stop words are in the file
The textual information to be examined is in the file
1) A linked implementation of a queue ADT may be useful to store the lines-of-occurrence for each word.
2) An ordered linked list similar to the one in section 7.2 may be useful to store the chain of items hashed to the same slot of the hash table.
3) You might want to use a regular expression to define a word. See the HOWTO at:
You are to electronically submit and turn in hardcopy printouts of:
(only turn in a hardcopy of the first and last page of the output, but electronically submit the whole file)