Mumps/MDH Toolkit
Experiments in Information Storage and Retrieval Using Mumps
5th Edition

Kevin C. O'Kane, Ph.D.
Computer Science Department
University of Northern Iowa
Cedar Falls, IA 50614
okane@cs.uni.edu
http://www.cs.uni.edu/~okane
May 4, 2008

Copyright (c) 2007, 2008 Kevin C. O'Kane, Ph.D.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with the Invariant Sections being: Page 1, with the Front-Cover Texts being: Page 1, and with the Back-Cover Texts being: no Back-Cover Texts.


Contents

  1. Introduction
  2. Mumps
    1. Installing Mumps
    2. The Mumps Language
    3. Comparing Mumps Information Storage and Retrieval Implementations to Other Approaches
    4. Introduction to Mumps Programming
    5. Example Programs
      1. Program to build a global array index from the NLM MESH (Medical Subject Heading) Hierarchy
      2. Program to scan input text using builtin parsing and stem reducing functions
        1. OSU Medline Data Base
      3. Program to read Medline format abstracts and write out the list of MESH headings for each abstract along with the byte offset of the beginning of the abstract
      4. Sort the above and print, for each MESH heading, a count of the number of abstracts it occurs in
      5. Program to print all the headings in MESH code order
      6. Program that will, when given a keyword, locate the MESH heading containing the keyword and display the full heading, hierarchy codes, and adjacent keywords at this level
      7. Make the previous program run as a web server program
      8. Program to compute an optimal binary tree.
      9. Dump/restore and data base compression
      10. Sorting from Mumps
  3. Experimental Data Bases
  4. Vocabularies
  5. Zipf's Law
  6. Precision and Recall
  7. Dictionary Construction
  8. Stop Lists
    1. Building a Stop List
  9. Vector Space Model
    1. Concept
    2. Similarity Coefficients
      1. Example Calculations
      2. Related Similarity Functions
    3. Experiments
      1. Assigning Word Weights
      2. Inverse Document Frequency and Basic Vector Space
        1. OSU Medline Data Base IDF Weights
        2. Wikipedia Data Base IDF Weights
      3. Calculating IDF Weights
      4. Discrimination Coefficients and Simple Automatic Indexing
      5. Basic Retrieval
        1. Scanning the doc-term matrix
        2. Scanning the term-doc matrix
        3. Weighted scanning of the term-doc matrix
        4. Scripted test runs
        5. Simple Retrieval
        6. Faster Simple Retrieval
      6. Thesaurus and Phrase Construction
        1. Basic Term-Term Co-Occurrence Matrix
        2. Modified Basic Indexing - Position Specific
        3. Proximity Weighted Term-Term Correlation Matrix
        4. Term-Term Clustering
        5. Construction of Term Phrases
      7. Document-Document Matrices
      8. File and Document Clustering
      9. Web Page Access
      10. N-Gram Experiment
      11. Example Code
  10. Google Page Rank Algorithm
  11. Overview of Other Methods
    1. Boolean and Logic Programming Models
      1. Conjunctive Normal Form
      2. Disjunctive Normal Form
      3. Horn Clause
      4. Resolution
      5. Logic Programming
      6. Prolog
    2. Vector Space Model
    3. Probabilistic Model
    4. Fuzzy Set Model
    5. Latent Semantic Model
    6. Neural Network Model
  12. Text Tokens
    1. Single Term Based Indexing
    2. Phrase Based Indexing
    3. N-Gram Based Indexing
  13. Methodologies
    1. Stemming Algorithms
      1. Porter Stemming Algorithm
      2. The Lancaster Stemming Algorithm
      3. The Lovins stemming algorithm
      4. The Krovetz Stemmer
      5. Snowball: A language for stemming algorithms
    2. Text Searching
      1. Boyer Moore String Searching
      2. Knuth-Morris-Pratt Algorithm
    3. Parsing
  14. Visualization
  15. Open Directory
  16. Indexing and Retrieval of Genetic Text Collections
    1. Indexing Text Features in Genomic Repositories
    2. Sequence Matching
      1. Data Bases
        1. Genbank
        2. EMBL/EBI
      2. Alignment Algorithms
        1. Dot Plots
        2. Needleman-Wunsch
        3. Smith-Waterman
      3. Mumps Smith-Waterman Example
      4. FASTA
      5. BLAST
      6. Case Study: Indexing the "nt" Data Base
  17. Linguistics and Natural Language Processing
    1. Indo-European Languages
    2. Miscellaneous Linguistic Links
    3. Jakob Grimm
    4. Grimm's Law
    5. The Clair Library
  18. Basic Access Methods
    1. 64 Bit Addressing
    2. Sequential
    3. Random Access
      1. Basic Direct Access I/O
    4. Indexed Sequential
    5. Virtual Storage Access Method
  19. Key to Address Translation
    1. Hash tables
    2. Inverted Indices
    3. Lists
    4. Ordered lists
    5. Binary trees
      1. Huffman Trees
      2. Optimum Weight Balanced Trees
      3. Hu-Tucker Trees
      4. AVL Trees
    6. Tries and Suffix Trees
    7. B Trees
  20. Data Base Models
    1. Networked Data Bases
      1. CODASYL
    2. Hierarchical Data Bases
      1. IMS
      2. MUMPS
    3. Relational Data Bases
      1. Relational Calculus
      2. Relational Algebra
      3. SQL
      4. PostgreSQL
      5. MySQL
    4. Other
      1. Berkeley Data Base (Oracle)
  21. Other Topics
    1. Soundex Coding
    2. Readability Tests
    3. MD5 - Message Digest Algorithm 5
    4. Example Javascript Scripts
  22. Running information retrieval applications through Apache on Windows
  23. Data Bases
    1. Medical Subject Headings 2003 - mtrees2003.gz
    2. OSU-Medline Text - osu-medline.gz
    3. WikiPedia Text - wikipedia.txt.gz
  24. Experiments
    1. Wikipedia Results
      1. nohup-wiki-interp-linux
      2. nohup-wiki-interp-dos
      3. nohup-wiki-combined-dos
      4. nohup-wiki-combined-linux
      5. wiki.translated.txt.gz
      6. wiki.dictionary.sorted.gz
      7. wiki.zipf.gz
      8. wiki.good.gz
      9. wiki.idf.sorted.gz
      10. wiki.weighted-doc-vectors.gz
      11. wiki.weighted-term-vectors.gz
      12. wiki.cohesion.sorted.gz
      13. wiki.tt.sorted.gz
      14. wiki.jaccard-tt.sorted.gz
      15. wiki.dd2.gz
      16. wiki.clusters.gz
      17. wiki.discrim.sorted.gz
      18. wiki.ttfolder.gz
      19. sidhe.cs.uni.edu/cgi-bin/wiki/wikiWebFinder.cgi?query=anarchism Web Finder Demo (depends on server availability)
      20. sidhe.cs.uni.edu/cgi-bin/wiki/index.cgi?array=lib Folders Demo (depends on server availability).
    2. OSU Medline Results (Revised: March 16, 2007)
      1. nohup-medline-combined-linux
      2. nohup-medline-interp-linux
      3. nohup-medline-combined-dos
      4. nohup-medline-interp-dos
      5. medline.translated.txt.gz
      6. medline.dictionary.sorted.gz
      7. medline.zipf.gz
      8. medline.good.gz
      9. medline.idf.sorted.gz
      10. medline.weighted-doc-vectors.gz
      11. medline.weighted-term-vectors.gz
      12. medline.cohesion.sorted.gz
      13. medline.tt.sorted.gz
      14. medline.jaccard-tt.sorted.gz
      15. medline.dd2.gz
      16. medline.clusters.gz
      17. medline.discrim.sorted.gz
      18. medline.ttfolder.gz
      19. Web Finder Demo (depends on server availability).
      20. Folders Demo (depends on server availability).
    3. Computing Text Data Base
  25. References


Introduction

The purpose of this text is to illustrate several basic information storage and retrieval techniques through real world data experiments. Information retrieval is the art of identifying similarities between queries and objects in a database. In nearly all cases, the objects found as a result of the query will not be identical to the query but will resemble it in some fashion.

For example, if your query is "give me articles about aviation," the results might include articles about early pioneers in the field, technical reports on aircraft design, airline flight schedules, information on airports and so on. Typing the term "aviation" into Google, for instance, returns about 111,000,000 hits, all of which have something to do with aviation.

Information retrieval isn't restricted to text retrieval. If you have a clip of a musical piece (for example, from the Beethoven 9th Symphony) and you want to find other music similar to it (for example, from the Beethoven Choral Fantasy), you need a retrieval engine that can detect the similarities.

Similar examples exist in many other areas. In Bioinformatics, researchers often identify DNA or protein sequences and search massive databases for similar (and sometimes only distantly related) sequences. For example, consider the DNA sequence:

>gi|2695846|emb|Y13255.1|ABY13255 Acipenser baeri mRNA for immunoglobulin heavy chain, clone ScH 3.3 
TGGTTACAACACTTTCTTCTTTCAATAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTATAATAATGA
CAGCTCTATCAAGTGTCCGGTCTGATGTAGTGTTGACTGAGTCCGGACCAGCAGTTATAAAGCCTGGAGAGTCCCATAAA
CTGTCCTGTAAAGCCTCTGGATTCACATTCAGCAGCGCCTACATGAGCTGGGTTCGACAAGCTCCTGGAAAGGGTCTGGA
ATGGGTGGCTTATATTTACTCAGGTGGTAGTAGTACATACTATGCCCAGTCTGTCCAGGGAAGATTCGCCATCTCCAGAG
ACGATTCCAACAGCATGCTGTATTTACAAATGAACAGCCTGAAGACTGAAGACACTGCCGTGTATTACTGTGCTCGGGGC
GGGCTGGGGTGGTCCCTTGACTACTGGGGGAAAGGCACAATGATCACCGTAACTTCTGCTACGCCATCACCACCGACAGT
GTTTCCGCTTATGGAGTCATGTTGTTTGAGCGATATCTCGGGTCCTGTTGCTACGGGCTGCTTAGCAACCGGATTCTGCC
TACCCCCGCGACCTTCTCGTGGACTGATCAATCTGGAAAAGCTTTT

Here, the first line identifies the name and library accession numbers of the sequence and the subsequent lines are the DNA nucleotide codes (the letters A, C, G, and T represent Adenine, Cytosine, Guanine, and Thymine, respectively). A program known as BLAST (Basic Local Alignment Search Tool) can be used to find similar sequences in the online databases of known sequences. If you submit the above to BLAST at NCBI (National Center for Biotechnology Information), it will search the nr database of 6,284,619 nucleotide sequences, presently more than 22,427,755,047 bytes in length. The result is a list of hits in the data base ranked by their similarity to the query sequence. Sequences whose similarity score exceeds a threshold are displayed. One of these is:

>gb|U17058.1|LOU17058  Lepisosteus osseus Ig heavy chain V region mRNA, partial cds
Length=159

 Score =  151 bits (76),  Expect = 4e-33
 Identities = 133/152 (87%), Gaps = 0/152 (0%)
 Strand=Plus/Plus

Query  242  TGGGTGGCTTATATTTACTCAGGTGGTAGTAGTACATACTATGCCCAGTCTGTCCAGGGA  301
            |||||||| ||||||||| | | ||| || | |||||||||| |||||||||||||||||
Sbjct  4    TGGGTGGCGTATATTTACACCGATGGGAGCAATACATACTATTCCCAGTCTGTCCAGGGA  63

Query  302  AGATTCGCCATCTCCAGAGACGATTCCAACAGCATGCTGTATTTACAAATGAACAGCCTG  361
            |||||| |||||||||||||| ||||||| |    |||||| ||||| |||| |||||||
Sbjct  64   AGATTCACCATCTCCAGAGACAATTCCAAGAATCAGCTGTACTTACAGATGAGCAGCCTG  123

Query  362  AAGACTGAAGACACTGCCGTGTATTACTGTGC  393
            ||||||||||||||||| ||||||||||||||
Sbjct  124  AAGACTGAAGACACTGCTGTGTATTACTGTGC  155

In the display from BLAST seen above, the sections of the query that match the sequence in the database are shown. The numbers at the beginning and ends of the lines are the starting and ending points of the subsequence (counting from one at the start of each sequence). Where there are vertical lines between the query and the subject, there is an exact match. Where there are blanks, there was a mismatch.

It should be clear that, even though the subject differs from the query in many places, the two have a high degree of similarity.
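
The match line and the identity count in such a display can be reproduced from any pair of aligned segments. The following is a minimal sketch in Python, not part of BLAST; the function name match_line is hypothetical and the sketch assumes a block with no gaps, such as the last block shown above.

```python
from typing import Tuple


def match_line(query: str, subject: str) -> Tuple[str, int]:
    """Return the '|'/' ' match line and the number of identical positions.

    Assumes the two segments are already aligned and contain no gaps.
    """
    if len(query) != len(subject):
        raise ValueError("aligned segments must have equal length")
    marks = "".join("|" if q == s else " " for q, s in zip(query, subject))
    return marks, marks.count("|")


# Last aligned block of the display above (query 362-393, subject 124-155).
QUERY = "AAGACTGAAGACACTGCCGTGTATTACTGTGC"
SBJCT = "AAGACTGAAGACACTGCTGTGTATTACTGTGC"

if __name__ == "__main__":
    marks, identities = match_line(QUERY, SBJCT)
    print(QUERY)
    print(marks)
    print(SBJCT)
    print("identities: %d/%d" % (identities, len(QUERY)))
```

Summing the identities over all three blocks gives the "Identities" figure that BLAST reports in the header of the hit.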

Also, consider the search for similar images. Again, this involves searching for similarities, not identity. For example, a human observer would clearly see the two following pictures as dealing with the same subject, despite the differences:

An obvious question would be, how can you write a computer program to see the similarity?

Mumps

In this book we will mainly deal with text retrieval. The main software tool will be an interpreter for the Mumps language. As noted below, the Mumps language was originally developed to handle hierarchical medical record data. It did this through a facility known as "global arrays". Global arrays are large (64 bit file addressed), multi-dimensional, string indexed, disk based data structures used to store text and numbers. In their simplest form, they are text indexed vectors. At their most complex, they are text indexed trees storing both code and information.
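
To make the idea of a sparse, string indexed array concrete before Mumps itself is introduced, here is a rough in-memory sketch in Python. It is only an analogy: real global arrays are disk based, the class and method names are hypothetical, the sample values are made up, and plain string sorting is assumed where Mumps applies its own collating sequence.

```python
class SparseGlobal:
    """In-memory stand-in for a sparse, string indexed global array."""

    def __init__(self):
        # Maps a tuple of string indices to the string stored at that node.
        self._nodes = {}

    def put(self, *args):
        """put(i1, i2, ..., value): store value at node (i1, i2, ...)."""
        *indices, value = args
        self._nodes[tuple(str(i) for i in indices)] = str(value)

    def get(self, *indices, default=""):
        """Return the value stored at a node, or default if it does not exist."""
        return self._nodes.get(tuple(str(i) for i in indices), default)

    def children(self, *indices):
        """Return the sorted indices found one level below the given path."""
        path = tuple(str(i) for i in indices)
        depth = len(path)
        return sorted({key[depth] for key in self._nodes
                       if len(key) > depth and key[:depth] == path})


if __name__ == "__main__":
    terms = SparseGlobal()
    # Simplest form: a text indexed vector.
    terms.put("aviation", 3)
    terms.put("airport", 1)
    # More complex form: a tree with one index per level.
    terms.put("aviation", "history", "Wright brothers")
    print(terms.get("aviation"))
    print(terms.children())
    print(terms.children("aviation"))
```

Only the three nodes that were stored occupy space; no other element exists until it is assigned.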

The examples in this text are written in the dialect of Mumps accepted by the Mumps Compiler and related Mumps Interpreter. These are open source (GPL/LGPL licensed) free software packages. While mainly designed to run under Linux and Cygwin, there are versions available for Windows XP.

Greatest efficiency and interoperability with other software is achieved by compiling Mumps programs to C++ and then to executables. However, for both Linux and Windows XP there are stand-alone interpreters for Mumps. Interpreters increase program run time, but the increase is generally small because most of the examples shown here are predominantly I/O bound jobs, and interpreters are simple to install and run.

Installing Mumps

The examples in this text assume you are using the Mumps Interpreter with Cygwin or under Linux. Cygwin is a free, Linux-like environment that runs under Microsoft Windows. The following are the installation instructions for Cygwin and Linux:

  1. Installation for Cygwin

    1. Get and install Cygwin for Windows
    2. Get and install Mumps Interpreter under Cygwin: In Cygwin, type the following commands:
      1. wget http://cns2.uni.edu/~okane/source/MUMPS-MDH/mumpscompiler-10.0.src.tar.gz
        (Note: check http://cns2.uni.edu/~okane/source/MUMPS-MDH/ for the latest version number).
      2. tar xvzf mumpscompiler-10.0.src.tar.gz
      3. cd mumpsc
      4. ./configure prefix=/usr
      5. make
      6. make install

  2. Installation for Linux

    1. Be sure your Linux has the PCRE (Perl Compatible Regular Expression) development library installed.
    2. Get and install the Mumps Interpreter under Linux: In Linux as root, type:
      1. wget http://cns2.uni.edu/~okane/source/MUMPS-MDH/mumpscompiler-10.0.src.tar.gz
        (Note: check http://cns2.uni.edu/~okane/source/MUMPS-MDH/ for the latest version number).
      2. tar xvzf mumpscompiler-10.0.src.tar.gz
      3. cd mumpsc
      4. ./configure prefix=/usr
      5. make
      6. make install

  3. Installation for Windows without Cygwin

    1. Interpreter: download the file mumps.exe and place it in a directory that is in your search path (for example: \windows).

      This version of the interpreter is stand-alone and does not require Cygwin or the Microsoft C++ compiler. However, you will be missing many of the tools provided by Cygwin and Linux.

      This is not the recommended option. You may save time at the beginning but you will pay for it later.

      To run the interpreter, you must first open a command prompt window.

    2. You should download and install the GNU Win32 sort, tar, and gzip programs which are available in the 'Packages' section at http://gnuwin32.sourceforge.net/ . You should also consider installing the Vim editor which is available at http://www.vim.org/ . Since your Windows XP system already has a program named 'sort', you should rename the GNU version 'gnusort'. These packages should also be placed in a directory in your search path (e.g.: \windows).

      If you want to use and test web based software on your Windows PC, you will need a copy of the Apache Server for Windows XP. This can be downloaded from: http://www.apache.org/dist/httpd/binaries/win32/#released

      Find the latest file of the form: http://www.apache.org/dist/httpd/binaries/win32/apache_2.2.3-win32-x86-no_ssl.msi, download it then double click on it (it will initiate the self install procedure).

The Mumps Language

Information and documentation on the interpreter for the Mumps language are available separately.

There are many languages that, over the years, have been used to implement Information Storage and Retrieval systems. In the approach taken in this document, Mumps is used.

Mumps (also referred to as 'M') is a general purpose programming language that supports a native hierarchical data base facility. It is supported by a large user community (mainly biomedical), and a diversified installed application software base. The language originated in the mid-60's at the Massachusetts General Hospital and it became widely used in both clinical and commercial settings. A dwindling number of implementations exist for the language. There have been ANSI, ISO (ISO/IEC 11756:1992) and DOD approved standards for Mumps.

As originally conceived, Mumps differed from other mini-computer based languages of the late 1960's by providing: 1) an easily manipulated hierarchical (multi-dimensional) data base that was well suited to representing medical records; 2) flexible string handling support; and 3) multiple concurrent tasks in limited memory on very small machines. Syntactically, Mumps is based on an earlier language named JOSS and has an appearance that is similar to early versions of Basic that were also based on JOSS.

The Mumps Compiler is a translator that converts Mumps to C++. The Mumps Interpreter directly executes Mumps programs from source text files. With the compiler, Mumps programs are translated to standard C++ programs and subsequently compiled to binary executables. The compiler distribution contains the compiler source code, the manual, the run-time functions source code, all written in C/C++, and examples, written in Mumps. The Mumps Interpreter is actually just a compiled Mumps program that reads and executes (through the 'xecute' command) text.

The MDH (Multi-Dimensional and Hierarchical Data Base Toolkit) is a Linux-based, open source toolkit of portable software that supports Mumps-compatible, very fast, flexible, multi-dimensional and hierarchical storage, retrieval and manipulation of data bases ranging in size up to 256 terabytes. The package is written in C and C++ and is available under the GNU GPL/LGPL licenses in source code form. You must install the Mumps Compiler in order to use the MDH.

Comparing Mumps Information Storage and Retrieval Implementations to Other Approaches

In order to evaluate different approaches to Information Storage and Retrieval experiments, we implemented in Mumps a basic automatic indexing experiment along the lines of that given in Chapter 9 of Salton (1989). Salton's approach makes heavy usage of vectors and matrices to store documents, terms, text, queries and intermediate results. From these experiments we were able to assess the viability of Mumps in terms of ease of use, speed, storage requirements, programmer productivity, and suitability to the programming problems at hand. The details are given below.

When working with a document set of any meaningful scope, vectors, matrices and file structures can quickly grow to enormous size. The information retrieval system was tested using a collection of documents concerning computer science. Each document consisted of a title, reference information, and an abstract averaging approximately 15 lines in length.

In one test, there were 5,614 documents with 132,502 word occurrences of which, not counting stop list words, 7,812 words were unique with an average frequency of use per word of approximately 15.

If viewed strictly as a two-dimensional array, the initial document-term matrix was 5,614 by 7,812 (43,856,568 elements) and the potential term-term correlation matrix on the unique words would have been in excess of 61 million elements.

Representing data structures of this size and providing fast, efficient, direct access to a value stored at any element is of critical importance to a matrix based implementation. An ideal implementation language will provide a transparent means by which the conceptual model can be realized through indexed access to elements of vectors and matrices by character string keyword rather than by numeric subscript as is typically the case in most languages. Furthermore, the extent and number of array dimensions must be dynamically establishable.

  1. Document-Term Matrix

    In our experiment, the size of the document-term matrix was 43,856,568 elements (5,614 documents by 7,812 terms). In this model, each row represents a document and the columns represent terms. The number in the matrix for a given document number and term is the frequency of occurrence of the term in the document.

    In a typical document-term matrix, many elements have values of zero. This is the case when a term does not occur in a particular document. In this experiment, the average number of terms per document was approximately 15. Thus, nearly 7,800 possible positions per row were zero (non-existent) in a typical case.

    In order to quickly access the rows, the locations of the rows should be predictable. That is, the rows should be of fixed length thus allowing a disk access method to access the vector for any document by multiplying the document number by the row size and thus calculating an offset relative to the start of the file where the record is located.

  2. Coded Tuples

    One approach to representing the matrix is to represent each row (document) as a collection of tuples each consisting of a token and a frequency. The token identifies the term and the frequency gives the weight of the term in the document. A minimum of four bytes would be required for each tuple. Allowing for 100 terms per row (document - a worst case estimate), this requires a 2,245,600 byte file to represent the test collection (5,614 * 100 * 4).

  3. Bit Maps

    Alternatively, a bit mapping model represents documents as positional binary vectors with a "1" indicating that a given term occurs in a document and a "0" indicating that it does not. While this is done to conserve space and improve vector access time, it also precludes the storage of information concerning the relative weight or strength of the term in a document. Using the test data set, a positional binary vector representation of each document would be 977 bytes in length (7,812 bits rounded up to whole bytes) for a total of 5,484,878 bytes for the collection as a whole.

  4. SQL

    A row-wise vector representation in which each term was represented by a numeric frequency count of two bytes would require 15,624 bytes per document (row) or 87,713,136 bytes to represent the entire collection.

  5. Mumps Global Arrays

    The Mumps global array model stores only elements that exist along with indexing information. There were 83,895 non-zero elements in the document-term matrix. Each element consists of a frequency which, including overhead, required approximately 21 bytes for a total storage requirement of approximately 1,761,795 bytes for the collection as a whole. This constitutes a substantial reduction in overall storage requirements and results in faster file access. Such a figure also makes it possible to reasonably consider much larger data bases.
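
The storage figures quoted above follow directly from the collection statistics. The short Python sketch below simply repeats the arithmetic so the estimates can be recomputed for a different collection; the constant and function names are illustrative, and the per-element sizes (4, 2 and about 21 bytes) are the ones assumed in the text.

```python
import math

DOCS = 5614       # documents in the test collection
TERMS = 7812      # unique terms, not counting stop list words
NONZERO = 83895   # non-zero elements in the document-term matrix


def storage_estimates():
    """Return the estimated size in bytes of each representation."""
    return {
        "coded tuples (100 tuples of 4 bytes per row)": DOCS * 100 * 4,
        "bit map (1 bit per term, whole bytes per row)": DOCS * math.ceil(TERMS / 8),
        "row-wise vector (2 bytes per term)": DOCS * TERMS * 2,
        "global array (about 21 bytes per element)": NONZERO * 21,
    }


if __name__ == "__main__":
    print("document-term elements: {:,}".format(DOCS * TERMS))
    print("term-term elements:     {:,}".format(TERMS * TERMS))
    for name, size in storage_estimates().items():
        print("{:48s} {:>12,} bytes".format(name, size))
```

Only the global array estimate depends on the number of non-zero elements rather than on the full dimensions of the matrix, which is why it grows slowly as the vocabulary grows.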


Introduction to Mumps Programming

See the Mumps book for a basic introduction to the Mumps Language.

  1. The Mumps programs described in this document can be run in either of two ways: either as interpreted code using the Mumps Interpreter or as binary executables resulting from application of the Mumps Compiler. Binary programs run faster than interpreted programs but the difference can be small if the programs are predominantly input/output jobs. The Mumps Interpreter is created by compiling the program "mumps.mps" provided with the distribution. For WinXP based systems, the interpreter may be a simpler and easier approach to developing applications. All programs that execute with the interpreter can be compiled by the compiler.

  2. Programs to be executed by the interpreter can have any extension but ".mps" is preferred.

  3. To run a Mumps Program, type:

    mumps myprog.mps
    

    Alternatively, if the first line of your Mumps program is:

    #!/usr/bin/mumps

    and your Mumps source code file has the executable attribute, you only need to type the name of the program. For example:

    #!/usr/bin/mumps
    # this is program hello.mps
          write "Hello world",!
          halt
    

    If "hello.mps" has the executable attribute, you only need to type hello.mps at the Linux or Cygwin prompt.

  4. Unlike other languages which employ many data types, Mumps has basically one data type - string. Strings that contain numbers, however, can have arithmetic operations performed on them. Neither global nor local variables need to be declared - they will be created as needed (note: an extension in the Mumps Compiler permits scalar variables to be pre-declared in order to improve performance). Variables can be destroyed by the KILL command.

  5. Global variables are normally used as arrays. In Mumps, an array reference is formed by the name of the variable followed by a parenthesized list of comma separated indices. The subscripts may be either numeric or character strings. Example:

    set ^patient("Jones, John", "Jan 10, 2005", "diagnosis" )="flu"

    The above creates a global array element addressed by three indices at which is stored the string "flu."

    Mumps arrays are not pre-declared and they are sparse. That is, only those elements which you explicitly create actually exist. For example, if you create element ^A(10), it does not necessarily mean that elements ^A(1) through ^A(9) exist.

    Global arrays are often interpreted as trees where each successive index describes the path through a multi-way tree. At each node, data can be stored (or not). The path from the root to a node is given by the sequence of indices of an array reference. The data base can store many trees, each distinguished by its array name.

    The Mumps global array facility is due to the early uses of Mumps in medical data bases which are basically hierarchical in nature. The Mumps global arrays were a solution to the problem of how to represent the tree-like structure of patient data in a simple and easily manipulated data structure.

    For example, consider a basic patient record. At the top level is the patient's id node at which is stored the patient's name. At the second level, are nodes for demographic (address, gender, phone number, etc.) data and the main entry node for clinical data. Clinical data is organized by diagnostic or problem category and each problem or diagnostic code is divided into episodes of the problem organized by onset date. For a given problem and onset, the data are divided by category (medications, lab tests, orders, notes, etc.) which are further subdivided by, for example, in the case of lab tests, test, date, time and result. For example:

    Here, the tree is named patient which is also the name of the global array (notice that global arrays always have a circumflex (^) preceding their name). The Mumps code to populate the above might look like:

          set ^patient("123-45-6789")="Jones, John, J"
          set ^patient("123-45-6789","Demographics","Street")="123 Elm St"
          set ^patient("123-45-6789","Demographics","City")="Anytown"
          set ^patient("123-45-6789","Demographics","State")="IA"
          set ^patient("123-45-6789","Demographics","ZIP")="50613"
          set ^patient("123-45-6789","Dx",789.00,"6/23/2005")="Dr Smith"
          set ^patient("123-45-6789","Dx",789.00,"6/23/2005","lab","HCT","6/23/2005","10:45",45.2)=""
          set ^patient("123-45-6789","Dx",789.00,"6/23/2005","lab","HCT","6/23/2005","20:45",43.2)=""
          set ^patient("123-45-6789","Dx",789.00,"6/23/2005","lab","HCT","6/24/2005","21:10",44.2)=""
          set ^patient("123-45-6789","Dx",789.00,"6/23/2005","lab","HCT","6/25/2005","14:10",44.2)=""
    

    Notice that the empty string can be stored at a node. In these cases, the actual data (the lab test result) is the value of the final index. Also note that each intermediate node need not be created. The nodes representing "Demographics", "Dx", "lab", "HCT", and others are not explicitly created. Their creation is implicit in constructing the longer paths of which they are intermediates.

  6. All Mumps statements begin with a keyword. A keyword can be fully spelled out or, in many cases, abbreviated. The common keywords are:

    set read write if else halt hang use open close do for quit break

  7. After a keyword there is one blank followed by (in most cases) an "argument". An argument may not have any embedded blanks unless they are within quotes ("..."). Blanks are delimiters in Mumps and their use is restricted.

  8. More than one command may be on the same line if there is one or more blanks following the previous command's argument. Two or more blanks are required if the previous command had no argument. For example:
    set a="abc" write a,!
    quit:a=b  write a,!
    

    In the first line, both a set (assignment) command and a write command are on the same line. In the second line, the quit has no argument so there are two blanks separating it from the write. The expression :a=b on the quit is not an argument: it is called a post-conditional. A post-conditional is an expression that is evaluated before the command executes. If it is true, the command is executed. If it is false, the command is not executed. Most commands may have post-conditionals attached to the command word.

  9. There is no operator precedence. All expressions are evaluated strictly left to right unless you use parentheses. This can cause problems if you are not careful. The expression:

    a+b-c*d/e means: ((((a+b)-c)*d)/e)
    
    Note that:
    
    if a=b&c=d write "hello",!  
    
    means:
    
    if (((a=b)&c)=d) write "hello",!
    
    which is probably not what you wanted.  You probably wanted:
    
    if (a=b)&(c=d) write "hello",!
    

  10. The main operators are +, -, *, /, \, #, **, _, >, <, [, ], ', ?, and @. See the manual for a description of the operators. Note that there is an integer division operator (\) as well as a floating point division operator (/).

  11. Mumps has many functions for string processing. These are covered in the manual and you should study them. Mumps also permits indirect execution, that is, your program can create and execute code dynamically.
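
The left to right rule in item 9 is easy to get wrong when coming from other languages. As an illustration only, the following Python sketch evaluates arithmetic expressions the way that rule describes, applying each operator as soon as its right operand has been read. It is not the Mumps interpreter: the function names are hypothetical, only +, -, *, / and parentheses on unsigned numbers are handled, and every result is returned as a Python float.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}
TOKEN = re.compile(r"\s*(\d+(?:\.\d+)?|[-+*/()])")


def tokenize(text):
    """Split an expression into numbers, operators and parentheses."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        found = TOKEN.match(text, pos)
        if found is None:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(found.group(1))
        pos = found.end()
    return tokens


def operand(tokens, i):
    """Read a number or a parenthesized sub-expression starting at tokens[i]."""
    if i >= len(tokens):
        raise ValueError("operand expected")
    if tokens[i] == "(":
        value, i = expression(tokens, i + 1)
        if i >= len(tokens) or tokens[i] != ")":
            raise ValueError("missing closing parenthesis")
        return value, i + 1
    return float(tokens[i]), i + 1


def expression(tokens, i):
    """Apply operators strictly left to right, with no precedence."""
    value, i = operand(tokens, i)
    while i < len(tokens) and tokens[i] in OPS:
        right, following = operand(tokens, i + 1)
        value = OPS[tokens[i]](value, right)
        i = following
    return value, i


def evaluate(text):
    """Evaluate a whole expression and reject trailing tokens."""
    tokens = tokenize(text)
    value, i = expression(tokens, 0)
    if i != len(tokens):
        raise ValueError("unexpected token " + tokens[i])
    return value


if __name__ == "__main__":
    # Left to right: ((((10+2)-4)*3)/6)
    print(evaluate("10+2-4*3/6"))
    # Parentheses restore the conventional grouping.
    print(evaluate("10+2-(4*3/6)"))
```

With a=10, b=2, c=4, d=3 and e=6, the expression a+b-c*d/e from item 9 evaluates to 4 under the left to right rule, whereas conventional precedence would give 10.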


Example Programs

  1. Write a program to build a global array index from the 2003 U.S. NLM MeSH (Medical Subject Heading) Tree Hierarchy. The MeSH codes are used to code medical records and are an ongoing research project of the U.S. National Library of Medicine. The codes used here are from 2003. Newer versions, essentially similar to these, are available from NLM. A local copy of the 2003 MeSH tree hierarchy is listed under Data Bases (mtrees2003.gz). Note: this copy is out of date and is used here purely as an example.

    The following is a sample of the MESH tree hierarchy:

    Body Regions;A01
    Abdomen;A01.047
    Abdominal Cavity;A01.047.025
    Peritoneum;A01.047.025.600
    Douglas' Pouch;A01.047.025.600.225
    Mesentery;A01.047.025.600.451
    Mesocolon;A01.047.025.600.451.535
    Omentum;A01.047.025.600.573
    Peritoneal Cavity;A01.047.025.600.678
    Retroperitoneal Space;A01.047.025.750
    Abdominal Wall;A01.047.050
    Groin;A01.047.365
    Inguinal Canal;A01.047.412
    Umbilicus;A01.047.849
    Back;A01.176
    Lumbosacral Region;A01.176.519
    Sacrococcygeal Region;A01.176.780
    Breast;A01.236
    Nipples;A01.236.500
    Extremities;A01.378
    Amputation Stumps;A01.378.100
    

    The format is: text description, semi-colon, code hierarchy. Thus, "Body Regions" is code A01, the "Abdomen" is A01.047, the Peritoneum is A01.047.025.600 and so forth. The goal is to build a global array tree where each successive index is a successive code in the MESH hierarchy and the text of each entry is stored in the tree at the appropriate level. Thus, we want something like:

    set ^mesh("A01")="Body Regions"
    set ^mesh("A01","047")="Abdomen"
    set ^mesh("A01","047","025")="Abdominal Cavity"
    set ^mesh("A01","047","025","600")="Peritoneum"
    .
    .
    .
    set ^mesh("A01","047","365")="Groin"
    .
    .
    .
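
    Each input line can be taken apart with $piece before any global reference is built. The following is a minimal sketch rather than part of the original program: the sample line is hard-coded, and the variable names a, key, code and x simply mirror those used in the program that follows. It splits the line at the semicolon and then walks the dot-separated codes:

    ```mumps
    #!/usr/bin/mumps
    #     sketch only: split one MeSH line with $piece
          set a="Peritoneum;A01.047.025.600"   // hard-coded sample line
          set key=$piece(a,";",1)              // text description
          set code=$piece(a,";",2)             // code hierarchy
          write key,!
          write code,!
          for i=1:1 do
          . set x=$piece(code,".",i)           // i-th code, empty past the end
          . if x="" break
          . write i," ",x,!
    ```

    Each code is written on its own line together with its position, which is the order in which the codes become subscripts of ^mesh.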
    

    The global array can be built with a program such as:

    #!/usr/bin/mumps
    #     mtree.mps January 13, 2008
    #     Copyright 2007 K. C. O'Kane - GPL License applies
          open 1:"mtrees2003.txt,old"
          for  do
          . use 1
          . read a
          . if '$test break
          . set key=$piece(a,";",1)  // text description
          . set code=$piece(a,";",2) // everything else
          . if key=""!(code="") break
    
          . for i=1:1 do
          .. set x(i)=$piece(code,".",i)  // extract code numbers
          .. if x(i)="" break
    
          . set i=i-1
          . use 5
          . set z="^mesh("        // begin building a global reference
    
    #-----------------------------------------------------------------------
    #     build a reference like ^mesh("A01","047","025","600")
    #     by concatenating quotes, codes, quotes, and commas onto z
    #-----------------------------------------------------------------------
    
          . for j=1:1:i-1 set z=z_""""_x(j)_""","
          . set z="set "_z_""""_x(i)_""")="""_key_""""
    
    #-----------------------------------------------------------------------
    #     z now looks like set ^mesh("A01","047")="Abdomen"
    #     now execute the text
    #-----------------------------------------------------------------------
    
          . write z,!
          . xecute z
    
          close 1
          use 5
          write "done",!
          halt
    

    Notes:

    • In the program above, for each line read from mtrees2003.txt, a string containing the text of a "set" command like those shown above is created.

    • Note that to embed a double-quote character (") into a string, you place two immediately adjacent double-quote characters into the string. Thus: """" means a string of length one containing a double-quote character.

    • The final string is passed as an argument to the xecute command. The xecute command treats its argument as a line of Mumps code (with limitations, however) and executes it as though it were part of the original program. Execution of this type uses the interpreter and, consequently, is much slower than compiled code.

    • Notice the line:

      . if key=""!(code="") break

      uses the OR operator (!). The parentheses are needed because Mumps evaluates expressions strictly from left to right, without operator precedence; without them, the expression would be evaluated as ((key="")!code)="".

    • Notice the line:

      . for j=1:1:i-1 set z=z_""""_x(j)_""","

      uses the concatenation operator (_) as well as a local array x(j). Local arrays should be used as little as possible since access to them through the Mumps run-time symbol table can be slow if there are a lot of elements in the symbol table.

    • Notice near the end the close command that releases the file associated with unit 1 and, consequently, makes it available for reuse. Closing a file opened for input is not strictly needed unless you want to reuse the unit number. Closing a file opened for output is necessary in order to flush the internal system buffers to disk. If the program crashes before an output file is closed, it is possible to lose data.

    • The output looks like this:

      
      set ^mesh("A01")="Body Regions"
      set ^mesh("A01","047")="Abdomen"
      set ^mesh("A01","047","025")="Abdominal Cavity"
      set ^mesh("A01","047","025","600")="Peritoneum"
      set ^mesh("A01","047","025","600","225")="Douglas' Pouch"
      set ^mesh("A01","047","025","600","451")="Mesentery"
      set ^mesh("A01","047","025","600","451","535")="Mesocolon"
      set ^mesh("A01","047","025","600","573")="Omentum"
      set ^mesh("A01","047","025","600","678")="Peritoneal Cavity"
      set ^mesh("A01","047","025","750")="Retroperitoneal Space"
      set ^mesh("A01","047","050")="Abdominal Wall"
      set ^mesh("A01","047","365")="Groin"
      set ^mesh("A01","047","412")="Inguinal Canal"
      set ^mesh("A01","047","849")="Umbilicus"
      set ^mesh("A01","176")="Back"
      set ^mesh("A01","176","519")="Lumbosacral Region"
      set ^mesh("A01","176","780")="Sacrococcygeal Region"
      set ^mesh("A01","236")="Breast"
      set ^mesh("A01","236","500")="Nipples"
      set ^mesh("A01","378")="Extremities"
      set ^mesh("A01","378","100")="Amputation Stumps"
      set ^mesh("A01","378","610")="Lower Extremity"
      set ^mesh("A01","378","610","100")="Buttocks"
      set ^mesh("A01","378","610","250")="Foot"
      set ^mesh("A01","378","610","250","149")="Ankle"
      set ^mesh("A01","378","610","250","300")="Forefoot, Human"
      set ^mesh("A01","378","610","250","300","480")="Metatarsus"
      
      
      .
      .
      .
      

    • You may print the global ^mesh() data base as follows:

      #!/usr/bin/mumps
      # mtreeprint.mps January 13, 2008
            for lev1=$order(^mesh(lev1)) do
            . write lev1," ",^mesh(lev1),!
            . for lev2=$order(^mesh(lev1,lev2)) do
            .. write ?5,lev2," ",^mesh(lev1,lev2),!
            .. for lev3=$order(^mesh(lev1,lev2,lev3)) do
            ... write ?10,lev3," ",^mesh(lev1,lev2,lev3),!
            ... for lev4=$order(^mesh(lev1,lev2,lev3,lev4)) do
            .... write ?15,lev4," ",^mesh(lev1,lev2,lev3,lev4),!
      
      yields:
      
      A01 Body Regions
           047 Abdomen
                025 Abdominal Cavity
                     600 Peritoneum
                     750 Retroperitoneal Space
                050 Abdominal Wall
                365 Groin
                412 Inguinal Canal
                849 Umbilicus
           176 Back
                519 Lumbosacral Region
                780 Sacrococcygeal Region
           236 Breast
                500 Nipples
           378 Extremities
                100 Amputation Stumps
                610 Lower Extremity
                     100 Buttocks
                     250 Foot
                     400 Hip
                     450 Knee
                     500 Leg
                     750 Thigh
                800 Upper Extremity
                     075 Arm
                     090 Axilla
                     420 Elbow
                     585 Forearm
                     667 Hand
                     750 Shoulder
           456 Head
                313 Ear
                505 Face
                     173 Cheek
                     259 Chin
                     420 Eye
                     580 Forehead
                     631 Mouth
                     733 Nose
                     750 Parotid Region
                810 Scalp
                830 Skull Base
                     150 Cranial Fossa, Anterior
                     165 Cranial Fossa, Middle
                     200 Cranial Fossa, Posterior
           598 Neck
           673 Pelvis
                600 Pelvic Floor
           719 Perineum
           911 Thorax
                800 Thoracic Cavity
                     500 Mediastinum
                     650 Pleural Cavity
                850 Thoracic Wall
           960 Viscera
      A02 Musculoskeletal System
           165 Cartilage
                165 Cartilage, Articular
                207 Ear Cartilages
                410 Intervertebral Disk
                507 Laryngeal Cartilages
                     083 Arytenoid Cartilage
                     211 Cricoid Cartilage
                     411 Epiglottis
                     870 Thyroid Cartilage
                590 Menisci, Tibial
                639 Nasal Septum
           340 Fascia
                424 Fascia Lata
           513 Ligaments
                170 Broad Ligament
                514 Ligaments, Articular
                     100 Anterior Cruciate Ligament
                     162 Collateral Ligaments
                     287 Ligamentum Flavum
                     350 Longitudinal Ligaments
                     475 Patellar Ligament
                     600 Posterior Cruciate Ligament
      
      .
      .
      .
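
      The program above relies on the for command to advance through the subscripts returned by $order. For comparison, here is a sketch of the same traversal for the first two levels, written with an explicit $order call and break in the style of the other loops in this text. It assumes that $order returns the empty string after the last subscript, as in standard Mumps, and that every node visited has a value:

      ```mumps
      #!/usr/bin/mumps
      #     sketch only: explicit $order loops over two levels of ^mesh
            set lev1=""
            for  do
            . set lev1=$order(^mesh(lev1))         // next first-level code
            . if lev1="" break                     // no more first-level codes
            . write lev1," ",^mesh(lev1),!
            . set lev2=""
            . for  do
            .. set lev2=$order(^mesh(lev1,lev2))   // next second-level code
            .. if lev2="" break
            .. write ?5,lev2," ",^mesh(lev1,lev2),!
      ```

      Deeper levels follow the same pattern, with one more nested loop for each additional subscript.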
      

      Alternatively, using some of the newer Mumps functions, the table can be printed as:

      #!/usr/bin/mumps
      #       mtreeprintnew.mps January 13, 2008
              set x="^mesh(0)"
              for  do
              . set x=$query(x)
              . if x="" break
              . if $piece(x,"(",1)'="^mesh" break
              . set i=$qlength(x)
              . write ?i*2," ",$qsubscript(x,i)," ",@x,?50,x,!
      

      which produces the output:

        A01 Body Regions                              ^mesh("A01")
          047 Abdomen                                 ^mesh("A01","047")
            025 Abdominal Cavity                      ^mesh("A01","047","025")
              600 Peritoneum                          ^mesh("A01","047","025","600")
                225 Douglas' Pouch                    ^mesh("A01","047","025","600","225")
                451 Mesentery                         ^mesh("A01","047","025","600","451")
                  535 Mesocolon                       ^mesh("A01","047","025","600","451","535")
                573 Omentum                           ^mesh("A01","047","025","600","573")
                678 Peritoneal Cavity                 ^mesh("A01","047","025","600","678")
              750 Retroperitoneal Space               ^mesh("A01","047","025","750")
            050 Abdominal Wall                        ^mesh("A01","047","050")
            365 Groin                                 ^mesh("A01","047","365")
            412 Inguinal Canal                        ^mesh("A01","047","412")
            849 Umbilicus                             ^mesh("A01","047","849")
          176 Back                                    ^mesh("A01","176")
            519 Lumbosacral Region                    ^mesh("A01","176","519")
            780 Sacrococcygeal Region                 ^mesh("A01","176","780")
          236 Breast                                  ^mesh("A01","236")
            500 Nipples                               ^mesh("A01","236","500")
          378 Extremities                             ^mesh("A01","378")
            100 Amputation Stumps                     ^mesh("A01","378","100")
            610 Lower Extremity                       ^mesh("A01","378","610")
              100 Buttocks                            ^mesh("A01","378","610","100")
              250 Foot                                ^mesh("A01","378","610","250")
                149 Ankle                             ^mesh("A01","378","610","250","149")
                300 Forefoot, Human                   ^mesh("A01","378","610","250","300")
                  480 Metatarsus                      ^mesh("A01","378","610","250","300","480")
                  792 Toes                            ^mesh("A01","378","610","250","300","792")
                    380 Hallux                        ^mesh("A01","378","610","250","300","792","380")
                510 Heel                              ^mesh("A01","378","610","250","510")
              400 Hip                                 ^mesh("A01","378","610","400")
              450 Knee                                ^mesh("A01","378","610","450")
              500 Leg                                 ^mesh("A01","378","610","500")
              750 Thigh                               ^mesh("A01","378","610","750")
            800 Upper Extremity                       ^mesh("A01","378","800")
              075 Arm                                 ^mesh("A01","378","800","075")
              090 Axilla                              ^mesh("A01","378","800","090")
              420 Elbow                               ^mesh("A01","378","800","420")
              585 Forearm                             ^mesh("A01","378","800","585")
              667 Hand                                ^mesh("A01","378","800","667")
                430 Fingers                           ^mesh("A01","378","800","667","430")
                  705 Thumb                           ^mesh("A01","378","800","667","430","705")
                715 Wrist                             ^mesh("A01","378","800","667","715")
              750 Shoulder                            ^mesh("A01","378","800","750")
          456 Head                                    ^mesh("A01","456")
            313 Ear                                   ^mesh("A01","456","313")
            505 Face                                  ^mesh("A01","456","505")
              173 Cheek                               ^mesh("A01","456","505","173")
              259 Chin                                ^mesh("A01","456","505","259")
              420 Eye                                 ^mesh("A01","456","505","420")
                338 Eyebrows                          ^mesh("A01","456","505","420","338")
                504 Eyelids                           ^mesh("A01","456","505","420","504")
                  421 Eyelashes                       ^mesh("A01","456","505","420","504","421")
              580 Forehead                            ^mesh("A01","456","505","580")
              631 Mouth                               ^mesh("A01","456","505","631")
                515 Lip                               ^mesh("A01","456","505","631","515")
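
    Several points from the notes on mtree.mps can be tried in isolation. The following is a small sketch under stated assumptions: ^tmp is a hypothetical scratch global chosen so that ^mesh is not disturbed, and the last part assumes the implementation accepts name indirection (@) on the left side of set, as standard Mumps does.

    ```mumps
    #!/usr/bin/mumps
    #     sketch only: points from the notes on mtree.mps

    #     1. strict left-to-right evaluation, no operator precedence
          write 2+3*4,!        // evaluated as (2+3)*4
          write 2+(3*4),!      // parentheses force 3*4 first

    #     2. doubled quotes embed a quote character in a string
          write $length(""""),!
          set z="set ^tmp(""A01"")=""Body Regions"""
          write z,!

    #     3. name indirection as an alternative to xecute (assumed supported)
          set ref="^tmp(""A01"",""047"")"
          set @ref="Abdomen"
          write @ref,!
    ```

    In standard Mumps the first two lines print 20 and 14, and $length("""") is 1. The last part stores a value through a constructed reference without assembling and executing a complete set command, which may be worth comparing with the xecute approach for speed.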
      

  2. Write a program to scan input text using builtin parsing and stem reducing functions to reduce the text to word tokens.

    1. OSU Medline Data Base

      The text used here and in subsequent examples, the OSU Medline Data Base, is derived from the TREC-9 Filtering Track. The TREC (Text REtrieval Conference) meetings are annual events sponsored by the National Institute of Standards and Technology (NIST). The TREC-9 Filtering Track data base consists of a collection of medically related titles and abstracts:

      "... The OHSUMED test collection is a set of 348,566 references from MEDLINE, the on-line medical information database, consisting of titles and/or abstracts from 270 medical journals over a five-year period (1987-1991). The available fields are title, abstract, MeSH indexing terms, author, source, and publication type. The National Library of Medicine has agreed to make the MEDLINE references in the test database available for experimentation, restricted to the following conditions:

      1. The data will not be used in any non-experimental clinical, library, or other setting.

      2. Any human users of the data will explicitly be told that the data is incomplete and out-of-date.

      The OHSUMED document collection was obtained by William Hersh (hersh@OHSU.EDU) and colleagues for the experiments described in the papers below:

      Hersh WR, Buckley C, Leone TJ, Hickam DH, OHSUMED: An interactive retrieval evaluation and new large test collection for research, Proceedings of the 17th Annual ACM SIGIR Conference, 1994, 192-201.

      Hersh WR, Hickam DH, Use of a multi-application computer workstation in a clinical setting, Bulletin of the Medical Library Association, 1994, 82: 382-389. ..."

      http://trec.nist.gov/data/t9_filtering/README

      Data from the OHSUMED file were modified and edited into a format similar to that currently used by MEDLINE in order to present a more easily managed file. The original format used many very long lines which were inconvenient to manipulate as well as a number of fields that were not of interest for this study. The conversion programs are given in the experimental data bases section below. The revised data base has the following appearance:

      STAT- MEDLINE
      MH    Acetaldehyde/*ME
      MH    Buffers
      MH    Catalysis
      MH    HEPES/PD
      MH    Nuclear Magnetic Resonance
      MH    Phosphates/*PD
      MH    Protein Binding
      MH    Ribonuclease, Pancreatic/AI/*ME
      MH    Support, U.S. Gov't, Non-P.H.S.
      MH    Support, U.S. Gov't, P.H.S.
      TI    The binding of acetaldehyde to the active site of ribonuclease: alterations in catalytic ...
      AB    Ribonuclease A was reacted with [1-13C,1,2-14C]acetaldehyde 
            and sodium cyanoborohydride in the presence or absence 
            of 0.2 M phosphate. After several hours of incubation 
            at 4 degrees C (pH 7.4) stable acetaldehyde-RNase adducts 
            were formed, and the extent of their formation was 
            similar regardless of the presence of phosphate. Although 
            the total amount of covalent binding was comparable 
            in the absence or presence of phosphate, this active 
            site ligand prevented the inhibition of enzymatic activity 
            seen in its absence. This protective action of phosphate 
            diminished with progressive ethylation of RNase, indicating 
            that the reversible association of phosphate with the 
            active site lysyl residue was overcome by the irreversible 
            process of reductive ethylation. Modified RNase was 
            analysed using 13C proton decoupled NMR spectroscopy. 
            Peaks arising from the covalent binding of enriched 
            acetaldehyde to free amino groups in the absence of 
            phosphate were as follows: NH2-terminal alpha amino 
            group, 47.3 ppm; bulk ethylation at epsilon amino groups 
            of nonessential lysyl residues, 43.0 ppm; and the epsilon 
            amino group of lysine-41 at the active site, 47.4 ppm. 
            In the spectrum of RNase ethylated in the presence 
            of phosphate, the peak at 47.4 ppm was absent. When 
            RNase was selectively premethylated in the presence 
            of phosphate, to block all but the active site lysyl 
            residues and then ethylated in its absence, the signal 
            at 43.0 ppm was greatly diminished, and that arising 
            from the active site lysyl residue at 47.4 ppm was 
            enhanced. These results indicate that phosphate specifically 
            protected the active site lysine from reaction with 
            acetaldehyde, and that modification of this lysine 
            by acetaldehyde adduct formation resulted in inhibition 
            of catalytic activity.
      
      STAT- MEDLINE
      MH    Adult
      MH    Alcohol, Ethyl/*AN
      MH    Breath Tests/*
      MH    Human
      MH    Irrigation
      MH    Male
      MH    Middle Age
      MH    Mouth/*
      MH    Temperature
      MH    Water
      TI    Reductions in breath ethanol readings in normal male volunteers following mouth ...
      AB    Blood ethanol concentrations were measured sequentially, 
            over a period of hours, using a Lion AE-D2 alcolmeter, 
            in 12 healthy male subjects given oral ethanol 0.5 
            g/kg body wt. Readings were taken before and after 
            rinsing the mouth with water at varying temperatures. 
            Mouth rinsing resulted in a reduction in the alcolmeter 
            readings at all water temperatures tested. The magnitude 
            of the reduction was greater after rinsing with water 
            at lower temperatures. This effect occurs because rinsing 
            cools the mouth and dilutes retained saliva. This finding 
            should be taken into account whenever breath analysis 
            is used to estimate blood ethanol concentrations in 
            experimental situations.
      
      .
      .
      .
      
      (Note: long lines truncated from the above)
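
      In this format, field lines such as MH, TI and AB carry a two-letter tag in columns 1 and 2, with the text starting in column 7, so a field can be picked out with $extract. A minimal sketch on a hard-coded sample line (the variable names tag and text are illustrative only):

      ```mumps
      #!/usr/bin/mumps
      #     sketch only: pick apart one tagged line with $extract
            set line="MH    Buffers"          // tag, four blanks, text
            set tag=$extract(line,1,2)        // columns 1-2
            set text=$extract(line,7,1023)    // column 7 to end of line
            write tag,!
            write text,!
      ```

      This writes MH and then Buffers. An end position beyond the length of the string simply returns everything up to the end of the line.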
      

      Here are some programs to process the modified osu-medline file with the goal of reducing the data set to a set of word stems:

      
      #!/usr/bin/mumps
      #+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
      #+
      #+     Mumps Information Storage and Retrieval Software Library
      #+     Copyright (C) 2006 by Kevin C. O'Kane
      #+
      #+     Kevin C. O'Kane
      #+     okane@cs.uni.edu
      #+
      #+
      #+ This program is free software; you can redistribute it and/or modify
      #+ it under the terms of the GNU General Public License as published by
      #+ the Free Software Foundation; either version 2 of the License, or
      #+ (at your option) any later version.
      #+
      #+ This program is distributed in the hope that it will be useful,
      #+ but WITHOUT ANY WARRANTY; without even the implied warranty of
      #+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
      #+ GNU General Public License for more details.
      #+
      #+ You should have received a copy of the GNU General Public License
      #+ along with this program; if not, write to the Free Software
      #+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
      #+
      #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
      
      # reformat.mps January 15, 2008
      
            open 1:"osu-medline,old"
            if '$test write "osu-medline file not found",! halt
      
            set D=0  // document counter
      
            for  do
      #     . if D>10000 break  // for testing purposes - limits number of articles to be read
            . use 1 set off=$ztell read line  // note position in file ($ztell) then read a line
            . if '$test break // no more input
      
      # If this is a title line, increment the document counter and record the file position of the title line.
      
            . if $extract(line,1,2)="TI" set D=D+1,^doc(D)=off use 5 write "xxxTIxxx ",$extract(line,7,1023),! quit
      
      # If this is a MeSH heading (MH), recode with an xxxMHxxx marker followed by the actual MeSH heading
      
            . if $extract(line,1,2)="MH" use 5 write "xxxMHxxx ",$extract(line,7,1023),! quit
      
      # recode STAT- MEDLINE to xxxSTATMEDLINExxx
      
            . if $extract(line,1,13)="STAT- MEDLINE" use 5 write "xxxSTATMEDLINExxx ",! quit
            . if $extract(line,1,2)'="AB" quit
      
      # For abstracts (AB), extract all lines of abstract and write them as one long line.
      # Abstract ends on end-of-file or empty line
      
            . use 5 write "xxxABxxx ",$extract(line,7,1023)," "
            . for  do  // for each line of the abstract
            .. use 1 read line
            .. if '$test break // no more input
            .. if line="" break
            .. set line=$extract(line,7,255)
            .. use 5 write line," "
            . use 5 write ! // line after abstract
      
      yields (note: long lines from the abstracts will appear wrapped in the following):
      

      xxxSTATMEDLINExxx
      xxxMHxxx Acetaldehyde/*ME
      xxxMHxxx Buffers
      xxxMHxxx Catalysis
      xxxMHxxx HEPES/PD
      xxxMHxxx Nuclear Magnetic Resonance
      xxxMHxxx Phosphates/*PD
      xxxMHxxx Protein Binding
      xxxMHxxx Ribonuclease, Pancreatic/AI/*ME
      xxxMHxxx Support, U.S. Gov't, Non-P.H.S.
      xxxMHxxx Support, U.S. Gov't, P.H.S.
      xxxTIxxx The binding of acetaldehyde to the active site of ribonuclease: alterations in catalytic activity and effects of phosphate.
      xxxABxxx Ribonuclease A was reacted with [1-13C,1,2-14C]acetaldehyde and sodium cyanoborohydride in the presence or absence of 0.2 M phosphate. After several hours of incubation at 4 degrees C (pH 7.4) stable acetaldehyde-RNase adducts were formed, and the extent of their formation was similar regardless of the presence of phosphate. Although the total amount of covalent binding was comparable in the absence or presence of phosphate, this active site ligand prevented the inhibition of enzymatic activity seen in its absence. This protective action of phosphate diminished with progressive ethylation of RNase, indicating that the reversible association of phosphate with the active site lysyl residue was overcome by the irreversible process of reductive ethylation. Modified RNase was analysed using 13C proton decoupled NMR spectroscopy. Peaks arising from the covalent binding of enriched acetaldehyde to free amino groups in the absence of phosphate were as follows: NH2-terminal alpha amino group, 47.3 ppm; bulk ethylation at epsilon amino groups of nonessential lysyl residues, 43.0 ppm; and the epsilon amino group of lysine-41 at the active site, 47.4 ppm. In the spectrum of RNase ethylated in the presence of phosphate, the peak at 47.4 ppm was absent. When RNase was selectively premethylated in the presence of phosphate, to block all but the active site lysyl residues and then ethylated in its absence, the signal at 43.0 ppm was greatly diminished, and that arising from the active site lysyl residue at 47.4 ppm was enhanced. These results indicate that phosphate specifically protected the active site lysine from reaction with acetaldehyde, and that modification of this lysine by acetaldehyde adduct formation resulted in inhibition of catalytic activity.
      xxxSTATMEDLINExxx
      xxxMHxxx Adult
      xxxMHxxx Alcohol, Ethyl/*AN
      xxxMHxxx Breath Tests/*
      xxxMHxxx Human
      xxxMHxxx Irrigation
      xxxMHxxx Male
      xxxMHxxx Middle Age
      xxxMHxxx Mouth/*
      xxxMHxxx Temperature
      xxxMHxxx Water
      xxxTIxxx Reductions in breath ethanol readings in normal male volunteers following mouth rinsing with water at differing temperatures.
      xxxABxxx Blood ethanol concentrations were measured sequentially, over a period of hours, using a Lion AE-D2 alcolmeter, in 12 healthy male subjects given oral ethanol 0.5 g/kg body wt. Readings were taken before and after rinsing the mouth with water at varying temperatures. Mouth rinsing resulted in a reduction in the alcolmeter readings at all water temperatures tested. The magnitude of the reduction was greater after rinsing with water at lower temperatures. This effect occurs because rinsing cools the mouth and dilutes retained saliva. This finding should be taken into account whenever breath analysis is used to estimate blood ethanol concentrations in experimental situations.
      xxxSTATMEDLINExxx
      xxxMHxxx Alcoholism/*PP
      xxxMHxxx Animal
      xxxMHxxx Diprenorphine/PD
      xxxMHxxx Female
      xxxMHxxx Morphine/*PD
      xxxMHxxx Naloxone/PD
      xxxMHxxx Naltrexone/PD
      xxxMHxxx Narcotic Antagonists/*PD
      xxxMHxxx Rats
      xxxMHxxx Rats, Inbred Strains
      xxxMHxxx Receptors, Endorphin/*DE/PH
      xxxMHxxx Seizures/PP
      xxxMHxxx Substance Withdrawal Syndrome/PP
      xxxTIxxx Does the blockade of opioid receptors influence the development of ethanol dependence?
      xxxABxxx We have tested whether the opioid antagonists naloxone (2 mg/kg), naltrexone (2 mg/kg) and diprenorphine (0.2 mg/kg), and the agonist morphine (4-8 mg/kg) given subcutaneously (10 min before ethanol for 7 days) modify the ethanol withdrawal syndrome (audiogenic seizures) following chronic ethanol intoxication in rats. We found that naloxone, naltrexone and diprenorphine modified the ethanol withdrawal syndrome. These findings do not rule out the possibility of a biochemical link between the action of ethanol and opiates at the level of opioid receptors.
      xxxSTATMEDLINExxx
      xxxMHxxx Adult
      xxxMHxxx Alcohol Drinking/*PH
      xxxMHxxx Alcoholism/*BL/CO
      xxxMHxxx Erythrocyte Indices/*
      xxxMHxxx Female
      xxxMHxxx Follow-Up Studies
      xxxMHxxx Gamma-Glutamyltransferase/*BL
      xxxMHxxx Hepatomegaly/ET
      xxxMHxxx Human
      xxxMHxxx Male
      xxxMHxxx Middle Age
      xxxMHxxx Predictive Value of Tests
      xxxMHxxx Sex Factors
      xxxTIxxx Drinkwatchers--description of subjects and evaluation of laboratory markers of heavy drinking.
      xxxABxxx Clinical examination and measurement of MCV and GGT were carried out on 124 self-referred 'healthy' Drinkwatchers, all of whom had consumed at least 80 g alcohol/day for more than 2 years. The majority (66.1%) were in social classes II and III. Sixty-three subjects (54.1%) had a raised MCV, GGT or hepatomegaly. A raised MCV was significantly more likely to occur in men. Forty-five subjects (36.3%) had an enlarged liver of whom 17 had a normal MCV and GGT. This study shows that MCV and GGT are poor screening tests for excessive alcohol consumption in 'healthy' subjects but, if used at all, MCV appears to be more sensitive in women and GGT in men. Neither test is an adequate substitute for a careful history and full clinical examination.
      xxxSTATMEDLINExxx
      xxxMHxxx Adult
      xxxMHxxx Alcoholism/*BL
      xxxMHxxx Blood Platelets/*ME
      xxxMHxxx Erythrocyte Indices
      xxxMHxxx Gamma-Glutamyltransferase/BL
      xxxMHxxx Human
      xxxMHxxx In Vitro
      xxxMHxxx Kinetics
      xxxMHxxx Middle Age
      xxxMHxxx Serotonin/*BL
      xxxMHxxx Support, Non-U.S. Gov't
      xxxTIxxx Platelet affinity for serotonin is increased in alcoholics and former alcoholics: a biological marker for dependence?
      xxxABxxx The kinetics of 3H serotonin platelet uptake were studied in alcoholics and former alcoholics to see whether differences found between alcohol-preferring and non-preferring rats could be reproduced in man. Three groups of patients were studied: 10 dependent alcoholics on admission for treatment; 10 dependent alcoholics after 20 days of treatment; 8 former dependent alcoholics, abstinent for 1-11 years. Controls were non-alcoholics, matched for age and sex. The Km for 3H serotonin uptake in platelets was lower in patients from all three groups compared to 15 controls. This phenomenon could be congenital or induced by the previous excessive intake of alcohol. We believe that this increased platelet affinity for serotonin, in the absence of cirrhosis of the liver and/or depression could be a marker for alcohol dependence, enabling the therapeutic effort to be focussed on these patients.

      
      The output of reformat.mps is piped into the following program, stems.mps,
      with the command:
      
            reformat.mps | stems.mps > revised-data
      
      #!/usr/bin/mumps
      #+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
      #+
      #+     Mumps Information Storage and Retrieval Software Library
      #+     Copyright (C) 2006, 2008 by Kevin C. O'Kane
      #+
      #+     Kevin C. O'Kane
      #+     okane@cs.uni.edu
      #+
      #+
      #+ This program is free software; you can redistribute it and/or modify
      #+ it under the terms of the GNU General Public License as published by
      #+ the Free Software Foundation; either version 2 of the License, or
      #+ (at your option) any later version.
      #+
      #+ This program is distributed in the hope that it will be useful,
      #+ but WITHOUT ANY WARRANTY; without even the implied warranty of
      #+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
      #+ GNU General Public License for more details.
      #+
      #+ You should have received a copy of the GNU General Public License
      #+ along with this program; if not, write to the Free Software
      #+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
      #+
      #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
      
      # stems.mps January 13, 2008
      
      # convert data base to word stems
      
      
      # Scan for each word, reducing all to lower case and accepting only
      # words in the 3 to 25 character length range.  Write appropriate
      # code markers and line feeds.  Reduce all other words to stems by
      # calling $zstem()
      
            for  do
            . set word=$zzScanAlnum
            . if '$test break  // end of file
            . if word="xxxstatmedlinexxx" write !!,word quit
            . if word="xxxabxxx" write !,word," " quit
            . if word="xxxmhxxx" write !,word," " quit
            . if word="xxxtixxx" write !,word," " quit
            . write $zstem(word)," "
      
      which yields a file of length-restricted, lower-case stems:
      

      xxxstatmedlinexxx
      xxxmhxxx acetaldehyde
      xxxmhxxx buffer
      xxxmhxxx catalysis
      xxxmhxxx hepe
      xxxmhxxx nuclear magnetic resonance
      xxxmhxxx phosphate
      xxxmhxxx protein bind
      xxxmhxxx ribonuclease pancreatic
      xxxmhxxx support gov non
      xxxmhxxx support gov
      xxxtixxx the bind acetaldehyde the active site ribonuclease alteration catalytic active and effect phosphate
      xxxabxxx ribonuclease was react with acetaldehyde and sodium cyanoborohydride the presence absence phosphate after severe hour incubation degree stable acetaldehyde rnase adduct were form and the extent their formation was similar regardless the presence phosphate although the total amount covalent bind was compaare the absence presence phosphate this active site ligand prevent the inhibition enzymatic active seen its absence this protect action phosphate diminish with progressive ethylate rnase indicate that the revers association phosphate with the active site lysyl residue was overcome the irrevers process reductive ethylate modify rnase was analyse using proton decouple nmr spectroscopy peak aris from the covalent bind enrich acetaldehyde free amino group the absence phosphate were follow nh2 terminal alpha amino group ppm bulk ethylate epsilon amino group nonessential lysyl residue ppm and the epsilon amino group lysine the active site ppm the spectrum rnase ethylate the presence phosphate the peak ppm was absent when rnase was selective premethylate the presence phosphate block all but the active site lysyl residue and then ethylate its absence the sign ppm was great diminish and that aris from the active site lysyl residue ppm was enhance these result indicate that phosphate specific protect the active site lysine from reaction with acetaldehyde and that modification this lysine acetaldehyde adduct formation result inhibition catalytic active

      xxxstatmedlinexxx
      xxxmhxxx adult
      xxxmhxxx alcohol ethyl
      xxxmhxxx breath test
      xxxmhxxx human
      xxxmhxxx irrigation
      xxxmhxxx male
      xxxmhxxx middle age
      xxxmhxxx mouth
      xxxmhxxx temperature
      xxxmhxxx water
      xxxtixxx reduction breath ethanol reading norm male volunteer follow mouth rins with water differ temperature
      xxxabxxx blood ethanol concentration were measure sequential over period hour using lion alcolmeter healthy male subject given oral ethanol body reading were taken before and after rins the mouth with water vary temperature mouth rins result reduct the alcolmeter reading all water temperature test the magnitude the reduct was greater after rins with water lower temperature this effect occur because rins cool the mouth and dilute retane saliva this find should taken into account whenever breath analysis used estimate blood ethanol concentration experiment situation

      xxxstatmedlinexxx
      xxxmhxxx alcoholism
      xxxmhxxx anim
      xxxmhxxx diprenorphine
      xxxmhxxx female
      xxxmhxxx morphine
      xxxmhxxx naloxone
      xxxmhxxx naltrexone
      xxxmhxxx narcotic antagonist
      xxxmhxxx rats
      xxxmhxxx rats inbr strain
      xxxmhxxx receptor endorphin
      xxxmhxxx seizure
      xxxmhxxx substance withdraw syndrome
      xxxtixxx does the blockade opioid receptor influence the development ethanol dependence
      xxxabxxx have test whether the opioid antagonist naloxone naltrexone and diprenorphine and the agonist morphine given subcutaneous min before ethanol for days modify the ethanol withdraw syndrome audiogenic seizure follow chronic ethanol intoxication rats found that naloxone naltrexone and diprenorphine modify the ethanol withdraw syndrome these finding not rule out the poss biochemical link between the action ethanol and opiate the level opioid receptor

      xxxstatmedlinexxx
      xxxmhxxx adult
      xxxmhxxx alcohol drink
      xxxmhxxx alcoholism
      xxxmhxxx erythrocyte indice
      xxxmhxxx female
      xxxmhxxx follow study
      xxxmhxxx gamma glutamyltransferase
      xxxmhxxx hepatomega
      xxxmhxxx human
      xxxmhxxx male
      xxxmhxxx middle age
      xxxmhxxx predictive value test
      xxxmhxxx sex factor
      xxxtixxx drinkwatcher description subject and evaluation laboratory marker heavy drink
      xxxabxxx clinical examination and measure mcv and ggt were carry out self refer healthy drinkwatcher all whom had consume least alcohol day for more than year the majore were soci class and iii sixty three subject had rais mcv ggt hepatomega rais mcv was significant more like occur men forty five subject had enlarge liver whom had norm mcv and ggt this study show that mcv and ggt are poor screene test for excessive alcohol consumption healthy subject but used all mcv appear more sensitive women and ggt men neither test adequate substitute for careful history and full clinical examination

      xxxstatmedlinexxx
      xxxmhxxx adult
      xxxmhxxx alcoholism
      xxxmhxxx blood platelet
      xxxmhxxx erythrocyte indice
      xxxmhxxx gamma glutamyltransferase
      xxxmhxxx human
      xxxmhxxx vitro
      xxxmhxxx kinetic
      xxxmhxxx middle age
      xxxmhxxx serotonin
      xxxmhxxx support non gov
      xxxtixxx platelet affinity for serotonin increased alcoholic and former alcoholic biological marker for dependence
      xxxabxxx the kinetic serotonin platelet uptake were study alcoholic and former alcoholic see whether difference found between alcohol preferre and non preferre rats could reproduce man three group patient were study dependent alcoholic admiss for treatment dependent alcoholic after days treatment former dependent alcoholic abstinent for year control were non alcoholic match for age and sex the for serotonin uptake platelet was lower patient from all three group compare control this phenomenon could congenital induce the previous excessive intake alcohol believe that this increased platelet affinity for serotonin the absence cirrhosis the liver and depress could marker for alcohol dependence enable the therapeutic effort focuss these patient
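
      For readers who want to see the control flow of stems.mps outside Mumps, the following is a minimal Python sketch. It is not part of the toolkit. The function names are hypothetical, the length filter is assumed to happen inside the scanning step, and toy_stem() is only a placeholder that strips a final "s"; it does not reproduce the $zstem() algorithm.

```python
import re

MARKERS = {"xxxabxxx", "xxxmhxxx", "xxxtixxx"}
RECORD_START = "xxxstatmedlinexxx"


def toy_stem(word):
    """Placeholder stemmer (strips a final 's'); NOT the $zstem() algorithm."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def scan_words(text):
    """Yield lower-case alphanumeric words of 3 to 25 characters."""
    for word in re.findall(r"[A-Za-z0-9]+", text):
        if 3 <= len(word) <= 25:
            yield word.lower()


def to_stems(text):
    """Pass marker words through unchanged and stem everything else."""
    out = []
    for word in scan_words(text):
        if word == RECORD_START:
            out.append("\n\n" + word)        # blank line before each record
        elif word in MARKERS:
            out.append("\n" + word + " ")    # each field starts a new line
        else:
            out.append(toy_stem(word) + " ")
    return "".join(out)


result = to_stems(
    "xxxstatmedlinexxx xxxmhxxx Buffers xxxtixxx The binding of acetaldehyde"
)
print(result)
```

      The two-letter word "of" is dropped by the length filter, and the marker words are never passed to the stemmer.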


    2. Write a program to read Medline format abstracts (from the modified TREC-9 data base described above) and write out the list of MESH headings for each abstract along with the byte offset of the beginning of the abstract. Lines containing MeSH headings have the code MH in positions 1 and 2. A blank line signals the end of each abstract. Each offset is the byte position, relative to the start of the file, at which the entry for the abstract and its related material begins; this number can be used later to retrieve the entry. Place # between the heading text and the offset. The character # does not occur in any MeSH heading, so it can serve as a separator when the file is subsequently read.

      #!/usr/bin/mumps
      # readmedline.mps January 15, 2008
            open 1:"osu-medline,old"
            use 1
            set i=$ztell // return the integer offset in the file
            for  do
            . use 1
            . read a
            . if '$test break
            . if a="" set i=$ztell quit  // return the offset in the file
            . if $extract(a,1,3)="MH " do
            .. use 5
            .. set a=$piece($extract(a,7,255),"/",1)
            .. write a,"#",i,!
      
      

      A portion of the output from which is:

      Acetaldehyde#0
      Buffers#0
      Catalysis#0
      HEPES#0
      Nuclear Magnetic Resonance#0
      Phosphates#0
      Protein Binding#0
      Ribonuclease, Pancreatic#0
      Support, U.S. Gov't, Non-P.H.S.#0
      Support, U.S. Gov't, P.H.S.#0
      Adult#2401
      Alcohol, Ethyl#2401
      Breath Tests#2401
      Human#2401
      Irrigation#2401
      Male#2401
      Middle Age#2401
      Mouth#2401
      Temperature#2401
      Water#2401
      Alcoholism#3479
      Animal#3479
      Diprenorphine#3479
      Female#3479
      Morphine#3479
      Naloxone#3479
      Naltrexone#3479
      Narcotic Antagonists#3479
      Rats#3479
      Rats, Inbred Strains#3479
      Receptors, Endorphin#3479
      Seizures#3479
      Substance Withdrawal Syndrome#3479
      Adult#4510
      Alcohol Drinking#4510
      Alcoholism#4510
      Erythrocyte Indices#4510
      

      Notes:

      • The function $ztell gives the current byte offset in the file being read. If your version of Mumps was generated with the --with-file64 option, this number can exceed 2 GB. On systems generated without the --with-file64 option, this value, and consequently the largest file size, is limited to 2 GB.

      • Some MeSH headings have extraneous text following the main heading, separated from it by a "/" character. Only the text prior to any "/" character is extracted. The $piece() function returns either the entire string or, if "/" is present, just the part of the string prior to the first "/".
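
      The offset bookkeeping can also be sketched in Python for comparison. This is an illustration only and not part of the toolkit: the two-record sample and the helper name mesh_offsets() are assumptions, and the sketch assumes, as the Mumps program does, that the heading text starts in column 7 of each MH line.

```python
import io

# Hypothetical two-record sample: "MH" in columns 1-2, heading text from
# column 7, and a blank line after each record.
SAMPLE = (
    b"STAT- MEDLINE\n"
    b"MH  - Acetaldehyde/*ME\n"
    b"MH  - Buffers\n"
    b"\n"
    b"STAT- MEDLINE\n"
    b"MH  - Adult\n"
    b"\n"
)


def mesh_offsets(stream):
    """Return 'heading#offset' strings; offset is where the record starts."""
    out = []
    start = stream.tell()                  # like: set i=$ztell
    for raw in iter(stream.readline, b""):
        line = raw.decode("ascii").rstrip("\n")
        if line == "":                     # blank line: next record starts here
            start = stream.tell()
            continue
        if line[:3] == "MH ":
            heading = line[6:].split("/", 1)[0]   # text before any "/"
            out.append(f"{heading}#{start}")
    return out


stream = io.BytesIO(SAMPLE)
lines = mesh_offsets(stream)
print("\n".join(lines))

# The stored offset is enough to jump straight back to a record.
offset = int(lines[-1].split("#")[1])
stream.seek(offset)
first_line = stream.readline().decode("ascii").rstrip("\n")
print(first_line)
```

      Seeking to a stored offset and reading from there is how the number after the # would later be used to retrieve an entry.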

    3. Sort the above and print, for each MESH heading, a count of the number of abstracts it occurs in using the command:

      readmedline.mps | sortmedline.mps > words

      #!/usr/bin/mumps
      # sortmedline.mps January 15, 2008
            kill ^MH
            for  do
            . read a
            . if '$test break
            . set b=$piece(a,"#")
            . if $data(^MH(b)) set ^MH(b)=^MH(b)+1
            . else  set ^MH(b)=1
            set x=""
            for  do
            . set x=$order(^MH(x))
            . if x="" break
            . write x," -> ",^MH(x),!
      
      

      The output of which looks like:

      Accreditation -> 20
      Acculturation -> 2
      Acebutolol -> 5
      Acenocoumarol -> 1
      Acetabulum -> 29
      Acetaldehyde -> 22
      Acetals -> 1
      Acetamides -> 3
      Acetaminophen -> 36
      Acetates -> 15
      Acetazolamide -> 8
      Acetic Acid Esters -> 1
      Acetic Acids -> 13
      Acetoacetates -> 2
      Acetone -> 3
      Acetophenones -> 3
      Acetoxyacetylaminofluorene -> 1
      Acetyl CoA Acetyltransferase -> 2
      Acetyl CoA Acyltransferase -> 1
      Acetyl CoA Carboxylase -> 2
      Acetyl Coenzyme A -> 2
      Acetylation -> 8
      Acetylcholine -> 87
      Acetylcholinesterase -> 22
      Acetylcysteine -> 8
      Acetylgalactosamine -> 2
      Acetylglucosamine -> 2
      Acetylglucosaminidase -> 10
      Acetylprocainamide -> 1
      Acetyltransferases -> 9
      Achievement -> 8
      Achilles Tendon -> 12
      Achlorhydria -> 4
      Acholeplasma laidlawii -> 1
      Achondroplasia -> 6
      Acid Etching, Dental -> 2
      Acid Phosphatase -> 18
      Acid-Base Equilibrium -> 26
      Acid-Base Imbalance -> 5
      Acidosis -> 34
      Acidosis, Lactic -> 19
      Acidosis, Renal Tubular -> 8
      Acidosis, Respiratory -> 7
      Acids -> 5
      Acinetobacter -> 2
      Acinetobacter Infections -> 3
      
      Notes:

      • The line of code:

        kill ^MH

        deletes any instances of the ^MH global array. Similar forms can be used to delete sub-trees. See the manual.

      • The program reads from standard input and either creates a new instance of ^MH indexed by the MeSH heading word or increments an existing one. The offset portion of each input line is not used.

      • When the input is exhausted, the program prints the MeSH headings and the number of times they occurred. Since global arrays are stored in collating sequence (ASCII) order, the results are printed in collating sequence order.
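
      The same counting step can be sketched in Python, shown here only for comparison. The function name count_headings() and the four sample lines are assumptions. An ordinary dictionary has no built-in key ordering, so the sort that the global array provides implicitly has to be requested explicitly.

```python
def count_headings(lines):
    """Count how often each heading occurs; the '#offset' part is ignored."""
    counts = {}
    for line in lines:
        heading = line.split("#", 1)[0]          # like $piece(a,"#")
        counts[heading] = counts.get(heading, 0) + 1
    return counts


# Hypothetical input in the "heading#offset" form written by readmedline.mps.
sample = ["Adult#2401", "Human#2401", "Alcoholism#3479", "Adult#4510"]
counts = count_headings(sample)

# Sort explicitly to imitate the collating-sequence traversal of $order().
report = [f"{h} -> {counts[h]}" for h in sorted(counts)]
print("\n".join(report))
```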

    4. Write a program to print all the headings in MESH code order. Assume the "^mesh()" global array created in the example above. In that example, the keys were organized like this:

      ^mesh("A01")                                     
      ^mesh("A01","047")                              
      ^mesh("A01","047","025")                       
      ^mesh("A01","047","025","600")                
      ^mesh("A01","047","025","600","225")         
      ^mesh("A01","047","025","600","451")        
      ^mesh("A01","047","025","600","451","535")      
      ^mesh("A01","047","025","600","573")           
      ^mesh("A01","047","025","600","678")          
      ^mesh("A01","047","025","750")               
      ^mesh("A01","047","050")                    
      ^mesh("A01","047","365")                   
      ^mesh("A01","047","412")                  
      ^mesh("A01","047","849")                 
      ^mesh("A01","176")                      
      

      with the text of the MeSH code stored at each node.

      This is also the order in which they are stored in the file system. The Mumps function $query() can be used to dump the file system in sequential key order. You pass to $query() a string containing a global array reference (with embedded quotes around string indices), and it returns the next array reference in the file system. Eventually you will run out of "^mesh" references and receive either an empty string (indicating that you are at the end of the global array system) or the first reference of another global array whose name is alphabetically higher than "^mesh". Consequently, after each call you must test whether (1) you received the empty string or (2) the name of the array has changed.

      #!/usr/bin/mumps
      # meshheadings.mps January 15, 2008
            set x="^mesh"  // build the first index
            for  do
            . set x=$query(x) // get next array reference
            . if x="" break
            . if $piece(x,"(",1)'="^mesh" break
            . write x,?50,@x,!
      

      The output of which looks like:

      ^mesh("A01")                                     Body Regions
      ^mesh("A01","047")                               Abdomen
      ^mesh("A01","047","025")                         Abdominal Cavity
      ^mesh("A01","047","025","600")                   Peritoneum
      ^mesh("A01","047","025","600","225")             Douglas' Pouch
      ^mesh("A01","047","025","600","451")             Mesentery
      ^mesh("A01","047","025","600","451","535")       Mesocolon
      ^mesh("A01","047","025","600","573")             Omentum
      ^mesh("A01","047","025","600","678")             Peritoneal Cavity
      ^mesh("A01","047","025","750")                   Retroperitoneal Space
      ^mesh("A01","047","050")                         Abdominal Wall
      ^mesh("A01","047","365")                         Groin
      ^mesh("A01","047","412")                         Inguinal Canal
      ^mesh("A01","047","849")                         Umbilicus
      ^mesh("A01","176")                               Back
      ^mesh("A01","176","519")                         Lumbosacral Region
      ^mesh("A01","176","780")                         Sacrococcygeal Region
      ^mesh("A01","236")                               Breast
      ^mesh("A01","236","500")                         Nipples
      ^mesh("A01","378")                               Extremities
      ^mesh("A01","378","100")                         Amputation Stumps
      ^mesh("A01","378","610")                         Lower Extremity
      ^mesh("A01","378","610","100")                   Buttocks
      ^mesh("A01","378","610","250")                   Foot
      ^mesh("A01","378","610","250","149")             Ankle
      ^mesh("A01","378","610","250","300")             Forefoot, Human
      ^mesh("A01","378","610","250","300","480")       Metatarsus
      ^mesh("A01","378","610","250","300","792")       Toes
      ^mesh("A01","378","610","250","300","792","380") Hallux
      ^mesh("A01","378","610","250","510")             Heel
      ^mesh("A01","378","610","400")                   Hip
      ^mesh("A01","378","610","450")                   Knee
      ^mesh("A01","378","610","500")                   Leg
      ^mesh("A01","378","610","750")                   Thigh
      ^mesh("A01","378","800")                         Upper Extremity
      ^mesh("A01","378","800","075")                   Arm
      ^mesh("A01","378","800","090")                   Axilla
      ^mesh("A01","378","800","420")                   Elbow
      ^mesh("A01","378","800","585")                   Forearm
      ^mesh("A01","378","800","667")                   Hand
      ^mesh("A01","378","800","667","430")             Fingers
      ^mesh("A01","378","800","667","430","705")       Thumb
      ^mesh("A01","378","800","667","715")             Wrist
      ^mesh("A01","378","800","750")                   Shoulder
      

      Notes:

      • The function $query() returns a string containing the next global array reference with indices enclosed in quotes.

      • The line:

              . write x,?50,@x,!
        

        displays the reference (x) and then prints the contents of the node x (@x).
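
      A rough Python analogue of this traversal is sketched below, assuming the tree is held in a dictionary keyed by tuples of subscripts. The five-node fragment and the helper names format_ref() and walk() are assumptions made for illustration. Sorting the tuples gives the same parent-before-children order that $query() follows.

```python
# Hypothetical fragment of the MeSH tree, keyed by tuples of subscripts.
mesh = {
    ("A01",): "Body Regions",
    ("A01", "047"): "Abdomen",
    ("A01", "047", "025"): "Abdominal Cavity",
    ("A01", "176"): "Back",
    ("A01", "047", "050"): "Abdominal Wall",
}


def format_ref(name, key):
    """Render a key tuple as a reference string such as ^mesh("A01","047")."""
    return "^" + name + "(" + ",".join(f'"{k}"' for k in key) + ")"


def walk(tree, name="mesh"):
    """Yield (reference, value) pairs in sorted key order."""
    for key in sorted(tree):
        yield format_ref(name, key), tree[key]


rows = list(walk(mesh))
for ref, value in rows:
    print(f"{ref:<50}{value}")      # pad the reference, like ?50 in Mumps
```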

    5. Write a program that will, when given a keyword, locate all the MESH headings containing the keyword and display the full heading, hierarchy codes, and adjacent keywords at this level. In effect, this program gives you all the more specific terms related to a higher level, more general term. It locates all instances of the term typed.

      #!/usr/bin/mumps
      # findmesh.mps January 15, 2008
            read "enter keyword: ",key
            write !
            set x="^mesh"  // build a global array ref
            set x=$query(x)
            if x="" halt
            for  do
            . if '$find(@x,key) set x=$query(x) // is key stored at this ref?
            . else  do
            .. set i=$qlength(x)  // number of subscripts
            .. write x," ",@x,!
            .. for  do
            ... set x=$query(x)  
            ... if x="" halt
            ... if $piece(x,"(",1)'="^mesh" break
            ... if $qlength(x)'>i break
            ... write ?5,x," ",@x,! 
            . if x="" halt
            . if $piece(x,"(",1)'="^mesh" halt  // ran past the end of ^mesh
      
      
      which yields when given Skeleton as the input:
      
      enter keyword: Skeleton
      ^mesh("A02","835") Skeleton
          ^mesh("A02","835","232") Bone and Bones
          ^mesh("A02","835","232","087") Bones of Upper Extremity
          ^mesh("A02","835","232","087","144") Carpal Bones
          ^mesh("A02","835","232","087","144","650") Scaphoid Bone
          ^mesh("A02","835","232","087","144","663") Semilunar Bone
          ^mesh("A02","835","232","087","227") Clavicle
          ^mesh("A02","835","232","087","412") Humerus
          ^mesh("A02","835","232","087","535") Metacarpus
          ^mesh("A02","835","232","087","702") Radius
          ^mesh("A02","835","232","087","783") Scapula
          ^mesh("A02","835","232","087","783","261") Acromion
          ^mesh("A02","835","232","087","911") Ulna
          ^mesh("A02","835","232","169") Diaphyses
          ^mesh("A02","835","232","251") Epiphyses
          ^mesh("A02","835","232","251","352") Growth Plate
          ^mesh("A02","835","232","300") Foot Bones
          ^mesh("A02","835","232","300","492") Metatarsal Bones
          ^mesh("A02","835","232","300","710") Tarsal Bones
          ^mesh("A02","835","232","300","710","300") Calcaneus
          ^mesh("A02","835","232","300","710","780") Talus
          ^mesh("A02","835","232","409") Hyoid Bone
          ^mesh("A02","835","232","500") Leg Bones
          ^mesh("A02","835","232","500","247") Femur
          ^mesh("A02","835","232","500","247","343") Femur Head
          ^mesh("A02","835","232","500","247","510") Femur Neck
          ^mesh("A02","835","232","500","321") Fibula
          ^mesh("A02","835","232","500","624") Patella
          ^mesh("A02","835","232","500","883") Tibia
          ^mesh("A02","835","232","611") Pelvic Bones
          ^mesh("A02","835","232","611","108") Acetabulum
          ^mesh("A02","835","232","611","434") Ilium
          ^mesh("A02","835","232","611","548") Ischium
          ^mesh("A02","835","232","611","781") Pubic Bone
          ^mesh("A02","835","232","730") Sesamoid Bones
          ^mesh("A02","835","232","781") Skull
          ^mesh("A02","835","232","781","200") Cranial Sutures
          ^mesh("A02","835","232","781","292") Ethmoid Bone
          ^mesh("A02","835","232","781","324") Facial Bones
          ^mesh("A02","835","232","781","324","502") Jaw
          ^mesh("A02","835","232","781","324","502","125") Alveolar Process
          ^mesh("A02","835","232","781","324","502","125","800") Tooth Socket
          ^mesh("A02","835","232","781","324","502","320") Dental Arch
          ^mesh("A02","835","232","781","324","502","632") Mandible
          ^mesh("A02","835","232","781","324","502","632","130") Chin
          ^mesh("A02","835","232","781","324","502","632","600") Mandibular Condyle
          ^mesh("A02","835","232","781","324","502","645") Maxilla
          ^mesh("A02","835","232","781","324","502","660") Palate, Hard
          ^mesh("A02","835","232","781","324","665") Nasal Bone
          ^mesh("A02","835","232","781","324","690") Orbit
          ^mesh("A02","835","232","781","324","948") Turbinates
          ^mesh("A02","835","232","781","324","995") Zygoma
          ^mesh("A02","835","232","781","375") Frontal Bone
          ^mesh("A02","835","232","781","572") Occipital Bone
          ^mesh("A02","835","232","781","572","434") Foramen Magnum
          ^mesh("A02","835","232","781","651") Parietal Bone
          ^mesh("A02","835","232","781","750") Skull Base
          ^mesh("A02","835","232","781","750","150") Cranial Fossa, Anterior
          ^mesh("A02","835","232","781","750","165") Cranial Fossa, Middle
          ^mesh("A02","835","232","781","750","400") Cranial Fossa, Posterior
          ^mesh("A02","835","232","781","802") Sphenoid Bone
          ^mesh("A02","835","232","781","802","662") Sella Turcica
          ^mesh("A02","835","232","781","885") Temporal Bone
          ^mesh("A02","835","232","781","885","444") Mastoid
          ^mesh("A02","835","232","781","885","681") Petrous Bone
          ^mesh("A02","835","232","834") Spine
          ^mesh("A02","835","232","834","151") Cervical Vertebrae
          ^mesh("A02","835","232","834","151","213") Atlas
          ^mesh("A02","835","232","834","151","383") Axis
          ^mesh("A02","835","232","834","151","383","668") Odontoid Process
          ^mesh("A02","835","232","834","229") Coccyx
          ^mesh("A02","835","232","834","432") Intervertebral Disk
          ^mesh("A02","835","232","834","519") Lumbar Vertebrae
          ^mesh("A02","835","232","834","717") Sacrum
          ^mesh("A02","835","232","834","803") Spinal Canal
          ^mesh("A02","835","232","834","803","350") Epidural Space
          ^mesh("A02","835","232","834","892") Thoracic Vertebrae
          ^mesh("A02","835","232","904") Thorax
          ^mesh("A02","835","232","904","567") Ribs
          ^mesh("A02","835","232","904","766") Sternum
          ^mesh("A02","835","232","904","766","442") Manubrium
          ^mesh("A02","835","232","904","766","825") Xiphoid Bone
          ^mesh("A02","835","583") Joints
          ^mesh("A02","835","583","032") Acromioclavicular Joint
          ^mesh("A02","835","583","097") Atlanto-Axial Joint
          ^mesh("A02","835","583","101") Atlanto-Occipital Joint
          ^mesh("A02","835","583","156") Bursa, Synovial
          ^mesh("A02","835","583","192") Cartilage, Articular
          ^mesh("A02","835","583","290") Elbow Joint
          ^mesh("A02","835","583","345") Finger Joint
          ^mesh("A02","835","583","345","512") Metacarpophalangeal Joint
          ^mesh("A02","835","583","378") Foot Joints
          ^mesh("A02","835","583","378","062") Ankle Joint
          ^mesh("A02","835","583","378","531") Metatarsophalangeal Joint
          ^mesh("A02","835","583","378","831") Tarsal Joints
          ^mesh("A02","835","583","378","831","780") Subtalar Joint
          ^mesh("A02","835","583","378","900") Toe Joint
          ^mesh("A02","835","583","411") Hip Joint
          ^mesh("A02","835","583","443") Joint Capsule
          ^mesh("A02","835","583","443","800") Synovial Membrane
          ^mesh("A02","835","583","443","800","800") Synovial Fluid
          ^mesh("A02","835","583","475") Knee Joint
          ^mesh("A02","835","583","475","590") Menisci, Tibial
          ^mesh("A02","835","583","512") Ligaments, Articular
          ^mesh("A02","835","583","512","100") Anterior Cruciate Ligament
          ^mesh("A02","835","583","512","162") Collateral Ligaments
          ^mesh("A02","835","583","512","162","500") Lateral Ligament, Ankle
          ^mesh("A02","835","583","512","162","600") Medial Collateral Ligament, Knee
          ^mesh("A02","835","583","512","287") Ligamentum Flavum
          ^mesh("A02","835","583","512","350") Longitudinal Ligaments
          ^mesh("A02","835","583","512","475") Patellar Ligament
          ^mesh("A02","835","583","512","600") Posterior Cruciate Ligament
          ^mesh("A02","835","583","656") Pubic Symphysis
          ^mesh("A02","835","583","707") Sacroiliac Joint
          ^mesh("A02","835","583","748") Shoulder Joint
          ^mesh("A02","835","583","781") Sternoclavicular Joint
          ^mesh("A02","835","583","790") Sternocostal Joints
          ^mesh("A02","835","583","861") Temporomandibular Joint
          ^mesh("A02","835","583","861","900") Temporomandibular Joint Disk
          ^mesh("A02","835","583","959") Wrist Joint
          ^mesh("A02","835","583","979") Zygapophyseal Joint
      ^mesh("A11","284","295","154","200") Cell Wall Skeleton
      ^mesh("D12","776","097","162") Cell Wall Skeleton
      ^mesh("D12","776","395","560","186") Cell Wall Skeleton
      ^mesh("E01","370","350","700","050") Age Determination by Skeleton
      

      Notes:

      • Read in the keyword. Build a string containing a global array reference that consists of the array name alone, ^mesh, and use $query() to locate the first reference in this array.

      • In a loop, get the text value stored at the global array reference and check to see if it contains the keyword that was typed. This is done by the line reading:

        . if '$find(@x,key) set x=$query(x) // is key stored at this ref?
        

        which uses indirection to get the text value stored (@x evaluates to the contents of the global array reference in x). The $find() function searches the text for the input key as a substring and returns zero if the key is not found.

      • If $find() does not find the key, the next global array reference is obtained with $query(). At the bottom of the loop the value returned by $query() is checked: if it is empty the program halts; otherwise the loop continues with the next reference in the array "^mesh".

      • If the keyword is found in the text, print the reference and then scan forward, printing the references that follow for as long as their number of subscripts is greater than that of the found reference (these form the subtree of the found reference). The function $qlength() returns the number of subscripts in a reference.
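
      The search logic described in these notes can be sketched in Python as follows. This is an illustration under stated assumptions, not the toolkit's code: the tree is a dictionary keyed by subscript tuples, the sample entries are abbreviated and hypothetical, and find_subtrees() is an invented name. Like the Mumps program, it treats every following key with more subscripts as part of the subtree, which relies on each node's parent being present in the tree.

```python
def find_subtrees(tree, keyword):
    """Return (indent, key, text) rows: each match followed by its subtree."""
    keys = sorted(tree)
    rows = []
    i = 0
    while i < len(keys):
        key = keys[i]
        if keyword not in tree[key]:
            i += 1
            continue
        rows.append((0, key, tree[key]))      # the matching heading
        depth = len(key)                      # like $qlength(x)
        i += 1
        # Following keys with more subscripts lie beneath the match.
        while i < len(keys) and len(keys[i]) > depth:
            rows.append((1, keys[i], tree[keys[i]]))
            i += 1
    return rows


# Hypothetical, abbreviated tree used only to exercise the function.
mesh = {
    ("A02",): "Musculoskeletal System",
    ("A02", "835"): "Skeleton",
    ("A02", "835", "232"): "Bone and Bones",
    ("A02", "835", "583"): "Joints",
    ("A11",): "Cells",
    ("A11", "284"): "Cell Wall Skeleton",
}

rows = find_subtrees(mesh, "Skeleton")
for indent, key, text in rows:
    print("    " * indent + ",".join(key), text)
```

      The match is case sensitive, as it is with $find(), so "Musculoskeletal System" is not reported for the keyword Skeleton.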

    6. Make the previous program run as a web server information storage and retrieval application.

      The following instructions are for Cygwin. Slightly different instructions pertain if you are using Linux. Also, if you are using Linux, there will be file access protection issues that are not present under Cygwin.

      First, start Cygwin then start the Apache web server with the command:

      /usr/sbin/httpd

      You may see a warning message about your server's fully qualified name. This may be safely ignored. Now move the following HTML file (we'll call it isr.html) to /var/www/htdocs and make it world readable (chmod a+r isr.html).

      HTML file isr.html:
      <html>
      <head>
      <title> Example server side Mumps Program</title>
      </head>
      <body bgcolor=silver>
      Enter a MeSH term: &nbsp;
      <form method="get" action=cgi-bin/isr.mps>
      <input type=text size=30 name=key value="Head"> &nbsp;
      <input type=submit>
      </form>
      </body>
      </html>

      Next move the following file to /var/www/cgi-bin and make it world readable and executable.

      #!/usr/bin/mumps
      # isr.mps January 15, 2008
            html Content-type: text/html &!&!
            html <html><body bgcolor=silver>
            if '$data(key) write "No keyword supplied</body></html>",! halt
            html <center>Results for &~key~</center><hr>
            html <pre>
            set x="^mesh"  // build a global array ref
            set x=$query(x)
            if x="" halt
            if $piece(x,"(",1)'="^mesh" break
            for  do
            . if '$find(@x,key) set x=$query(x) // is key stored at this ref?
            . else  do
            .. set i=$qlength(x)  // number of subscripts
            .. write x," ",@x,!
            .. for  do
            ... set x=$query(x)
            ... if x=""!($piece(x,"(",1)'="^mesh") write "</pre></body></html>",! halt
            ... if $qlength(x)'>i break
            ... write ?5,x," ",@x,!
            . if x="" write "</pre></body></html>",! halt

      Now copy mtree.mps and mtrees2003.txt to /var/www/cgi-bin. Run mtree.mps, then make the database files key.dat and data.dat world readable and world writable. These now contain the MESH data base that the server side query program isr.mps will access.

      Some notes:

      • The HTML file isr.html sets up an HTML form that can be used to collect information. The FORM tag allows for single line text, a box of text, radio buttons, check boxes and selection lists (drop down boxes). Each item of data collected, upon clicking the SUBMIT button, is placed into a URL and sent to the web server. In this example, only a line of text is collected.

      • The "value" strings are encoded by the browser as follows: alphabetics and numerics remain unchanged, blanks become plus signs, and most other characters appear in the form %XX, where XX is a hexadecimal number giving the character's collating sequence value. If more than one name=value pair is appended to the URL, the pairs are separated from one another by an ampersand (&).

      • The interpreter automatically reads QUERY_STRING (which contains the parameters following the question mark) and decodes them. For each "name" found, it creates a variable of the same name in the Mumps run-time symbol table initialized to the "value" field. Names should be unique although non-unique names can be handled (see the manual).

      • When your CGI program runs, its output is captured by the web server and sent back to the originating browser. The first thing you send to the web server MUST be the line:
        
        html Content-type: text/html &!&!
        
        

        exactly as typed. This tells the web server what is coming next. After this line, everything sent should be in HTML format. The Mumps command "html" is an output command that causes the remainder of the line to be written to the web server. Write commands can also be used, but the text then requires many quote marks. You may embed in the HTML line tokens of the form:

        &!  and &~expression~
        

        The first of these, &!, causes a new line. The second causes evaluation of the expression and the result to be written to the web server (the &~ and ~ are not sent).
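
      The encoding rules in the notes above can be tried out with Python's standard urllib.parse module, shown here only to make the rules concrete; the Mumps interpreter performs the decoding itself. The field names and values are sample data.

```python
from urllib.parse import parse_qs, urlencode

# Encode one form field the way a browser does for method="get".
query = urlencode({"key": "Bone and Bones"})
print(query)

# Decode a query string back into name -> list-of-values.
# A CGI program would normally take it from os.environ["QUERY_STRING"].
fields = parse_qs("key=Palate%2C+Hard&class=senior")
print(fields)
```

      Each name maps to a list because a name may legitimately appear more than once in a query string.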

      Now open a browser and enter:

      
      127.0.0.1/isr.html
      
      

      This will bring up the first screen shown below. Click Submit and the second screen will appear.

    7. The following code illustrates most of the major FORM data collection techniques:

      <form method="get" action="quiz2.cgi">
      <center>
      Name: <input type="text" name="name" size=40 value=""><br>
      </center>
      Class:
      <input type="Radio" name="class" value="freshman" > Freshman
      <input type="Radio" name="class" value="sophmore" > Sophomore
      <input type="Radio" name="class" value="junior" > Junior
      <input type="Radio" name="class" value="senior" checked> Senior
      <input type="Radio" name="class" value="grad" > Grad Student
      <br>
      Major:
      <select name="major" size=7>
      <option value="computer science" >computer science
      <option value="mathematics" >Mathematics
      <option value="biology" selected>Biology
      <option value="chemistry" >Chemistry
      <option value="earth science" >Earth Science
      <option value="industrial technology" >Industrial Technology
      <option value="physics" >Physics
      </select>
      <table border>
      <tr>
      <td valign=top> Hobbies: </td>
      <td>
      <input type="Checkbox" name="hobby1" value="stamp collecting" > Stamp Collecting<br>
      <input type="Checkbox" name="hobby2" value="art" > Art<br>
      <input type="Checkbox" checked name="hobby3" value="bird watching" > Bird Watching<br>
      <input type="Checkbox" name="hobby4" value="hang gliding" > Hang Gliding<br>
      <input type="Checkbox" name="hobby5" value="reading" > Reading<br>
      </td></tr>
      </table>
      <center>
      <input type="submit" value="go for it">
      </center>
      </form>

      which produces:

      You can see how it looks by clicking here. Click here for a good short list of HTML commands.
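      Because the form uses method="get", the browser appends the selected fields to the URL as a query string, and boxes that are not checked are simply not sent. The Python sketch below shows what such a query string looks like and how it can be split back into fields. The field names and values come from the form; the particular selection (the name "Jane Doe" and so on) is made up for illustration, and Python's urllib stands in for the decoding that the Mumps CGI program would do.

```python
from urllib.parse import urlencode, parse_qs

# A hypothetical submission: one entry per field the browser would send.
submitted = [
    ("name", "Jane Doe"),
    ("class", "senior"),
    ("major", "biology"),
    ("hobby3", "bird watching"),   # only the checked box appears
]

query = urlencode(submitted)       # the text after the "?" in the URL
print(query)

fields = parse_qs(query)           # maps each field name to a list of values
print(fields["major"][0])
```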

    8. Write a program to compute an optimal binary tree. See this link

            read "n " n
            for i=1:1:n do
            . write "p",i," "
            . read p(i)
            for i=0:1:n do
            . write "q",i," "
            . read q(i)
            for i=0:1:n do
            . for j=0:1:n do
            .. set r(i,j)=0
            for i=0:1:n do
            . set c(i,i)=0
            . set w(i,i)=q(i)
            . for j=i+1:1:n do
            .. if j'>n set w(i,j)=w(i,j-1)+p(j)+q(j)
            for j=1:1:n do
            . set c(j-1,j)=w(j-1,j),r(j-1,j)=j
            for d=2:1:n do
            . for j=d:1:n do
            .. set i=j-d,y=r(i,j-1)
            .. set x=c(i,y-1)+c(y,j)
            .. do xx
            .. set c(i,j)=w(i,j)+x,r(i,j)=y
            write !,"matrix",!
            for m=0:1:n-1 do
            . write !
            . for l=1:1:n do
            .. write r(m,l)," "
            write !,!
            set s=1
            set s(s)=0_","_n
            set c=1
            set nx=2
            set a(1)="b(0"
      y     if $piece(s(c),",",1)-$piece(s(c),",",2)=0 do
            . set c=c+1
            . if c<nx goto y
            . goto z
            set s(nx)=$piece(s(c),",",1)_","_(r(@s(c))-1)
            set a(nx)=a(c)_",1"
            set nx=nx+1
            set s(nx)=r(@s(c))_","_$p(s(c),",",2)
            set a(nx)=a(c)_",2"
            set nx=nx+1
            set c=c+1
            goto y
      z     for i=1:1:c-1 do
            . set a(i)=a(i)_")"
            for i=1:1:c-1 do
            . write a(i),!,s(i),!
            . set @a(i)=r(@s(i))
            for i=1:1:c-1 do
            . write !,a(i),"->",@a(i)
            halt
      xx    for k=r(i,j-1):1:r(i+1,j) do
            . if c(i,k-1)+c(k,j)<x do
            .. set x=c(i,k-1)+c(k,j)
            .. set y=k
            quit

      which when given input:

            n 7
            p1 2
            p2 3
            p3 2
            p4 4
            p5 2
            p6 3
            p7 2
            q0 1
            q1 1
            q2 1
            q3 1
            q4 1
            q5 1
            q6 1
            q7 1

      writes:

            matrix

            1 2 2 2 4 4 4
            0 2 2 3 4 4 4
            0 0 3 4 4 4 4
            0 0 0 4 4 5 6
            0 0 0 0 5 6 6
            0 0 0 0 0 6 6
            0 0 0 0 0 0 7

            b(0)
            0,7
            b(0,1)
            0,3
            b(0,2)
            4,7
            b(0,1,1)
            0,1
            b(0,1,2)
            2,3
            b(0,2,1)
            4,5
            b(0,2,2)
            6,7
            b(0,1,1,1)
            0,0
            b(0,1,1,2)
            1,1
            b(0,1,2,1)
            2,2
            b(0,1,2,2)
            3,3
            b(0,2,1,1)
            4,4
            b(0,2,1,2)
            5,5
            b(0,2,2,1)
            6,6
            b(0,2,2,2)
            7,7

            b(0)->4
            b(0,1)->2
            b(0,2)->6
            b(0,1,1)->1
            b(0,1,2)->3
            b(0,2,1)->5
            b(0,2,2)->7
            b(0,1,1,1)->0
            b(0,1,1,2)->0
            b(0,1,2,1)->0
            b(0,1,2,2)->0
            b(0,2,1,1)->0
            b(0,2,1,2)->0
            b(0,2,2,1)->0
            b(0,2,2,2)->0
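      For readers who want to follow the table-building part of the program in another notation, here is a Python sketch of the same dynamic programming step. It mirrors the w, c and r arrays of the Mumps code and searches for each root only between r(i,j-1) and r(i+1,j), a restriction usually attributed to Knuth. The function name optimal_bst is an assumption of this sketch, and the part of the Mumps program that unfolds the tree into b(...) is not reproduced.

```python
def optimal_bst(p, q):
    """Return (cost, root) tables for keys 1..n.

    p[k-1] is the weight of key k; q[k] is the weight of the gap after
    key k (q[0] is the gap before the first key).
    """
    n = len(p)
    p = [0] + list(p)                          # shift so p[1..n] are the key weights
    w = [[0] * (n + 1) for _ in range(n + 1)]  # total weight of keys i+1..j and their gaps
    c = [[0] * (n + 1) for _ in range(n + 1)]  # cost of the best subtree on keys i+1..j
    r = [[0] * (n + 1) for _ in range(n + 1)]  # root of that subtree (0 if empty)

    for i in range(n + 1):
        w[i][i] = q[i]
        for j in range(i + 1, n + 1):
            w[i][j] = w[i][j - 1] + p[j] + q[j]

    for j in range(1, n + 1):                  # subtrees holding a single key
        c[j - 1][j] = w[j - 1][j]
        r[j - 1][j] = j

    for d in range(2, n + 1):                  # subtrees holding d keys
        for j in range(d, n + 1):
            i = j - d
            best_k = r[i][j - 1]
            best = c[i][best_k - 1] + c[best_k][j]
            # Try only the candidate roots between r[i][j-1] and r[i+1][j].
            for k in range(r[i][j - 1], r[i + 1][j] + 1):
                cost = c[i][k - 1] + c[k][j]
                if cost < best:
                    best, best_k = cost, k
            c[i][j] = w[i][j] + best
            r[i][j] = best_k
    return c, r


cost, root = optimal_bst([2, 3, 2, 4, 2, 3, 2], [1] * 8)
print(root[0][1:])               # first row of the root matrix
print(root[0][7], cost[0][7])    # root and cost of the whole tree
```

      With the input shown above, the first row printed agrees with the first row of the matrix written by the Mumps program, and the root of the whole tree is key 4.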

    9. Dump/restore and data base compression

      As is the case with many data base systems, once disk blocks have been allocated, they remain as permanent parts of the file system, even if, due to deletions, they are no longer needed. In some systems, this results in an accumulation of unused blocks. In a B-tree based system such as that used in Mumps, block occupancy can vary considerably after many deletions and reorganizations. In order to remove unused blocks and rebuild the B-tree with blocks that are mostly half filled, the data base should be dumped to a sequential, collated ASCII file, the old data base (key.dat and data.dat) erased, and then the data base restored from the ASCII file.

      There are two functions in Mumps to accomplish this: $zcd() and $zcl(). The first of these, $zcd(), writes the full data base to disk as an ASCII file. If given a string parameter, it uses the contents of the string as the file name. If no file name is given, the default is the system time in seconds followed by ".dmp". The second function, $zcl(), restores the data base. If given a file name parameter, it loads from the file specified. If no parameter is given, it looks for a file named "dump".

      For example, in a large run of 25,000 abstracts which included creation and pruning of the ^doc(), ^index(), ^idf(), ^mca(), ^df() and ^dict() vectors as well as creation of ^tt() and ^dd() matrices, the global array data base was:

      -rw-rw-rw-    1 root     root          19M Mar  5 04:40 /d1/isr/code/data.dat
      -rw-rw-rw-    1 root     root         262M Mar  5 04:40 /d1/isr/code/key.dat
      

      After a dump/restore cycle it was:

      -rw-rw-rw-    1 root     root         8.5M Mar  5 09:52 data.dat
      -rw-rw-rw-    1 root     root         107M Mar  5 09:52 key.dat
      

      The intermediate dump file was 38M bytes in length. In this case, the dump/restore cycle yielded a savings of more than 2 to 1 and, consequently, faster access because fewer blocks are searched. The steps are as follows:

      run the program:
      
            #!/usr/bin/mumps
            #
            #     dump the data base
            #
                  set %=$zcd
      
      followed by the system command:
      
            mv 11100370.dmp dump
      
      which renames the dump data set, followed by the system commands:
      
            rm key.dat
            rm data.dat
      
      which delete the old data sets, followed by running the program:
      
            #!/usr/bin/mumps
            #
            #     restore the data base
            #
                  set %=$zcl
      
      which reloads and rebuilds the data base.
      
      

      Dump/restore routines can be used to create backup copies of a data base for later restoration. A dump/restore is generally very quick, taking only a few minutes (depending on file size). This is due to the relatively sequential nature of the B-tree load.
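      The savings quoted for the 25,000 abstract run can be checked with a little arithmetic on the file sizes listed above. The short Python sketch below does only that; the sizes are the rounded figures from the directory listings, so the ratio is approximate.

```python
# File sizes in megabytes, as rounded in the listings above.
before_mb = 19 + 262       # data.dat + key.dat before the dump/restore
after_mb = 8.5 + 107       # data.dat + key.dat after the dump/restore

ratio = before_mb / after_mb
print(f"{before_mb} MB -> {after_mb} MB, about {ratio:.2f} to 1")
```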

    10. Sorting from Mumps

      The easiest way to sort is to write out a file, close it, and then invoke the system "sort" program. For example, suppose you have a vector of words containing their frequency of occurrence (^dict) and you want to order them by frequency of occurrence. The vector itself is ordered alphabetically by word, the primary index. You can produce a list sorted by frequency with the following:

      #!/usr/bin/mumps
            open 1:"temp.dat,new"
            use 1
            for w=$order(^dict(w)) do
            . write ^dict(w)," ",w,!
            use 5
            close 1
            set i=$zsystem("sort -n < temp.dat > temp1.dat") // -n means numeric
            if i!=0 do
            . write "sort failed",!
            . set %=$zsystem("rm temp.dat")
            . set %=$zsystem("rm temp1.dat")
            . halt
            open 1:"temp1.dat,old"
            for  do
            . use 1 read line
            . if !$test break
            . use 5 write line,!
            use 5
            close 1
            set %=$zsystem("rm temp.dat")
            set %=$zsystem("rm temp1.dat")

      While it is possible to use global arrays to sort, it is generally a bad idea. The system sort program is much faster and more efficient. The sort program has many options, such as the -n (numeric sort) shown above. These include the ability to sort ascending, descending and on multiple fields. See the documentation by typing "man sort" on a Linux or Unix system.
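      To make the intended result concrete, the following Python sketch produces the same kind of listing in memory: each word is paired with its count and the pairs are ordered numerically by count, as "sort -n" does with the "count word" lines. The sample counts and the function name sort_by_frequency are made up for illustration; for large vocabularies the external sort described above remains the approach this text recommends.

```python
def sort_by_frequency(dictionary):
    """Return (count, word) pairs in ascending numeric order of count."""
    return sorted((count, word) for word, count in dictionary.items())

# Hypothetical word counts standing in for the ^dict vector.
dict_counts = {"cell": 12, "acid": 40, "rat": 7}

for count, word in sort_by_frequency(dict_counts):
    print(count, word)
```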


    1. Experimental Data Bases

      For some of the experiments below, the OHSUMED TREC-9 data base was used. The OHSUMED test collection is a set of 348,566 references from Medline, the on-line medical information database, consisting of titles and/or abstracts from 270 medical journals over a five-year period (1987-1991). The data base was filtered and reformatted to conform to a style similar to that used by online NLM Medline abstracts. A compressed, filtered copy of the data base is here. The C++ filtering program is:

      // ==============================================================
      //#+ 
      //#+ Copyright (C) 2005 by Kevin C. O'Kane  
      //#+ 
      //#+ Kevin C. O'Kane, Ph.D.
      //#+ Computer Science Department
      //#+ University of Northern Iowa
      //#+ Cedar Falls, IA 50614-0507
      //#+ Tel 319 273 7322
      //#+ okane@cs.uni.edu
      //#+ http://www.cs.uni.edu/~okane
      //#+ 
      //#+ Consult individual modules for copyright details
      //#+ The runtime libraries are covered by the following license:
      //#+  
      //#+ This library is free software; you can redistribute it and/or
      //#+ modify it under the terms of the GNU Lesser General Public
      //#+ License as published by the Free Software Foundation; either
      //#+ version 2.1 of the License, or (at your option) any later version.
      //#+ 
      //#+ This library is distributed in the hope that it will be useful,
      //#+ but WITHOUT ANY WARRANTY; without even the implied warranty of
      //#+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
      //#+ Lesser General Public License for more details.
      //#+ 
      //#+ You should have received a copy of the GNU Lesser General Public
      //#+ License along with this library; if not, write to the Free Software
      //#+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
      //#+ 
      //#+==============================================================
      
      // cvtosu.cpp March 5, 2005
      
      #include <iostream>
      #include <cstdio>
      #include <cstring>
      
      using namespace std;
      
      int main() {
      char line[8192];
      long i,j,k;
      long acc=0;
      int first=1;
      
      while (1) {
            if (fgets(line,8192,stdin)==NULL) break;
            i=strlen(line);
            line[i-1]='\0';
            if (strncmp(line,".I",2)==0) {
                  if (!first) cout << endl;
                  first=0;
                  cout << "STAT- MEDLINE" << endl;
                  continue;
                  }
            if (strncmp(line,".T",2)==0) {
                  acc++;
                  fgets(line,8192,stdin);
                  i=strlen(line);
                  line[i-1]='\0'; // remove newline
                  j=0;
                  cout << "TI    " << line << endl;
                  continue;
                  }
            if (strncmp(line,".W",2)==0) {
                  fgets(line,8192,stdin);
                  cout << "AB    ";
                  i=strlen(line);
                  line[i-1]='\0'; // remove newline
                  j=0;
                  for (i=0; line[i]!=0; i++) {
                        cout << line[i];
                        j++;
                        if (j>50 && line[i]==' ') { cout << endl << "      "; j=0; }
                        }
                  cout << endl;
                  }
            if (strncmp(line,".M",2)==0) {
                  char word[512];
                  fgets(line,8192,stdin);
                  i=strlen(line);
                  line[i-2]='\0';
                  word[0]=0;
                  while (1) {
                        for (j=0; line[j]!=0; j++) {
                              if (line[j]==';') break;
                              word[j]=line[j];
                              }
                        word[j]=0;
                        cout << "MH    " << word << endl;
                        if (line[j]==0) break;
                        j++;
                        if (line[j]==0) break;
                        while (line[j]==' ') j++;
                        memmove(line,&line[j],strlen(&line[j])+1); // regions overlap, so strcpy is unsafe
                        if (line[0]==0) break;
                        }
                  }
            }
      return 0;
      }
      
      
      which produces text of the format:
      
      STAT- MEDLINE
      MH    Acetaldehyde/*ME
      MH    Buffers
      MH    Catalysis
      MH    HEPES/PD
      MH    Nuclear Magnetic Resonance
      MH    Phosphates/*PD
      MH    Protein Binding
      MH    Ribonuclease, Pancreatic/AI/*ME
      MH    Support, U.S. Gov't, Non-P.H.S.
      MH    Support, U.S. Gov't, P.H.S.
      TI    The binding of acetaldehyde to the active site of ribonuclease: alterations in catalytic activity and 
      AB    Ribonuclease A was reacted with [1-13C,1,2-14C]acetaldehyde
            and sodium cyanoborohydride in the presence or absence
            of 0.2 M phosphate. After several hours of incubation
            at 4 degrees C (pH 7.4) stable acetaldehyde-RNase adducts
            were formed, and the extent of their formation was
            similar regardless of the presence of phosphate. Although
            the total amount of covalent binding was comparable
            in the absence or presence of phosphate, this active
            site ligand prevented the inhibition of enzymatic activity
            seen in its absence. This protective action of phosphate
            diminished with progressive ethylation of RNase, indicating
            that the reversible association of phosphate with the
            active site lysyl residue was overcome by the irreversible
            process of reductive ethylation. Modified RNase was
            analysed using 13C proton decoupled NMR spectroscopy.
            Peaks arising from the covalent binding of enriched
            acetaldehyde to free amino groups in the absence of
            phosphate were as follows: NH2-terminal alpha amino
            group, 47.3 ppm; bulk ethylation at epsilon amino groups
            of nonessential lysyl residues, 43.0 ppm; and the epsilon
            amino group of lysine-41 at the active site, 47.4 ppm.
            In the spectrum of RNase ethylated in the presence
            of phosphate, the peak at 47.4 ppm was absent. When
            RNase was selectively premethylated in the presence
            of phosphate, to block all but the active site lysyl
            residues and then ethylated in its absence, the signal
            at 43.0 ppm was greatly diminished, and that arising
            from the active site lysyl residue at 47.4 ppm was
            enhanced. These results indicate that phosphate specifically
            protected the active site lysine from reaction with
            acetaldehyde, and that modification of this lysine
            by acetaldehyde adduct formation resulted in inhibition
            of catalytic activity.
      
      STAT- MEDLINE
      MH    Adult
      MH    Alcohol, Ethyl/*AN
      MH    Breath Tests/*
      MH    Human
      MH    Irrigation
      MH    Male
      MH    Middle Age
      MH    Mouth/*
      MH    Temperature
      MH    Water
      TI    Reductions in breath ethanol readings in normal male volunteers following mouth rinsing with water 
      AB    Blood ethanol concentrations were measured sequentially,
            over a period of hours, using a Lion AE-D2 alcolmeter,
            in 12 healthy male subjects given oral ethanol 0.5
            g/kg body wt. Readings were taken before and after
            rinsing the mouth with water at varying temperatures.
            Mouth rinsing resulted in a reduction in the alcolmeter
            readings at all water temperatures tested. The magnitude
            of the reduction was greater after rinsing with water
            at lower temperatures. This effect occurs because rinsing
            cools the mouth and dilutes retained saliva. This finding
            should be taken into account whenever breath analysis
            is used to estimate blood ethanol concentrations in
            experimental situations.
      
      STAT- MEDLINE
      MH    Alcoholism/*PP
      MH    Animal
      MH    Diprenorphine/PD
      MH    Female
      MH    Morphine/*PD
      MH    Naloxone/PD
      MH    Naltrexone/PD
      MH    Narcotic Antagonists/*PD
      MH    Rats
      MH    Rats, Inbred Strains
      MH    Receptors, Endorphin/*DE/PH
      MH    Seizures/PP
      MH    Substance Withdrawal Syndrome/PP
      TI    Does the blockade of opioid receptors influence the development of ethanol dependence?
      AB    We have tested whether the opioid antagonists naloxone
            (2 mg/kg), naltrexone (2 mg/kg) and diprenorphine (0.2
            mg/kg), and the agonist morphine (4-8 mg/kg) given
            subcutaneously (10 min before ethanol for 7 days) modify
            the ethanol withdrawal syndrome (audiogenic seizures)
            following chronic ethanol intoxication in rats. We
            found that naloxone, naltrexone and diprenorphine modified
            the ethanol withdrawal syndrome. These findings do
            not rule out the possibility of a biochemical link
            between the action of ethanol and opiates at the level
            of opioid receptors.
      

      This file is used in some of the programs shown below.

      Additionally, a modified version of the basic OSU Medline file http://www.cs.uni.edu/~okane/source/ISR/medline.translated.txt.gz is also used in the experiments. In this version:

      • each document is on one line;
      • each line begins with the token xxxxx115xxxxx;
      • following the beginning token and separated by one blank is the offset in bytes of the start of the abstract entry in the long form of the file shown above;
      • next follows, separated by a blank, the document number;
      • the remainder of the line consists of the words of the document, processed as follows:
        • words shorter than 3 or longer than 25 letters are deleted;
        • all words are reduced to lower case;
        • all non-alphanumeric punctuation is removed;
        • the words have been processed by a basic stemming procedure leaving only the roots.
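      The line layout just described can be taken apart with a few lines of code. The Python sketch below is based only on the description in the list above, not on an inspection of the file itself; the function name parse_document_line and the sample line are made up for illustration.

```python
def parse_document_line(line, marker="xxxxx115xxxxx"):
    """Split one line of the translated file into (offset, document number, words)."""
    fields = line.split()
    if len(fields) < 3 or fields[0] != marker:
        raise ValueError("line does not begin with the expected token")
    offset = int(fields[1])        # byte offset of the abstract in the long form
    doc_number = int(fields[2])    # document number
    words = fields[3:]             # stemmed, lower case words of the document
    return offset, doc_number, words

# A hypothetical line in the described format.
sample = "xxxxx115xxxxx 0 1 bind acetaldehyd activ site ribonucleas"
print(parse_document_line(sample))
```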

    2. Vocabularies

      Traditionally, indexing has been conducted manually by experts in a subject who read each document and classify it according to content. Increasingly, manual indexing is being overtaken by automated indexing, of the kind performed by Google and other online indexing and information storage and retrieval systems.

      In any indexing scheme, there is a distinction between a "controlled" and an "uncontrolled" vocabulary scheme. A "controlled" vocabulary indexing scheme is one in which previously agreed upon standardized terms, categories and hierarchies are employed. On the other hand, an "uncontrolled" vocabulary based system is one that derives these from the text directly.

      In a controlled vocabulary based system, subjects are described using the same preferred term each time and place they are indexed, thus ensuring uniformity across user populations and making it easier to find all information about a specific topic during a search. Controlled vocabularies exist in many specific fields. These take the form of dictionaries, hierarchies, and thesauri which structure the content of the underlying discipline into commonly accepted categories. For the most part, these are constructed and maintained by government agencies (such as the National Library of Medicine in the U.S.) or professional societies (such as the ACM).

      One example is the Association for Computing Machinery Computing Classification System (1998), which is used to classify documents published in the computing literature. This system is hierarchical and invites the author or reviewer of a document to place the document under those categories to which the document most specifically applies and at the level in the tree that best corresponds to the generality of the document. For example, consider the following extract of the ACM system:

      Copyright 2005, by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.
      • D.4 OPERATING SYSTEMS (C)
        • D.4.0 General
        • D.4.1 Process Management
          • Concurrency
          • Deadlocks
          • Multiprocessing/multiprogramming/multitasking
          • Mutual exclusion
          • Scheduling
          • Synchronization
          • Threads NEW!
        • D.4.2 Storage Management
          • Allocation/deallocation strategies
          • Distributed memories
          • Garbage collection NEW!
          • Main memory
          • Secondary storage
          • Segmentation [**]
          • Storage hierarchies
          • Swapping [**]
          • Virtual memory
        • D.4.3 File Systems Management (E.5)
          • Access methods
          • Directory structures
          • Distributed file systems
          • File organization
          • Maintenance [**]
        • D.4.4 Communications Management (C.2)
          • Buffering
          • Input/output
          • Message sending
          • Network communication
          • Terminal management [**]
        • D.4.5 Reliability
          • Backup procedures
          • Checkpoint/restart
          • Fault-tolerance
          • Verification
        • D.4.6 Security and Protection (K.6.5)
          • Access controls
          • Authentication
          • Cryptographic controls
          • Information flow controls
          • Invasive software (e.g., viruses, worms, Trojan horses)
          • Security kernels [**]
          • Verification [**]
        • D.4.7 Organization and Design
          • Batch processing systems [**]
          • Distributed systems
          • Hierarchical design [**]
          • Interactive systems
          • Real-time systems and embedded systems
        • D.4.8 Performance (C.4, D.2.8, I.6)
          • Measurements
          • Modeling and prediction
          • Monitors
          • Operational analysis
          • Queueing theory
          • Simulation
          • Stochastic analysis
        • D.4.9 Systems Programs and Utilities
          • Command and control languages
          • Linkers [**]
          • Loaders [**]
          • Window managers
        • D.4.m Miscellaneous

      Numerous other examples abound, especially in technical disciplines where nomenclature is precise. For example:

      For a very long list, see American Society of Indexers Thesauri Online

      In a manually indexed collection that uses a controlled vocabulary, experts trained in the vocabulary read the documents and assign vocabulary or hierarchy codes to them. Historically, because of the complexity of the terminology and the expense of conducting online searches, these systems were accessed by trained personnel who acted as intermediaries, expressing users' needs in the precise vocabulary of the discipline. Prior to the advent of the internet, online database searching was expensive and time consuming. In recent years, however, with ubiquitous internet access and vastly cheaper computer facilities, data bases are increasingly queried directly by the end user.

      Uncontrolled or derived vocabulary systems have been around for many years. These derive their terms directly from the text. Among the earliest forms were biblical concordances such as:

      Manual construction of concordances is tedious at best but well suited as a computer application. A computer based uncontrolled or derived vocabulary can be constructed through numerical analysis of word usage in the collection as a whole. On the other hand, controlled vocabularies may also be used in computer based systems with the aid of training sets of documents.
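      As a small illustration of why concordance building suits a computer, the Python sketch below derives its vocabulary directly from the text and records, for each word, the documents in which it occurs. It is a minimal sketch under simple assumptions (words are runs of letters and digits, no stemming or stop list); the function name build_concordance and the two sample documents are made up for illustration.

```python
import re
from collections import defaultdict

def build_concordance(documents):
    """Map each word to the sorted list of document numbers in which it occurs."""
    index = defaultdict(set)
    for doc_number, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_number)
    return {word: sorted(docs) for word, docs in sorted(index.items())}

# Two hypothetical documents.
documents = {
    1: "Ethanol and opioid receptors",
    2: "Breath ethanol readings",
}

concordance = build_concordance(documents)
for word, docs in concordance.items():
    print(word, docs)
```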

    3. Zipf's Law

      Zipf's Law states that the frequency ordered rank of a term in a document collection times its frequency of occurrence is approximately equal to a constant:

      Frequency * rank ~= constant

      where Frequency is the total number of times some term k occurs. Rank is the position number of the term when the terms have been sorted by Frequency. T