Journal of the Association for Information Science and Technology

 

 Bert R. Boyce
 

369
 

 RESEARCH

 

A Noninformetric Analysis of the Relationship between Citation Age and Journal Productivity
L. Egghe
Published online 1 February 2001

    In this issue Egghe provides an explanation, based upon the central limit theorem, for a regularity observed by Wallace between citation age and journal productivity. This implies that there is no informetric explanation, but rather the observation of a statistical effect. He then examines the Leimkuhler curve, showing the arcs at the tail to be a mathematical rather than an informetric artifact. The relationship between the fraction of multinational publications of a country and the country's fractional score is also shown to be probabilistic in nature. However, the relationship between the Price index and median age requires both probabilistic and informetric explanation, and the cumulative first citation distribution seems best explained with a curve incorporating Lotka's exponent and thus has high informetric value.

371

Automatic Cataloguing and Searching for Retrospective Data by Use of OCR Text
Yuen-Hsien Tseng
Published online 5 February 2001

       We also include four papers concerning the automatic characterization of documents and queries. First, using a test collection of 7990 OCR-scanned book pages from 500 books in four languages, and 30 queries (15 content and 15 known-item), Tseng applies variable-length n-gram indexing and byte-size normalization. Document terms are weighted at 1 plus the log of term frequency, except for those on the first two pages of a book, which are incremented by eight, not one. Each occurrence of a query term increments its weight by 1 plus the cube of the n-gram length minus 1. Known-item searches are limited to the first two pages of each book. The precision and recall results achieved second place in a contest entered. A similar approach has promise with Chinese text.
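The weighting rules in this summary can be sketched as follows. This is only an illustration under stated assumptions: the summary does not give the log base (natural log is assumed), "incremented by eight not one" is read as 8 + log(tf), and "the cube of the n-gram length minus 1" is read as (n - 1) cubed. The function names are hypothetical and are not taken from Tseng's paper.

```python
import math


def document_term_weight(term_frequency: int, on_first_two_pages: bool = False) -> float:
    """Weight of an indexed n-gram in a document.

    Assumed reading of the summary: 1 + log(tf) for ordinary terms and
    8 + log(tf) for terms found on the first two pages of a book.
    The natural logarithm is an assumption.
    """
    if term_frequency < 1:
        return 0.0
    base = 8.0 if on_first_two_pages else 1.0
    return base + math.log(term_frequency)


def query_term_weight(ngram: str, occurrences: int) -> int:
    """Weight of a query n-gram.

    Assumed reading: each occurrence adds 1 + (n - 1)**3, where n is the
    n-gram length, so longer matches count far more than short ones.
    """
    n = len(ngram)
    return occurrences * (1 + (n - 1) ** 3)
```

Under this reading a three-character n-gram that occurs twice in a query receives weight 18, while a two-character n-gram occurring once receives 2.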

378

An Experimental Study in Automatically Categorizing Medical Documents
Berthier Ribeiro-Neto, Alberto H.F. Laender, and Luciano R.S. de Lima
Published online 5 February 2001

   In another automatic characterization paper, Ribeiro-Neto et al. test their coding algorithm, which assigns International Code of Diseases (ICD) category codes to medical documents, against a file of 20,569 patient records. The ICD codes are represented as a directed acyclic graph and supplemented with acronym and synonym dictionaries for the codes. For each section of each document, the acronyms and synonyms are converted to code strings and root node codes are identified. A window of document terms around each root node term is created, and the longest path from the graph including these terms is extracted. These codes are assigned to the document in ranked order by relative path length for that root.

   With documents bearing specialist-assigned ICD codes as the ideal set, 19,651 were categorized at between 70 and 80% for all recall levels, while 918 were not. However, specialists made incorrect assignments in 589 documents, and in 391 made assignments not supported by the text, although these may have been the result of additional information. In only 158 cases was the algorithm clearly incorrect.

    391

Automatic Query Expansion via Lexical-Semantic Relationships
Jane Greenberg
Published online 9 February 2001

  Next, using 42 queries in the form of Boolean statements with free-text terminology, collected from MBA students and the ABI/Inform database, Greenberg maps the queries against the ProQuest Controlled Vocabulary, selecting those that contained at least one ProQuest term. These were searched in initial form, in a form mapped from ProQuest, and using expansions that took all synonyms, all narrower terms, all broader terms, and all related terms. Greenberg conducted all searches on Dialog and subtracted the initial and mapped results from the other returns to gauge the expansions' effectiveness. Relevance judgements were made on the basis of topical matching (aboutness) by the contributors of the queries, who reviewed the union set of the responses to the query forms, where each retrieved list was limited to 15 or fewer citations. If the retrieved set was under 16, all were presented; if between 16 and 100, the top 15 ranked by similarity to the query (Dice coefficient) were used; if above 100, a random sample of 100 was used for the similarity ranking. Broader terms and related terms each improved recall nearly 100%, while narrower terms increased the baseline from .266 to .473. Synonyms improved from the .226 base to .369. The baseline precision of .794 was reduced to .766 by the use of synonyms, to .733 by the use of narrower terms, to .544 by the use of related terms, and to .595 by the use of broader terms.
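The selection of citations for relevance judgement can be sketched as below. This is a minimal illustration, assuming that queries and citations are represented as sets of terms (the summary does not say how the Dice coefficient was computed) and using the thresholds of 15 and 100 given above. The function names and the data layout are hypothetical.

```python
import random


def dice(a: set, b: set) -> float:
    """Dice coefficient between two term sets: 2|A n B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def select_for_judgement(query_terms, retrieved, limit=15, sample_size=100, rng=None):
    """Pick the citations shown to the judge for one query form.

    retrieved is a list of (citation_id, term_set) pairs (an assumed layout).
    - at most `limit` items: present all of them
    - up to `sample_size` items: rank all by Dice similarity, keep the top `limit`
    - more than `sample_size`: rank a random sample of `sample_size` instead
    """
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    if len(retrieved) <= limit:
        return list(retrieved)
    pool = retrieved if len(retrieved) <= sample_size else rng.sample(retrieved, sample_size)
    ranked = sorted(pool, key=lambda item: dice(query_terms, item[1]), reverse=True)
    return ranked[:limit]
```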

   The logs kept by the Melvyl HTTP daemon were edited and tabulated using SAS. In the 2.5 million sessions, there were 3.6 million pre-search activities, 7.4 million searches, 13 million displays, 11 million other activities, and 60,000 help requests. Spiders accounted for 27% of the sessions, tourists 11%, and the remaining 62% were real sessions. While tourist and spider sessions are distributed relatively uniformly, real sessions display date and time sensitivity, peaking on Tuesday between two and three PM and bottoming on Saturday. The average session length is 10.3 minutes, with a standard deviation of nearly 18 minutes, and the length of real sessions gradually increases over time. Users spend about 25 seconds on pre-search and 36 seconds on each of the other classed activities, although each of these is performed with different frequencies within a session. The catalog and Medlars databases together account for 54% of use and the Magazine database for another 10%; all others account for less than 10% each. The time spent viewing the results is relatively constant over databases. Help actions are evenly distributed as to time and, when normalized by number of uses, as to database.

    402

Modeling User Interest Shift Using a Bayesian Approach
Wai Lam and Javed Mostafa
Published online 1 February 2001

   In a different approach to query modification, Lam and Mostafa address information filter modification in response to changing user needs. Such filters assume a stability of user need, when in fact information needs evolve at unpredictable speeds and in unpredictable ways, adding to the normal relevance assessment problems. Their passive filter stores and ranks material received for later review, building its profile from a subset of MeSH headings based on user relevance feedback assessments of documents presented. Documents are classed using a cosine similarity measure, the user provides binary interest weights for each class, and their running average is maintained as the relevance probability of the class, which is used to rank all classes after the first. Positive feedback also modifies a second vector used to select the initial class, providing a means of relearning changing interests. Since this relearning requires many iterations, with degraded interim results, a means of quick shift detection is needed. Using the sequence of feedback data and Bayes' theorem, with associated costs of a wrong decision, the posterior probability that a shift has occurred can be computed. An upward shift will result in the new class and the old most probable class each being assigned half of the sum of their probabilities. A downward shift of the most probable class will use the user profile vector to identify the class weights to sum and distribute, since the class vector values of the other classes will be near zero. Simulation studies indicate that the system is able to recognize and correct for interest shifts.
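The idea of detecting a shift from a feedback sequence with Bayes' theorem can be illustrated with a small sketch. This is not Lam and Mostafa's model: it assumes a simple two-hypothesis setup (interest in a class either stays at an old relevance probability or has moved to a new one), independent binary feedback, and a cost-weighted decision rule. All function names, parameters, and numbers are hypothetical.

```python
def likelihood(feedback, p_relevant):
    """Probability of a binary feedback sequence if each item is
    independently relevant with probability p_relevant (an assumption)."""
    result = 1.0
    for is_relevant in feedback:
        result *= p_relevant if is_relevant else (1.0 - p_relevant)
    return result


def shift_posterior(feedback, p_old, p_new, prior_shift):
    """Posterior probability that interest has shifted, by Bayes' theorem."""
    l_shift = likelihood(feedback, p_new)
    l_stay = likelihood(feedback, p_old)
    evidence = prior_shift * l_shift + (1.0 - prior_shift) * l_stay
    if evidence == 0.0:
        return prior_shift  # feedback impossible under both hypotheses
    return prior_shift * l_shift / evidence


def declare_shift(posterior, cost_missed_shift, cost_false_alarm):
    """Declare a shift when the expected cost of ignoring it is higher
    than the expected cost of a false alarm."""
    return posterior * cost_missed_shift > (1.0 - posterior) * cost_false_alarm


def apply_upward_shift(probabilities, new_class, old_top_class):
    """Give the new class and the old most probable class each half of
    the sum of their probabilities, as the summary describes."""
    shared = (probabilities[new_class] + probabilities[old_top_class]) / 2.0
    updated = dict(probabilities)
    updated[new_class] = shared
    updated[old_top_class] = shared
    return updated
```

For example, three consecutive positive judgements on a class previously thought to be relevant 10% of the time move the posterior probability of a shift from an assumed prior of 0.1 to roughly 0.99 when the alternative relevance probability is taken to be 0.9.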
 

416

General Purpose Compression for Efficient Retrieval
Adam Cannane and Hugh E. Williams
Published online 5 February 2001

   The final paper in this issue is concerned with compression techniques that can speed up retrieval, since disc seek and transfer cost savings can exceed decompression costs. Cannane and Williams describe an algorithm that, by way of multiple passes, identifies unique character strings occurring at least twice, replaces them with a reference number, and continues to form a hierarchy of longer strings that may contain references to shorter ones. The process terminates when no further duplicate strings are to be found. The representation created and an associated string dictionary allow decompression at any random access point. In compression experiments using the Canterbury collection, the TREC Wall Street Journal and WEBDOC files, and databases of genomic records, weather data, and geographic data, compression is found to be superior to GZIP, COMPRESS, and the Huffman coding scheme, but not as effective as BZIP2, although decompression is faster than with BZIP2.
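The general idea of replacing repeated strings with references and building a hierarchy can be illustrated with a simple pair-replacement scheme in the style of Re-Pair. This is a swapped-in, simplified technique chosen for brevity and is not the algorithm of Cannane and Williams; the function names are hypothetical.

```python
from collections import Counter


def compress(text):
    """Repeatedly replace the most frequent adjacent pair of symbols with a
    new integer reference until no pair occurs at least twice.

    Characters stay as str and references are int, so the two never collide.
    Overlapping occurrences are counted, so a rule may occasionally be
    applied only once (for example on "aaa").
    """
    seq = list(text)
    rules = {}  # reference number -> (left symbol, right symbol)
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        ref = len(rules)
        rules[ref] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(ref)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Expand one symbol; references may point to shorter references."""
    if isinstance(symbol, str):
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


def decompress(seq, rules):
    """Each symbol expands independently of its neighbours, so decoding
    can start at any position in the compressed sequence."""
    return "".join(expand(symbol, rules) for symbol in seq)
```

Because every symbol in the compressed sequence can be expanded using the dictionary alone, a reader can begin decoding at an arbitrary symbol, which is the property that makes such schemes attractive for retrieval.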

    430

 BOOK REVIEWS

 

Digital Capital: Harnessing the Power of Business Webs, by Don Tapscott, David Ticoll, & Alex Lowy
Shana R. Ponelis

        438

  

 

A Place at the Table: Participating in Community Building, by Kathleen de la Peña McCook
Marianne Orme

439

2001, Association for Information Science and Technology