Journal of the Association for Information Science

 

Bert R. Boyce
 

689
 

 RESEARCH

 

An Evaluation of Retrieval Effectiveness Using Spelling-Correction and String-Similarity Matching Methods on Malay Texts
Zainab Abu Bakar, Tengku Mohd T. Sembok, and Mohammed Yusoff

We begin this issue with Bakar et alia's evaluation of string-matching methods on Malay texts. Much current (post-1960) Malay text is written in the Rumi alphabet, a romanised system based on English phonemes. English conflation algorithms can therefore be used effectively. Because of prefixes and infixes, stemming alone is not effective, and the addition of n-gram matching is required. Using a data set of 5085 unique Malay words and 84 query words, eight phonetic code lists were created by applying four coding methods to stemmed and unstemmed dictionaries. One hundred words surrounding a matched key are chosen, equally above and below it unless the key is too close to the top or bottom of the list. Stemming proves to be very helpful, as does phonetic coding. Smaller key sizes seem to perform better. Digram, an existing string-matching algorithm, gave the best relevant-and-retrieved results.
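
The n-gram matching mentioned above can be illustrated with two-character grams. The sketch below is only an illustration: the summary does not say which overlap formula the study used, so the Dice coefficient is an assumption here, as are the function names and the sample Malay words.

```python
def digrams(word):
    """Return the set of adjacent character pairs in a lowercased word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def dice_similarity(a, b):
    """Dice coefficient over the two words' digram sets (0.0 to 1.0).

    Assumption: Dice is one common overlap score; the study's own
    formula is not given in the summary.
    """
    da, db = digrams(a), digrams(b)
    if not da and not db:
        # Words shorter than two characters have no digrams at all.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(da & db) / (len(da) + len(db))


def rank_by_similarity(query, vocabulary):
    """Order vocabulary words from most to least similar to the query."""
    return sorted(vocabulary, key=lambda w: dice_similarity(query, w), reverse=True)


if __name__ == "__main__":
    # Hypothetical sample words, not taken from the study's data set.
    for word in rank_by_similarity("makan", ["minum", "makanan", "makan"]):
        print(word, round(dice_similarity("makan", word), 3))
```

Because the score depends only on shared character pairs, a word that differs from the query by an affix still scores well, which is the property that makes such matching a useful supplement to stemming.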
 

691
 
Managing Heterogeneous Information Systems through Discovery and Retrieval of Generic Concepts
Uma Srinivasan, Anne H.H. Ngu, and Tom Gedeon

Within application domains, users with common objectives create heterogeneous databases to store and manage similar data types. Usage patterns indicate the knowledge of the users. The notion, for Srinivasan et alia, is to create a "middle layer" of concepts, extracted from similar patterns in existing systems and from the use of these systems, which can wrap the existing databases and provide a common access mechanism. Entities, defined in existing systems as sets of variables, are extracted and classed using similarity measures based on commonality in structure and use patterns. Those classed together represent a common, application-specific generic concept.

For each pairing of a class with a user group, a "group data object" is created. A tree of group data objects that represents user types at different levels of specificity is generated from user-supplied terms and query-extracted terms for each user type. A user is mapped into a user type, and then the appropriate group data objects are generated and their labels displayed to the user for selection. Selection generates the extractors from each database for that user type in that group data object. Three medical databases were clustered, yielding eight concept classes, and multiple user objects were created. Tests showed varied query production in the same concept classes for the various groups.

707
 
Raising Reliability of Web Search Tool Research through Replication and Chaos Theory

Scott Nicholson

After reviewing the literature of evaluative web search tool research, Nicholson replicates the 1996 Ding and Marchionini search service study ten times during the summer of 1998. Previous work finds that replication yields significantly different results over time. The first twenty pages returned by Infoseek, Lycos, Alta Vista, and Excite for the five queries were examined and rated between 0 and 5 for relevance. Differing engine rankings for each replication are the rule. Using two queries, one designed to have a stable answer and another a dynamic answer over time, the four systems were tried again in five successive weeks. New pages appearing among the first 20 pages in each successive week were counted, as were pages that changed ranked position. Both queries showed considerable change from week to week. The results were aggregated, and the frequency with which each engine returned the highest number of relevant documents was found to show a replicable pattern over all weeks, the odd weeks, and the even weeks. This pattern provides a clear ranking of the four engines, which was not determinable from the individual replications.
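
The week-to-week comparison described above amounts to two counts per pair of result lists. A minimal sketch follows, assuming each week's results are held as a ranked list of page identifiers; the function name and the sample identifiers are hypothetical and not taken from the study.

```python
def weekly_change(previous, current):
    """Compare two ranked result lists (best first) from successive weeks.

    Returns (new_pages, moved_pages): how many pages in the current list
    were absent the week before, and how many appear in both weeks but
    at a different rank.
    """
    prev_rank = {page: rank for rank, page in enumerate(previous)}
    new_pages = [page for page in current if page not in prev_rank]
    moved_pages = [
        page
        for rank, page in enumerate(current)
        if page in prev_rank and prev_rank[page] != rank
    ]
    return len(new_pages), len(moved_pages)


if __name__ == "__main__":
    # Hypothetical identifiers standing in for returned pages.
    week1 = ["a", "b", "c", "d"]
    week2 = ["b", "a", "c", "e"]
    print(weekly_change(week1, week2))
```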
 

724
 
The Personal Construction of Information Space
Cliff McKnight

An information space, according to McKnight, is just the objects, real or virtual, that an individual uses to acquire information. A repertory grid is a means of externalizing a person's view of the world: a triad of elements is presented, and the subject is asked to say how two are the same and the third different. The focus that makes this possible is called a construct and is given a rating scale with the two poles as its extremes. Multiple constructs with element ratings provide an individual's view of a domain. Eleven information sources were elicited from a university lecturer and presented as triads. Ten constructs were elicited and the elements rated on the constructs. A cluster analysis reorders the grid so that similarly rated elements and similarly used constructs are adjacent. Both construct and element clusters seem to make sense and likely reflect the subject's views of his information space. It remains to be seen whether parts can be shared with other subjects.
 

730
 
Time-Line Interviews and Inductive Content Analysis: Their Effectiveness for Exploring Cognitive Behaviors
Linda Schamber  

Schamber uses her weather information data, collected by time-line interview techniques and content analysis, to address the effectiveness of these techniques. By soliciting a sequence of events in which weather information was needed and sought, and then the one event in the sequence in which information was most actively sought (the key event), the key event and those before and after it could be studied in some detail. The time-line technique provides an unobtrusive means of collecting data on perceptions and yields rich data. It is, however, a labor-intensive method. The content analysis was also unobtrusive and effective, but also very labor-intensive. In this framework, criteria are best defined from users' perceptions, which are validly indicated by self-reports.
 

734
 
Abstracts Produced Using Computer Assistance
Timothy C. Craven

Craven evaluates abstracts produced with the assistance of TEXNET, an experimental system that provides the abstractor with text words and phrases extracted by frequency after a stop-list pass. Three texts of approximately 2000 words each were chosen, and for each text a set of 20 different subjects, drawn by advertisement within a university community, created abstracts using TEXNET. Half got a display of keywords occurring eight or more times, and half got a display of phrases occurring with the same frequency. All subjects were surveyed as to background and reaction to the software, provided with a demonstration of the software, and told that their abstract should not exceed 250 words. Nine of these, including the author abstract, were read by three raters, again recruited by advertisement. Analysis shows no correlation between keywords or phrases and quality ratings or usefulness judgements by subjects. Experience did not lead to conciseness, originality, or approximation of the author abstract. Female gender correlated positively with length and with use of words from the text. Subjects wanted a simultaneous view of the text and the emerging abstract, easy scrolling, standard black-on-white screens, a dynamic word count, and a spell checker.
 

745
 
Encounters with the OPAC: On-Line Searching in Public Libraries
Deborah J. Slone

Slone looks at the behavior of OPAC users conducting known-item, area (broad search with most refinement off line), and unknown-item searches in a public library. Thirty-six participants, who approached the terminals and agreed to take part, answered a pre-search questionnaire on OPAC experience, reason for coming, and length of time spent planning their search. They were then observed, and their search terms, comments, reactions, age, gender, time on line, and outcome were logged. Feelings were inferred from observation and noted, except that confidence level was solicited in the questionnaire. Twenty-eight began confident, but only 14 displayed confidence during their search. Successful unknown-item searchers began broadly and focused with terms selected from initial results. Area searchers searched broadly for a general area and focused at the shelves, using minimal computer resources. Known-item searches were quickly effective at the terminal. Frustration, anxiety, and disappointment abounded during unknown-item searches.
 

757
 
Using Clustering Strategies for Creating Authority Files
James C. French, Allison L. Powell, and Eric Schulman

When disparate bibliographic databases are integrated, different authority conventions prevent physical combination and require a mapping that hides the heterogeneity from users. French, Powell, and Schulman advance automated techniques to assist those maintaining authority for author affiliations in the Astrophysics Data System. Strings were extracted, clustered, and reviewed by a domain expert, and the process was iterated to a final form. Concentration was on an ideal set of 38 institutions represented by 1,745 variant strings, with a goal of properly clustering these while excluding instances of the other 12,139 identified strings from the ideal clusters. First, a lexical cleanup was run, removing uppercase, country designations at the end of a string, ZIP codes, and state abbreviations, and expanding a list of abbreviations. Then string and frequency-of-occurrence pairs are sorted and, beginning with the most common string, its distance to all other strings is computed; those falling within a threshold are clustered with the most common item and removed from consideration. The process is iterated until the file is exhausted. The distance measures tested are edit distance (i.e., the number of simple operations, of four kinds, required to transform one string into another), edit distance with words rather than characters, and the Jaccard coefficient. Allowing the threshold to be some fraction of the length of the shorter string improves results over a fixed threshold, but the higher thresholds required to cluster all variants still result in significant errors. Required human effort rises with the number of misplaced strings, but such effort is roughly halved by the clustering procedure.
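
The iterative clustering pass described above can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: it uses plain Levenshtein distance with three operations (the summary does not name the four it refers to), it treats the threshold as a maximum allowed distance equal to a fraction of the shorter string's length, and the function names, the default fraction of 0.25, and the sample strings are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cluster_strings(counts, fraction=0.25):
    """Greedy clustering seeded by the most frequent remaining string.

    counts maps each cleaned string to its frequency of occurrence.
    A string joins the seed's cluster when its distance to the seed is
    at most fraction * (length of the shorter of the two strings).
    """
    remaining = sorted(counts, key=lambda s: (-counts[s], s))
    clusters = []
    while remaining:
        seed, rest = remaining[0], remaining[1:]
        members, leftover = [seed], []
        for s in rest:
            limit = fraction * min(len(seed), len(s))
            if edit_distance(seed, s) <= limit:
                members.append(s)
            else:
                leftover.append(s)
        clusters.append(members)
        remaining = leftover  # clustered strings drop out of consideration
    return clusters


if __name__ == "__main__":
    # Hypothetical affiliation strings, already lowercased.
    sample = {
        "university of virginia": 10,
        "nasa goddard": 5,
        "university of virgina": 2,
        "universty of virginia": 1,
    }
    for cluster in cluster_strings(sample):
        print(cluster)
```

Raising the fraction pulls more variants into each cluster but also admits more unrelated strings, which mirrors the trade-off the summary reports for higher thresholds.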
 

774
 
 BOOK REVIEWS

 

Inventing the Internet, by Jane Abbate
Cheryl Knott Malone
 

787

 

 

Internet Policy Handbook for Libraries, by Mark Smith
Janie L. Hassard Wilkins

788
 


2000, Association for Information Science