Journal of the Association for Information Science and Technology

Table of Contents

Volume 55, Issue 3

In This Issue

187
In this issue
Bert Boyce

Research

189
Arabic Morphological Analysis Techniques: A Comprehensive Survey
Imad A. Al-Sughaiyer and Ibrahim A. Al-Kharashi
Published online 20 November 2003

Al-Sughaiyer and Al-Kharashi provide definitions of standard linguistic terms as they are used in Arabic analysis and identify efficiency, compactness, bi-directionality, success rate, and retrieval performance as measures of the effectiveness of morphological analysis algorithms. After reviewing the Arabic morphological analysis literature, they classify the approaches as table lookup (large construction demands and space requirements), linguistic (requiring many lists and removal of affixes by trial and error), combinatorial (large space and time requirements), or rule based (the authors' choice), and they summarize the work in each area. The majority of work is linguistic in nature, but there is little comparison of existing systems, and evaluation of the proposed algorithms is weak. Use of a word's root (the single basic morpheme) in an Arabic index leads to invalid conflation.
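
The rule-based family the survey favors works by matching and stripping affix patterns rather than consulting large lookup tables. A minimal, purely illustrative Python sketch of that idea follows; the tiny affix lists and the example word are assumptions for the sketch, not any algorithm from the surveyed literature.

```python
# Illustrative affix stripping, one ingredient of rule-based Arabic analysis.
# The affix lists below are a tiny invented subset.
PREFIXES = ["وال", "بال", "ال", "و", "ب", "ل"]   # longest first
SUFFIXES = ["ات", "ون", "ين", "ها", "ة"]

def light_stem(word: str) -> str:
    """Greedily remove one known prefix and one known suffix, if present."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

print(light_stem("المكتبات"))  # -> "مكتب"
```

A real rule-based analyzer would also match the remaining stem against morphological patterns and handle infixes, which is where the surveyed approaches differ most.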

214
Predicting Library of Congress Classifications From Library of Congress Subject Headings
Eibe Frank and Gordon W. Paynter
Published online 28 October 2003

Frank and Paynter attempt to assign Library of Congress Classification (LCC) number ranges to INFOMINE documents on the basis of their assigned LCSH headings, in order to provide a better browsing capability. Because they argue that it is logistically impossible for the librarians who already assign LCSH terms to INFOMINE records to also assign class numbers retrospectively, they devise a machine-learning technique to predict the classification. Rather than creating virtual LCSH documents to represent each LC class and assigning documents by similarity, they train a support vector machine classifier to determine which of the 21 top-level nodes is most likely, with further classifiers at each successive level, descending until a leaf is reached or a classifier chooses its own node. They use LCSH terms without subdivisions, together with class-number intervals from the LCC outline available on the Pharos Web site, both extracted and processed from existing MARC records to create a training set. The full set of 868,836 records was drawn from the UC Riverside library catalog, with 50,000 items reserved for testing and the remainder used for training at the different levels. Accuracy increases with training set size, though with diminishing returns, rising from 32% to 55% as the training set grows from 10,000 to 800,000 records. Less than 7% of errors are due to the classifier terminating too early or too late. With the large test set, 80% of first-level classification decisions are correct, falling to 16% at the seventh level. The learning algorithm scales roughly as the number of training instances raised to the power 1.7, and classification of new items proceeds at about 21 instances per second.
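
The cascade of per-node classifiers can be pictured with a small top-down sketch: train one classifier at each internal node of the class hierarchy, then walk down, letting each classifier either pick a child or stop at its own node. The toy taxonomy, headings, and scikit-learn pipeline below are illustrative assumptions, not Frank and Paynter's implementation.

```python
# Top-down hierarchical classification over a toy LCC-like taxonomy.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

STOP = "<stay>"  # a node's classifier may "choose itself" and halt the descent
taxonomy = {
    "ROOT": ["Q Science", "T Technology"],
    "Q Science": ["QA Mathematics", "QC Physics"],
    "T Technology": ["TK Electrical engineering"],
}

# Hypothetical training data: (LCSH-style heading, path of nodes below ROOT).
train = [
    ("Algebra", ["Q Science", "QA Mathematics"]),
    ("Quantum theory", ["Q Science", "QC Physics"]),
    ("Electric circuits", ["T Technology", "TK Electrical engineering"]),
    ("Science -- Philosophy", ["Q Science"]),  # stops at an internal node
]

def fit_node_classifiers(train, taxonomy):
    """Train one linear SVM per internal node on the examples routed through it."""
    classifiers = {}
    for node in taxonomy:
        texts, labels = [], []
        for heading, path in train:
            full_path = ["ROOT"] + path
            if node in full_path:
                i = full_path.index(node)
                labels.append(full_path[i + 1] if i + 1 < len(full_path) else STOP)
                texts.append(heading)
        if len(set(labels)) > 1:  # an SVM needs at least two classes
            clf = make_pipeline(TfidfVectorizer(), LinearSVC())
            clf.fit(texts, labels)
            classifiers[node] = clf
    return classifiers

def classify(heading, classifiers):
    """Descend until a leaf is reached or a node's classifier chooses itself."""
    node = "ROOT"
    while node in classifiers:
        prediction = classifiers[node].predict([heading])[0]
        if prediction == STOP:
            break
        node = prediction
    return node

classifiers = fit_node_classifiers(train, taxonomy)
print(classify("Quantum theory", classifiers))  # likely "QC Physics"
```

The appeal of the top-down design is that each classifier sees only the documents routed through its node, which keeps individual models small even when the full taxonomy is large.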

228
A Nonlinear Model of Information-Seeking Behavior
Allen Foster
Published online 11 November 2003

Foster disagrees with the conventional view of information seeking as a linear process of identifiable stages and iteration, particularly as applied to interdisciplinary information-seeking behavior, and he proposes a nonlinear model based on identifying the processes, contexts, and behaviors of such interdisciplinary activity and their relationships. In-depth structured interviews conducted in the workplace environment were used to collect data on searching examples provided by the subjects. Subjects were purposively selected from the University of Sheffield across multiple faculties for their interdisciplinary research and then used as the kernel for a snowball sampling expansion, resulting in 45 participants from diverse faculties. Transcript coding took place in multiple iterations, and the final results were confirmed by participant review. Activities viewed in conjunction with time lines did not support a linear stages model. The new model groups activities into three core categories: "opening" (moving from orientation to actual searching), "orientation" (identifying existing research and a direction for the search), and "consolidation" (refining and knowing when to stop). These operate within the boundaries of an "external context" that incorporates social and organizational influences, time, project and access constraints, and navigational issues. Within the external context lies an internal context, unique to each individual, that incorporates the individual's experience, prior knowledge, and feelings. Four cognitive approaches were identified: a flexible approach that adapts to other cultures, an open approach with no prior framework, a nomadic approach that actively seeks diverse avenues of access, and a holistic approach that attempts to bring diverse areas together. Interaction among the core activities was cumulative, reiterative, holistic, and context-bound.

238
Indicators of Accuracy for Answers to Ready Reference Questions on the Internet
Martin Frické and Don Fallis
Published online 19 November 2003

Frické and Fallis explore the validity of proposed indicators of the accuracy of ready-reference information found on Web sites. Using 49 of the 60 questions previously used by Connell and Tipple, they ran AltaVista searches to identify potential answer sites; the first five sites that actually answered each question were chosen and then evaluated for answer accuracy and checked for the presence of the proposed indicators. A Google search followed, yielding these and at most five additional sites. Each site was manually scored as completely accurate, partially accurate, partially inaccurate, or completely inaccurate and checked for owner entity type, recency of update, presence of advertising, a copyright claim, appeals to authority, and any awards for quality, as well as for its rank position in the search results, its Google PageRank (0-10), and the number of in-links found with the AltaVista link command. Contingency tables were formed and chi-square tests used to detect possible associations; likelihood ratios for the presence and absence of individual indicators and of indicator pairs were also computed. Of 300 sites that answered the questions, 214 were judged completely accurate and only 25 inaccurate. High display position, high Google PageRank, currency, a copyright claim, and in-link count all yield chi-square probabilities below .05, suggesting a relationship to accuracy.
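
The statistical machinery is simple enough to sketch: cross-tabulate one indicator against accuracy, run a chi-square test, and compute likelihood ratios for the indicator's presence and absence. The counts in the table below are invented for illustration (only the column totals echo the 214 accurate and 25 inaccurate sites reported), and scipy is assumed.

```python
# Hypothetical contingency table for one indicator (e.g., a copyright claim).
# Rows: indicator present / absent; columns: answer accurate / inaccurate.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[150, 8],     # indicator present
                     [ 64, 17]])   # indicator absent

chi2, p, dof, _ = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}, dof = {dof}")

# Likelihood ratios: P(indicator state | accurate) / P(indicator state | inaccurate).
accurate_total, inaccurate_total = observed.sum(axis=0)
lr_present = (observed[0, 0] / accurate_total) / (observed[0, 1] / inaccurate_total)
lr_absent = (observed[1, 0] / accurate_total) / (observed[1, 1] / inaccurate_total)
print(f"LR(present) = {lr_present:.2f}, LR(absent) = {lr_absent:.2f}")
```

A likelihood ratio above 1 for an indicator's presence means that seeing the indicator should raise one's confidence that the answer is accurate; a ratio near 1 means the indicator carries little evidential weight.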

246
The Effects of Domain Knowledge on Search Tactic Formulation
Barbara M. Wildemuth
Published online 13 November 2003

Wildemuth is interested in whether a growing understanding of the knowledge domain covered by a database affects the sequence of searching moves (tactics) used by medical students searching that database. Two random samples were drawn from entering medical school classes, excluding students with advanced science degrees and those whose undergraduate degree was in microbiology, the topic of the database. Each subject was asked to address six specific clinical problems, each involving several specific questions: first, prior to any instruction in microbiology, yielding a 12.6% success rate; second, directly after the microbiology course, yielding 48.1%; and finally, six months after the course, yielding 27.3%. In each instance subjects answered from their own knowledge and then searched the database on a question they had answered incorrectly. The nearly 1,300 searches were recorded in transaction logs and hand coded according to an adaptation of the Shute and Smith scheme covering beginning moves, reduction moves, expansion moves, and term replacement. A transition matrix recording the frequency of transitions from each coded move to every other coded move was built and used to produce a graphic representation of the transitions that accounted for at least 1% of all occurrences. Maximal repeating patterns of moves were also extracted and the most frequently occurring retained. The most common pattern was the entry of a new concept followed by the addition of one or more concepts prior to displaying results. The number of moves decreased with experience, and database use improved performance at all three levels of experience.
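
The transition analysis is straightforward to reproduce in outline: count how often each coded move follows each other move, then keep the transitions above the 1% threshold. The move codes and example sequences below are invented placeholders, not the adapted Shute and Smith scheme or the study's data.

```python
# Count move-to-move transitions across coded searches and report those
# accounting for at least 1% of all transitions.
from collections import Counter

# Hypothetical move codes and coded search sequences.
searches = [
    ["NEW_CONCEPT", "ADD_CONCEPT", "DISPLAY"],
    ["NEW_CONCEPT", "ADD_CONCEPT", "ADD_CONCEPT", "DISPLAY", "REDUCE", "DISPLAY"],
    ["NEW_CONCEPT", "REPLACE_TERM", "DISPLAY"],
]

transitions = Counter(
    (seq[i], seq[i + 1]) for seq in searches for i in range(len(seq) - 1)
)
total = sum(transitions.values())

for (src, dst), n in transitions.most_common():
    share = n / total
    if share >= 0.01:  # the 1% display threshold used for the graphic
        print(f"{src:13s} -> {dst:13s} {n:2d} ({share:.0%})")
```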

259
A Graph Model for E-Commerce Recommender Systems
Zan Huang, Wingyan Chung, and Hsinchun Chen
Published online 14 November 2003

Huang, Chung, and Chen are interested in maximizing the value of the product and usage information available from online transactions, both for those who supply material and for those who interact with it. Such information needs to be represented flexibly, since different recommendation approaches are typically used to build recommender systems that find associations between users and items and use the discovered associations to recommend additional items to users. They implement a two-layer graph model with users and items as nodes in separate layers and with transactions and similarities as weighted links recording relative similarity between nodes. Activating only the links within the item layer yields a content-based approach; activating user-layer and inter-layer links yields a collaborative approach; and activating all links yields a hybrid approach. Within this model, a direct retrieval approach recommends items similar to those used previously by a user or by similar users, while a collaborative recommendation forms a list of similar users, based either on past common item selections or on common demographics, and recommends those users' past selections. An association-mining method was also applied under each of the three approaches, each generating a different set of association rules, with transitive associations explored through Hopfield-net spreading activation as an option to overcome sparse user ratings. Testing used a Chinese online bookstore data set of 9,695 books, 2,000 customers, and 18,771 transactions. Books and customers were described as feature vectors, similarity measures were computed, and each customer's purchase list was split in half, with recommendations generated from the first half and evaluated against the second using recall- and precision-type measures; pairwise t-tests were then applied. The hybrid approach was the best performer, but the spreading activation approach did not significantly outperform association mining or direct retrieval.
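
The two-layer graph and the spreading-activation option can be sketched compactly: users and books are nodes, purchases and similarities are weighted links, and activation spreads outward from a target user until the most strongly activated unpurchased books become the recommendations. The tiny graph, weights, decay factor, and transfer function below are illustrative assumptions, not the paper's model or parameters.

```python
import numpy as np

# A toy two-layer graph: three users and three books.
nodes = ["u1", "u2", "u3", "b1", "b2", "b3"]
idx = {n: i for i, n in enumerate(nodes)}
W = np.zeros((len(nodes), len(nodes)))

def link(a, b, w):
    """Add a symmetric weighted link between two nodes."""
    W[idx[a], idx[b]] = W[idx[b], idx[a]] = w

# Inter-layer links are purchases; intra-layer links are similarities.
link("u1", "b1", 1.0); link("u1", "b2", 1.0)
link("u2", "b2", 1.0); link("u2", "b3", 1.0)
link("u3", "b1", 1.0)
link("b1", "b2", 0.4)   # content similarity between books
link("u1", "u2", 0.3)   # demographic/behavioral similarity between users

def spread(start, steps=5, decay=0.2):
    """Spreading activation with a simple clipping transfer function."""
    a = np.zeros(len(nodes))
    a[idx[start]] = 1.0
    for _ in range(steps):
        a = np.clip(a + decay * (W @ a), 0.0, 1.0)
    return a

activation = spread("u3")
candidates = [n for n in nodes if n.startswith("b") and W[idx["u3"], idx[n]] == 0]
recommended = sorted(candidates, key=lambda n: -activation[idx[n]])
print(recommended)  # books u3 has not bought, ranked by activation
```

Restricting activation to book-book links would mimic the content-based variant, while activating every link type corresponds to the hybrid configuration that performed best in the evaluation.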

Book Reviews

275
Mining the Web: Discovering Knowledge From Hypertext Data, by Soumen Chakrabarti
Chaomei Chen
Published online 20 November 2003

276
The Library's Legal Answer Book, by Mary Minow and Tomas A. Lipinski
Kenneth Einar Himma
Published online 18 November 2003

278
The Internet in Everyday Life, edited by Barry Wellman and Caroline Haythornthwaite
Pramod K. Nayar
Published online 14 November 2003



Association for Information Science and Technology
8555 16th Street, Suite 850, Silver Spring, Maryland 20910, USA
Tel. 301-495-0900, Fax: 301-495-0810 | E-mail: asis@asis.org

Copyright © 2003, Association for Information Science and Technology