|
|
|
In This Issue
|
|
187
|
In this issue Bert Boyce
|
|
|
Research
|
|
189
|
Arabic Morphological Analysis Techniques: A Comprehensive Survey Imad A. Al-Sughaiyer and Ibrahim A. Al-Kharashi Published online 20 November 2003
Al-Sughaiyer and Al-Kharashi define standard linguistic terms as they are used in Arabic analysis and identify efficiency, compactness, bi-directionality, success rate, and retrieval performance as the measures of effectiveness for morphological analysis algorithms. After reviewing the Arabic morphological analysis literature, they classify the approaches as table lookup (heavy construction demands and space requirements), linguistic (requiring a large number of lists and the removal of affixes by trial and error), combinatorial (large space and time requirements), or rule based (the authors' choice), and they summarize the work in each area. Most of the work is linguistic in nature, but existing approaches have rarely been compared, and evaluation of the suggested algorithms is weak. Use of a word's root (the single basic morpheme) in an Arabic index leads to invalid conflation.
|
|
214
|
Predicting Library of Congress Classifications From Library of Congress Subject Headings Eibe Frank and Gordon W. Paynter Published online 28 October 2003
Frank and Paynter attempt to assign LC Classification number ranges to INFOMINE documents on the basis of their assigned LCSH headings in order to provide a better browsing capability. They claim that retrospective assignment of a class number is logistically impossible for the librarians who already assign LCSH terms to INFOMINE records, so they have devised a machine-learning technique to create the classification number. Rather than creating virtual LCSH documents to represent each LC class and using similarity measures to assign documents, they use a support vector machine classifier to determine which of the top 21 nodes is most likely, and then apply classifiers at each successive level until a leaf is reached or the classifier chooses the current node itself. They use LCSH terms without subdivisions, together with intervals from the LCC outline available on the Pharos Web site, both processed and extracted from existing MARC records, to create a training set. A set of 868,836 records was drawn from the UC Riverside library catalog, with 50,000 items reserved for testing and the remainder used for training at different levels. Accuracy increases with training set size, from 32% at 10,000 records to 55% at 800,000, but with diminishing returns. Less than 7% of errors are due to the classifier terminating too early or too late. With the large test set, 80% of classification decisions are correct at the first level and 16% at the seventh level. The learning algorithm scales on the order of n^1.7 for n instances, and test processing proceeds at a rate of 21 instances per second.
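The top-down descent described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy hierarchy, the keyword sets, and the `choose` function, which stands in for the per-node support vector machine classifiers and is not the authors' implementation. It shows only the control flow: at each node a classifier either picks a child or stops at the current node, and descent ends at a leaf.

```python
from typing import Dict, List, Set

# Hypothetical toy hierarchy (not the real LCC outline): node -> children.
HIERARCHY: Dict[str, List[str]] = {
    "ROOT": ["Q", "Z"],
    "Q": ["QA", "QC"],
    "QA": ["QA75-76"],
    "QA75-76": [],
    "QC": [],
    "Z": [],
}

# Hypothetical keyword sets used by the stand-in classifier below.
KEYWORDS: Dict[str, Set[str]] = {
    "Q": {"science", "mathematics", "computer science", "physics"},
    "Z": {"libraries", "bibliography"},
    "QA": {"mathematics", "computer science"},
    "QC": {"physics"},
    "QA75-76": {"computer science"},
}

STOP = "<stop>"  # the classifier "chooses the current node itself"


def choose(node: str, headings: Set[str]) -> str:
    """Stand-in for a per-node classifier: return the child whose keyword
    set overlaps the headings most, or STOP if no child matches."""
    best, best_score = STOP, 0
    for child in HIERARCHY[node]:
        score = len(KEYWORDS[child] & headings)
        if score > best_score:
            best, best_score = child, score
    return best


def classify(headings: Set[str]) -> List[str]:
    """Descend from the root until a leaf is reached or the classifier stops."""
    node, path = "ROOT", []
    while HIERARCHY[node]:  # a leaf has no children
        choice = choose(node, headings)
        if choice == STOP:
            break
        node = choice
        path.append(node)
    return path
```

With these toy data, a record carrying only the heading "science" stops after one level, while "computer science" descends to the leaf. Stopping too early or too late corresponds to the termination errors mentioned above.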
|
|
228
|
A Nonlinear Model of Information-Seeking Behavior Allen Foster Published online 11 November 2003
Foster disagrees with the conventional view of information seeking as a linear process of identifiable stages and iteration, particularly as applied to interdisciplinary information-seeking behavior, and he proposes a nonlinear model based on identifying the processes, contexts, and behaviors of such interdisciplinary activity and their relationships. In-depth structured interviews conducted in the workplace were used to collect data on searching examples provided by the subjects. Subjects were purposively selected from multiple faculties of the University of Sheffield for their interdisciplinary research and then used as the kernel for a snowball sampling expansion, resulting in 45 participants from diverse faculties. Transcripts were coded over multiple iterations, and the final results were confirmed by participant review. Activities viewed in conjunction with time lines did not support a linear stages model. The new model groups activities into three core categories: "opening" (moving from orientation to actual search), "orientation" (identifying existing research and a direction for search), and "consolidation" (refining and knowing when to stop). These operate within the boundaries of an "external context," which incorporates social and organizational influences; time, project, and access constraints; and navigational issues. Within the external context lies an internal context, which incorporates the individual's experience, prior knowledge, and feelings, all of which are unique to the individual. Four cognitive approaches were identified: a flexible approach that is adaptable to other cultures, an open approach with no prior framework, a nomadic approach that actively seeks diverse ways of access, and a holistic approach that attempts to bring diverse areas together. Interaction among the core activities was cumulative, reiterative, holistic, and context-bound.
|
|
238
|
Indicators of Accuracy for Answers to Ready Reference Questions on the Internet Martin Frické and Don Fallis Published online 19 November 2003
Frické and Fallis explore the validity of proposed indicators of the accuracy of ready reference information found on Web sites. Using 49 of the 60 questions previously used by Connell and Tipple, they ran AltaVista searches to identify potential answer sites; the first five sites that actually answered each question were chosen, evaluated for answer accuracy, and checked for the presence of indicators of accuracy. A Google search followed, yielding these and at most five additional sites. Each site was manually scored as completely accurate, partially accurate, partially inaccurate, or completely inaccurate and checked for owner entity type, recency of update, presence of advertising, copyright claim, appeal to authority, and the presence of any awards for quality, as well as its ranked position in the search engine results, its Google PageRank (0 to 10), and the number of in-links found with the AltaVista link command. Contingency tables were formed, and chi-square tests were used to detect possible association. Likelihood ratios for the presence and absence of indicators and indicator pairs were also computed. Of 300 sites that answered the questions, 214 were judged completely accurate and only 25 inaccurate. High display position, high Google PageRank, currency, copyright, and in-link count all yield a chi-square probability of less than .05, suggesting a relationship to accuracy.
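For readers unfamiliar with the two statistics, the sketch below computes a Pearson chi-square statistic and presence/absence likelihood ratios for a single 2x2 contingency table (indicator present or absent against answer accurate or inaccurate). The cell layout and the function names are assumptions made for illustration, and the counts in the usage note are invented, not taken from the study. For one degree of freedom, a statistic above 3.841 corresponds to a probability below .05.

```python
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for a 2x2 table.

    a: indicator present, site accurate
    b: indicator present, site inaccurate
    c: indicator absent,  site accurate
    d: indicator absent,  site inaccurate
    Assumes every row and column total is nonzero.
    """
    n = a + b + c + d
    observed = ((a, b), (c, d))
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


def likelihood_ratios(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Return (LR for presence, LR for absence) of the indicator.

    LR(present) = P(present | accurate) / P(present | inaccurate)
    LR(absent)  = P(absent  | accurate) / P(absent  | inaccurate)
    Assumes b and d are nonzero so the denominators are defined.
    """
    accurate, inaccurate = a + c, b + d
    lr_present = (a / accurate) / (b / inaccurate)
    lr_absent = (c / accurate) / (d / inaccurate)
    return lr_present, lr_absent
```

As an invented example, calling `chi_square_2x2(30, 10, 20, 40)` gives a statistic well above 3.841, and `likelihood_ratios(30, 10, 20, 40)` gives a presence ratio near 3 and an absence ratio near 0.5, meaning that in this made-up table the indicator is about three times as likely to appear on an accurate site as on an inaccurate one.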
|
|
246
|
The Effects of Domain Knowledge on Search Tactic Formulation Barbara M. Wildemuth Published online 13 November 2003
Wildemuth asks whether a growing understanding of the knowledge domain covered by a database affects the sequence of searching moves (tactics) used by medical students searching that database. Two random samples were drawn from entering medical school classes, excluding students with advanced science degrees and those whose undergraduate degree was in microbiology, the topic of the database. Each student was asked to address six specific clinical problems involving several specific questions on three occasions: first, prior to any instruction in microbiology, resulting in a 12.6% success rate; second, directly after the microbiology course, resulting in a 48.1% success rate; and finally, six months after the course, resulting in a 27.3% success rate. On each occasion subjects were asked to respond from their own knowledge and then to search the database for a question for which they had provided an incorrect response. The nearly 1,300 searches were recorded in transaction logs and hand coded according to an adaptation of the Shute & Smith scheme incorporating beginning moves, reduction moves, expansion moves, and term replacement. A transition matrix showing the frequency of transitions from each coded move to every other coded move was created and used to produce a graphic representation of the transitions that accounted for at least 1% of all occurrences. Maximal repeating patterns of moves were also extracted, and the most frequently occurring were retained. The most common pattern was the entry of a new concept followed by the addition of one or more concepts prior to display. The number of moves decreased with experience. Database use improved performance at all three levels of experience.
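The transition analysis described above can be sketched as follows. This is an illustration under stated assumptions, not the study's procedure: the move codes in the usage note are invented labels rather than the adapted Shute & Smith codes, and the function names are hypothetical. The first function counts how often one coded move is immediately followed by another across all searches; the second keeps only the transitions that account for at least a given share (1% by default) of all transitions.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

Transition = Tuple[str, str]


def transition_counts(searches: Iterable[Sequence[str]]) -> Counter:
    """Count each (move, next move) pair over all coded searches."""
    counts: Counter = Counter()
    for moves in searches:
        # Pair every move with its immediate successor in the same search.
        for current, following in zip(moves, moves[1:]):
            counts[(current, following)] += 1
    return counts


def frequent_transitions(counts: Counter,
                         threshold: float = 0.01) -> Dict[Transition, float]:
    """Return the transitions whose share of all transitions is >= threshold."""
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()
            if n / total >= threshold}
```

For example, the two invented searches `["new", "add", "display"]` and `["new", "add", "add", "display"]` produce five transitions in total, of which "new" to "add" and "add" to "display" each occur twice. Representing the matrix sparsely as a counter of pairs keeps the sketch short; a dense move-by-move table could be built from the same counts.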
|
|
259
|
A Graph Model for E-Commerce Recommender Systems Zan Huang, Wingyan Chung, and Hsinchun Chen
Published online 14 November 2003
Huang, Chung, and Chen are interested in maximizing the value of the product and usage information available from online transactions for those who supply material and those who interact with that supplied material. Such information needs to be represented in a flexible manner, since different recommendation approaches are typically used to create recommender systems that find associations between users and items and use the discovered associations to recommend additional items to previous users. A two-layer graph model is implemented with users and items as nodes in separate layers and with transactions and similarities as links. Nodes are kept as relative similarity measures to other nodes. If links in the item layer are activated, the approach is content-based. If user and inter-layer links are activated, the approach is collaborative, and activating all links gives a hybrid approach. A direct retrieval approach retrieves items similar to those used previously by a user or by similar users. A collaborative recommendation forms a list of similar users, based either on past common item selection or on common demographics, and recommends the past selections of the users on that list. An association mining method was used with the three approaches, each generating a different set of association rules, with transitive rules in a Hopfield net available as an option to overcome sparse user ratings. Testing used a data set from a Chinese online bookstore containing records for 9,695 books, 2,000 customers, and 18,771 transactions. Books and customers were described as feature vectors from which similarity measures were computed, and customers' purchase lists were halved to provide a predicted set from the first half, allowing recall- and precision-type measures. Pairwise t-tests were then applied. The hybrid approach was the best performer, but the spreading activation approach did not significantly outperform the association mining approach or direct search.
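A minimal sketch of the two-layer idea follows, assuming a simple direct-retrieval scoring rule that is not taken from the paper. The class name, method names, and link weights are all hypothetical. Item-layer similarity links drive the content-based mode, user-layer links combined with purchase (inter-layer) links drive the collaborative mode, and activating all links gives the hybrid mode.

```python
from collections import defaultdict
from typing import Dict, List, Set


class TwoLayerGraph:
    """Users and items as nodes in separate layers; purchases link the layers."""

    def __init__(self) -> None:
        self.item_links: Dict[str, Dict[str, float]] = defaultdict(dict)
        self.user_links: Dict[str, Dict[str, float]] = defaultdict(dict)
        self.purchases: Dict[str, Set[str]] = defaultdict(set)

    def add_item_link(self, a: str, b: str, weight: float) -> None:
        self.item_links[a][b] = weight
        self.item_links[b][a] = weight

    def add_user_link(self, a: str, b: str, weight: float) -> None:
        self.user_links[a][b] = weight
        self.user_links[b][a] = weight

    def add_purchase(self, user: str, item: str) -> None:
        self.purchases[user].add(item)

    def recommend(self, user: str, mode: str = "hybrid") -> List[str]:
        """Rank unseen items for a user. mode: content, collaborative, hybrid."""
        owned = self.purchases.get(user, set())
        scores: Dict[str, float] = defaultdict(float)
        if mode in ("content", "hybrid"):
            # Item-layer links: items similar to what the user already bought.
            for item in owned:
                for other, weight in self.item_links.get(item, {}).items():
                    scores[other] += weight
        if mode in ("collaborative", "hybrid"):
            # User-layer plus inter-layer links: purchases of similar users.
            for peer, weight in self.user_links.get(user, {}).items():
                for item in self.purchases.get(peer, set()):
                    scores[item] += weight
        candidates = [item for item in scores if item not in owned]
        # Highest score first; ties broken alphabetically for determinism.
        return sorted(candidates, key=lambda item: (-scores[item], item))
```

In an invented example where user u1 bought b1, user u2 bought b1 and b3, b1 is linked to b2 with weight 0.9, and u1 is linked to u2 with weight 0.5, the content mode recommends b2, the collaborative mode recommends b3, and the hybrid mode recommends both. The association mining and Hopfield net spreading activation methods mentioned above are not modeled in this sketch.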
|
|
|
Book Reviews
|
|
275
|
Mining the Web: Discovering Knowledge From Hypertext Data, by Soumen Chakrabarti Chaomei Chen Published online 20 November 2003
|
|
276
|
The Library's Legal Answer Book, by Mary Minow and Tomas A. Lipinski Kenneth Einar Himma Published online 18 November 2003
|
|
278
|
The Internet in Everyday Life, edited by Barry Wellman and Caroline Haythornthwaite Pramod K. Nayar Published online 14 November 2003
|
|
|