Journal of the Association for Information Science and Technology

Index
Table of Contents

Volume 55, Issue 7

In This Issue

563

In this issue
Bert Boyce

Research

565

Information Retrieval by Metabrowsing
F. Wiesman, H.J. van den Herik, and A. Hasman
Published online 17 February 2004

Wiesman, van den Herik, and Hasman consider six difficulties in information retrieval: expression of information need, communication of that need to the system, implicit inter-human communication, indexing consistency, reliability of retrieved items, and the need of the searcher for five distinct knowledge types (system procedural, domain, search strategy, indexing policy, and search tactics). Since humans are good at recognizing relevance but not at describing it, browsing can overcome these difficulties. In particular they suggest metabrowsing, the browsing of information about documents' domain, contents, location, and relations to other documents, rather than of the documents themselves. They represent the domain with a simplified version of the Unified Medical Language System and use 36,000 Medline records from 1995 as documents, each linked to the domain file by its assigned primary or secondary index terms. A key term is chosen from an alphabetical list, its preferred term is substituted, and a window is opened around this preferred term in the domain. Related terms may be added to this window, with arcs indicating the relation type and clickable definitions. A new screen gives sub-terms of the chosen term and links to documents so indexed. A document's other terms can be displayed or its content presented. Bookmarking, backtracking, and a history list provide for reorientation, if needed. A test group of 24 second- and fourth-year medical students used the system and WinSpires on the same file with three questions designed by domain experts, who also evaluated the retrieved documents. Overall, there was no significant difference in effectiveness or user satisfaction, and the system was less efficient for fourth-year students, who also were more satisfied with WinSpires.
 

579

Improving Performance of Text Categorization by Combining Filtering and Support Vector Machines
Irene Díaz, José Ranilla, Elena Montañés, Javier Fernández, and Elías F. Combarro
Published online 20 February 2004

Díaz et alia believe that text categorization, the automatic classification of documents reduced to weighted stem counts and, in this case, assigned to categories by a Support Vector Machine (SVM), can be improved by feature reduction techniques despite the SVM's unique capability of handling large feature spaces. They compare the effect of term frequency, inverse document frequency, and information gain as reduction techniques on expert-classed collections: the Reuters-21578 corpus and three subsets of the OHSUMED Medline collection, using fixed training sets and parameters for the SVM. They define precision as the number of true positives over the sum of the true and false positives, and recall as the number of true positives over the sum of the true positives and false negatives, and they use van Rijsbergen's combined measure with equal weights. The filtering has no effect on precision, but all methods provide a significant improvement in recall, and thus in the combined measure, over unreduced text. Information gain is the best performer at aggressive filtering levels.
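
The precision and recall definitions above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the authors' code: the function name and the example counts are hypothetical, and the combined measure is written here as F (the complement of van Rijsbergen's E), with beta = 1 standing for equal weights.

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F) from raw classification counts.

    tp, fp, fn are the numbers of true positives, false positives, and
    false negatives. beta = 1.0 weights precision and recall equally.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical counts: 8 correct assignments, 2 wrong ones, 8 missed.
p, r, f = precision_recall_f(tp=8, fp=2, fn=8)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

With equal weights the combined measure is the harmonic mean of precision and recall, so a gain in recall at constant precision raises it, which matches the pattern reported in the abstract.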
 

593

A Formal Knowledge Management Ontology: Conduct, Activities, Resources, and Influences
C.W. Holsapple and K.D. Joshi
Published online 25 February 2004

Holsapple and Joshi develop an ontology, or set of definitions and axioms, which can be used to characterize knowledge management (KM) as a discipline. The goal is to identify and express the knowledge manipulation activities that fall within that domain. They begin by setting the conditions for their design, namely, that their result applies to business settings, describes KM phenomena, and captures concepts at two or more levels of detail. They then collect KM case studies, surveys, and articles as a source for terminology, and choose terms over multiple iterations until they are satisfied as to helpfulness, comprehensiveness, and unification. A panel of 31 KM researchers and practitioners, consulted by questionnaire, reviewed the four initial components of the framework (conduct, resources, knowledge manipulation, and KM influences) for completeness, accuracy, clarity, and conciseness, and reviewed the whole for utility, comprehensiveness, unification, and limitations. The resulting revision, along with a summary of comments, was again sent to the panel and evaluated by questionnaire, and the process was repeated until no further revision occurred. Ninety-four percent of panelists were at least moderately satisfied with the ontology. Eighty-one percent felt the ontology was at least moderately successful in providing a unified and comprehensive view. Sixty percent considered the result to be either helpful or extremely helpful to researchers, and 70% felt it was at least moderately helpful to practitioners.
 

613

An Entropy-Based Interpretation of Retrieval Status Value-Based Retrieval, and Its Application to the Computation of Term and Query Discrimination Value
Sándor Dominich, Júlia Góth, Tamás Kiezer, and Zoltán Szlávik
Published online 5 February 2004

Dominich et alia show that any Retrieval Status Value (RSV) based retrieval model can be seen as a probability space where the amount of associated Shannon-type information is decreased by retrieval operations, that is to say, as an Uncertainty Decreasing Operation (UDO) probability space. Thus, a term's discrimination value can be based upon its reduction of the UDO space entropy, rather than upon its reduction of Euclidean space as in the vector space model, and term discrimination values become available to any RSV system. The term discrimination values (TDV) for an 82-document ADI test collection that gave 915 terms, time stop listed and Porter stemmed, were computed by each method. About half the terms using UDO have a 100% TDV, and each such term has a positive vector space based TDV, indicating agreement on good discrimination. Most of the terms with UDO based TDV between 80% and 100% have positive vector space based TDVs, while those between 40% and 80% have near-zero vector space discrimination values. UDO may be used to compute a discrimination value for queries, and such values were computed for 35 ADI test queries. The fewer relevant answers a query has, the higher its discrimination value was found to be, except for query 27, where all terms have very high document frequencies and the query is extremely general. Retrieval tests on ADI using both weights indicate that UDO weights enhance precision at recall levels above 50%, but perform equally at lower recall levels. Tests of various similarity measures on three additional databases show that dot product reduces entropy to the greatest extent and that cosine produces the least entropy reduction. The use of normalized frequency weighting reduces entropy to the greatest extent, while lack of normalization gave the least entropy reduction. UDO is faster and less restrictive.
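
To make the idea of entropy reduction concrete, the sketch below computes the Shannon entropy of a set of RSVs after normalizing them into a probability distribution. The normalization step, the function name, and the example scores are assumptions made for illustration; the abstract does not say how the authors construct their probability space.

```python
import math


def rsv_entropy(rsvs):
    """Shannon entropy (in bits) of RSVs normalized to sum to 1.

    Assumption for illustration only: each document's probability is its
    RSV divided by the sum of all RSVs. Zero scores contribute nothing.
    """
    total = sum(rsvs)
    if total <= 0:
        return 0.0
    probs = [v / total for v in rsvs if v > 0]
    return -sum(p * math.log2(p) for p in probs)


# Hypothetical scores: before retrieval every document is equally likely;
# afterwards the scores favor some documents over others.
before = rsv_entropy([1, 1, 1, 1])
after = rsv_entropy([4, 2, 1, 1])
print(before, after, before - after)
```

A uniform distribution carries maximum entropy, so any scoring that separates documents lowers it; under this reading, the size of the drop is what a discrimination value would measure.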
 

628

The Effects of Fitness Functions on Genetic Programming-Based Ranking Discovery for Web Search
Weiguo Fan, Edward A. Fox, Praveen Pathak, and Harris Wu
Published online 17 February 2004

Fan et alia find fitness function design important in the improvement of Genetic Programming (GP) based ranking functions for Web retrieval. Candidate ranking functions are represented as individuals in a GP population tree structure and evolved to find those with better fitness values. Average precision, which does not preserve rank order information, has been the common and reasonably effective fitness function, but other possibilities may improve performance. The ideal utility function preserves rank order information and is non-linear, with high values for documents ranked at the top of the list and quickly losing value as the rank increases. Four functions are designed to meet these requirements. Chang and Kwok and Lopez-Pujalte et alia each provide functions that preserve rank order information, with the Lopez-Pujalte function incorporating negative values for nonrelevant documents. As an experimental baseline, the Okapi BM25 ranking formula is used with the TREC 10GB collection of 1.69 million documents and 100 queries from TREC 9 and TREC 10 in a vector space format. The fitness function in use had a noticeable effect on performance, with three of the new functions showing strong improvement.
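
As a rough illustration of what a fitness function over a ranked list looks like, the sketch below computes average precision in its usual form alongside a simple utility that decays with rank. The reciprocal-rank utility is a hypothetical stand-in chosen for brevity; it is not one of the four functions designed in the paper, nor the Chang and Kwok or Lopez-Pujalte function. Both functions assume that every relevant document appears somewhere in the list.

```python
def average_precision(relevance):
    """Average precision of a ranked list of 0/1 relevance judgments.

    Assumes all relevant documents appear somewhere in the list.
    """
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / hits if hits else 0.0


def reciprocal_rank_utility(relevance):
    """Hypothetical non-linear utility: each relevant document earns 1/rank."""
    return sum(1.0 / rank for rank, rel in enumerate(relevance, start=1) if rel)


good = [1, 0, 1, 0]  # relevant documents at ranks 1 and 3
poor = [0, 1, 0, 1]  # relevant documents at ranks 2 and 4
print(average_precision(good), average_precision(poor))
print(reciprocal_rank_utility(good), reciprocal_rank_utility(poor))
```

In a GP setting, a value like either of these would be computed for each candidate ranking function over the training queries and used to select individuals for the next generation.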
 

637

Query Association Surrogates for Web Search
Falk Scholer, Hugh E. Williams, and Andrew Turpin
Published online 25 February 2004

Scholer, Williams, and Turpin construct document surrogates by supplementing existing document texts with terms from queries that ranked these documents in the top N (thirty-nine) of a retrieved list based upon the Okapi BM25 similarity measure, and limiting such supplementation to M (nineteen) queries per document. When the set limits are reached, new query terms with higher similarity measures can supplant those in existence. However, only terms that appear in the document as well as the associated query may be added to the surrogate, so that it is the weight of these terms that changes in the document surrogate. They also create surrogates that are a set of such query terms without the original document surrogate. The 1.69 million Web documents of TREC WT10g make up the experimental collection, which is searched for title word strings (stop listed but not stemmed) from 50 queries each from TREC-9 and TREC-2001 without relevance feedback. Queries for creation of supplements came from some 900,000 logged Excite queries. Query association improved mean average precision by 4.3%, and mean average precision at 10 by 7%. Adding anchor terms has no effect on queries that did well, but it reduces the performance of those below the baseline even further. Query term surrogates without full text are 6% less effective under average precision at 10 than text alone. Query associations did not appear helpful for named page finding, and a dynamic parameter setting for M and N does not lead to improvement.
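
The bookkeeping behind the N and M limits can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the min-heap used to hold each document's associated queries, and the toy scores are all hypothetical, and the similarity scores are simply passed in rather than computed with Okapi BM25.

```python
import heapq
from collections import defaultdict


def associate(associations, query, ranked, n, m):
    """Associate `query` with the top-n documents it retrieved.

    `ranked` is a list of (doc_id, score) pairs, best first. Each document
    keeps at most m queries; once full, a new query replaces the weakest
    stored one only if its score is higher.
    """
    for doc_id, score in ranked[:n]:
        heap = associations[doc_id]  # min-heap of (score, query)
        if len(heap) < m:
            heapq.heappush(heap, (score, query))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, query))


def surrogate_terms(doc_terms, heap):
    """Query terms eligible for the surrogate: those also in the document."""
    return [t for _, q in heap for t in q.split() if t in doc_terms]


associations = defaultdict(list)
associate(associations, "cheap flights", [("d1", 1.0), ("d2", 0.5)], n=1, m=1)
associate(associations, "flights paris", [("d1", 2.0)], n=1, m=1)
print(associations["d1"])
print(surrogate_terms({"flights", "hotel"}, associations["d1"]))
```

A min-heap keeps the weakest association at the front, so the replacement rule described in the abstract costs only a comparison against a single stored entry.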
 

 

Book Reviews

651

A History of Online Information Services, 1963-1976, by Charles P. Bourne and Trudi Bellardo Hahn
Derek G. Smith
Published online 6 February 2004
 

652

Research Questions for the Twenty-first Century, edited by Mary Jo Lynch
Lydia Eato Harris
Published online 23 February 2004



Association for Information Science and Technology
8555 16th Street, Suite 850, Silver Spring, Maryland 20910, USA
Tel. 301-495-0900, Fax: 301-495-0810 | E-mail: asis@asis.org

Copyright © 2004, Association for Information Science and Technology