JASIST Table of Contents

Journal of the Association for Information Science and Technology



In This Issue
Bert R. Boyce





A Unified Maximum Likelihood Approach to Document Retrieval
    David Bodoff, Daniel Enache, Ajit Kambil, Gary Simon, and Alex Yukhimets
    Published online 21 June 2001

In the use of feedback data in a retrieval system, it is unclear whether documents or queries should be adjusted. We can easily visualize moving document vectors toward query vectors, or vice versa, but the data supply only relative proximity, and a combined approach is hard to visualize in either a probabilistic or a vector space model. Bodoff et al. suggest maximizing a multidimensional scaling (MDS) function that incorporates the cosines between document and query vectors judged relevant, the cosines between ideal and original document vectors, and the cosines between ideal and original query vectors. This MDS approach causes the affected documents and queries all to move relative to one another. However, the proper ideal vectors for the components are not clear. A simulation indicates that the maximum likelihood approach leads to improved estimates of both ideal document and ideal query vectors, although the new query estimates have no continuing value for future retrieval, and that the method is robust even with a small number of relevance assessments per document.
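The flavor of this joint adjustment can be sketched numerically. The objective below is a toy stand-in for the authors' MDS function, not their estimator: a sum of cosines pulling judged-relevant document–query pairs together while anchoring each "ideal" vector to its original, maximized by crude finite-difference gradient ascent. The weights `alpha` and `beta`, the step size, and all data are invented for illustration.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mds_objective(D_star, Q_star, D0, Q0, rel_pairs, alpha=1.0, beta=1.0):
    """Sum of cosines: relevant (doc, query) pairs are pulled together,
    while each ideal vector stays close to its original."""
    score = sum(cos(D_star[i], Q_star[j]) for i, j in rel_pairs)
    score += alpha * sum(cos(D_star[i], D0[i]) for i in range(len(D0)))
    score += beta * sum(cos(Q_star[j], Q0[j]) for j in range(len(Q0)))
    return score

# Toy data: 3 documents, 2 queries, 4-term vocabulary.
rng = np.random.default_rng(0)
D0 = rng.random((3, 4))
Q0 = rng.random((2, 4))
rel = [(0, 0), (2, 1)]  # judged-relevant (document, query) pairs

# Crude ascent: nudge the ideal vectors along a finite-difference gradient,
# so documents and queries all move relative to one another.
D, Q = D0.copy(), Q0.copy()
eps = 1e-5
for _ in range(50):
    for M in (D, Q):
        g = np.zeros_like(M)
        for idx in np.ndindex(M.shape):
            M[idx] += eps
            hi = mds_objective(D, Q, D0, Q0, rel)
            M[idx] -= 2 * eps
            lo = mds_objective(D, Q, D0, Q0, rel)
            M[idx] += eps
            g[idx] = (hi - lo) / (2 * eps)
        M += 0.05 * g
```

After the ascent, the adjusted vectors score higher on the objective than the originals, illustrating how both sides of the feedback data move at once.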











Information Discovery from Complementary Literatures: Categorizing Viruses as Potential Weapons
    Don R. Swanson, Neil R. Smalheiser, and A. Bookstein
    Published online 22 June 2001

To Swanson, Smalheiser, and Bookstein, the half million virus records in Medline, 320,000 with virus subject headings and the same number with virus disease headings, with an intersection of 120,000, suggest a fragmented and unmanageable literature. To identify potential biological warfare viruses, subject heading searches created a set of documents on both the genetics and the pathogenicity of viruses; a set on the survivability, viability, and stability of viruses; a set on aerosol transmission of viral diseases; and finally a set on transmission by insect vectors. These sets have an empty intersection. The pathogenicity set (pathogenicity being a biological warfare requirement) intersected with each of the other three yielded only 14 documents. However, documents could exist that exhibit the characteristics of the pathogenicity set as they apply to a specific virus, but not the characteristics of the other sets, while documents on the same virus might exist in any of the other three sets but not in the pathogenicity set, since those characteristics were not investigated in those papers. One would need awareness of both sorts of papers to associate the virus with both sets of characteristics, but could not use an unknown virus name to bring them together. One needs the virus names common to the two sets.

The Arrowsmith software takes two sets as input and produces a list of terms common to both and when run on the pathogenetic set and each of the other three will produce a list of common subject headings for each. From this list the virus names can be culled and listed indicating which and how many set pairs produced them. Such a list provides the raw material for human evaluation of candidate viruses. To test the technique such a list was compared to a previously produced list of possible BW viruses and was found to contain most of these, with a significant relationship inferring the new virus names to be similar to the old. It would appear that a virus can be classified as a potential weapon based upon literature structures associated with it.
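The core set operation is simple to sketch. The fragment below is a toy version of the Arrowsmith idea, not the actual software: intersect the terms of two literatures, then cull candidate virus names from the shared terms. All document headings and virus names are invented for illustration.

```python
# Each document is represented by its set of subject headings.
pathogenicity_docs = [
    {"virus X", "virulence", "genetics"},
    {"virus Y", "pathogenicity", "lethal dose"},
]
stability_docs = [
    {"virus X", "aerosol stability"},
    {"virus Z", "viability"},
]

def common_terms(docs_a, docs_b):
    """Terms appearing somewhere in both document sets."""
    return set().union(*docs_a) & set().union(*docs_b)

# A controlled vocabulary of virus names lets us cull candidates
# from the shared headings.
known_viruses = {"virus X", "virus Y", "virus Z"}

shared = common_terms(pathogenicity_docs, stability_docs)
candidates = shared & known_viruses  # names linking the two literatures
```

Here `candidates` contains only `virus X`, the one virus named in both literatures, even though no single document discusses both its pathogenicity and its stability.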

















Predicting the Relevance of a Library Catalog Search
    Michael D. Cooper and Hui-Min Chen
    Published online 22 June 2001

By relevance of a session, Cooper and Chen mean that during a user's interaction with a Web-based OPAC the user saves, prints, mails, or downloads a citation. Thus an active choice of one or several citations for such a purpose is taken to mark those citations as relevant, and the session as a whole as successful or relevant. The number of such choices in a session is a variable that could indicate the degree of relevance of the session, but here relevance is treated as a binary condition. A good logging facility makes such information available, although it will underreport to some extent, since the user may take notes by hand or use the print-screen key, which may not be logged.

They then look at a number of observable variables, and at averages and proportions derived from them, that together quantitatively describe session activity. These are reduced through principal components analysis and then regressed logistically against the binary relevance indicator to see whether they can predict the relevance of a session. Over nine hundred thousand Melvyl Web-based sessions were divided into 10 equal strata using the relevance indicator. Nine of these were regressed, and the coefficients used to form a regression equation predicting the outcome in the tenth stratum. The process was run 10 times, reserving a different stratum on each run, to determine whether the predictions would be consistent. About 18% of the sessions were relevant, versus a prediction of 11%. Relevant sessions are more than twice as long as nonrelevant sessions, view more than twice as many pages, and also use more databases and more indexes.
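The analysis pipeline — principal components reduction followed by logistic regression on a binary relevance flag — can be sketched on synthetic data. Everything below is invented (the feature names, coefficients, and sample size bear no relation to the Melvyl logs); it only shows the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Synthetic session features: length, pages viewed, databases, indexes used.
X = rng.normal(size=(n, 4))
# Toy ground truth: longer sessions with more pages tend to be relevant.
logit = 1.5 * X[:, 0] + 1.0 * X[:, 1] - 1.0
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)

# Principal components via SVD of the centred feature matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T  # keep the first two components

# Logistic regression on the components, fitted by gradient descent.
Z1 = np.hstack([np.ones((n, 1)), Z])  # add intercept column
w = np.zeros(3)
for _ in range(2000):
    p = 1 / (1 + np.exp(-Z1 @ w))
    w -= 0.1 * Z1.T @ (p - y) / n

pred = 1 / (1 + np.exp(-Z1 @ w)) > 0.5
accuracy = float((pred == y).mean())
```

In the article the fit is done on nine strata and the resulting equation is used to predict the reserved tenth, repeated ten times; the sketch above fits and scores one sample only.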















Visual-Based Retrieval Systems and Web Mining: Introduction
    S.S. Iyengar
    Published online 19 June 2001

A current trend in the proliferation of computing technologies throughout society is the development, manipulation, and storage of multimedia information.  Images, video, and sound are now considered first-class data types.  Web mining and image retrieval techniques discover patterns and relationships by constructing predictive and descriptive computational models.  A relatively new aspect of information/decision fusion problems in emerging applications is the requirement of fast and efficient computation of information fusion operations.  This special issue is composed of articles focused on Web content mining, artificial neural networks as tools for image retrieval, content-based image retrieval systems, and personalizing the Web browsing experience using media agents.  The articles display the diversity of image retrieval algorithms, both in theory and in practice.










Web Mining for Web Image Retrieval
    Zheng Chen, Liu Wenyin, Feng Zhang, Mingjing Li, and Hongjiang Zhang
    Published online 19 June 2001

Chen et al. present an effective approach for image retrieval from the Internet using Web mining techniques.  The novelty of the method is that the system can also serve as a Web search engine.  The key idea is to extract the text information on Web pages to semantically describe the images they contain.
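The idea of describing an image by the page text around it can be sketched with the standard-library HTML parser. This is a toy illustration of the principle, not the authors' system: each `<img>` is indexed under its `alt` text plus the most recent block of page text, and the example page and file names are invented.

```python
from html.parser import HTMLParser

class ImageTextIndexer(HTMLParser):
    """Index each <img> by its alt text plus the nearest preceding
    page text, as a crude semantic description of the image."""

    def __init__(self):
        super().__init__()
        self.index = {}       # image src -> descriptive terms
        self._last_text = ""  # most recent non-empty text run

    def handle_data(self, data):
        text = data.strip()
        if text:
            self._last_text = text

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            terms = " ".join(
                filter(None, [a.get("alt", ""), self._last_text]))
            self.index[a.get("src", "")] = terms

page = '<p>Mount Fuji at dawn</p><img src="fuji.jpg" alt="volcano">'
indexer = ImageTextIndexer()
indexer.feed(page)
```

After feeding the page, `indexer.index["fuji.jpg"]` holds the text description `"volcano Mount Fuji at dawn"`, which a text search engine could match against keyword queries — which is why such a system can double as a Web search engine.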






Content-Based Image Retrieval and Information Theory: A General Approach
    John Zachary, S.S. Iyengar, and Jacob Barhen
    Published online 20 June 2001

Among the many visual features that have been studied, the distribution of color pixels in an image is the most common.  Zachary et al. propose a theoretical foundation for image entropy and a practical description of the merits and limitations of image entropy compared to color histograms.  Their results suggest that image entropy is a promising approach to image description and representation.
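Image entropy in this sense is the Shannon entropy of the pixel-intensity histogram. The sketch below is a minimal illustration of that definition (the two test images are invented): a uniform image has zero entropy, while an image using every intensity equally has the maximum 8 bits for 256 levels.

```python
import numpy as np

def image_entropy(pixels, bins=256):
    """Shannon entropy of the pixel-intensity histogram, in bits."""
    hist, _ = np.histogram(pixels, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

flat = np.full((8, 8), 128)             # uniform image: one occupied bin
varied = np.arange(256).reshape(16, 16) # every intensity exactly once
```

Unlike a 256-bin color histogram, this collapses the distribution to a single scalar, trading discriminating power for compactness.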







A Media Agent for Automatically Building a Personalized Semantic Index of Web Media Objects
    Liu Wenyin, Zheng Chen, Mingjing Li, and Hongjiang Zhang
    Published online 15 June 2001

This article describes a multimedia approach to solving image retrieval problems.






Information-Theoretic Similarity Measures for Content-Based Image Retrieval
    John Zachary and S.S. Iyengar
    Published online 21 June 2001

Content-based image retrieval is based on the idea of extracting visual features from images and using them to index images in a multimedia database.  The comparisons that determine similarity between images depend on the representations of the features and the definition of appropriate distance functions.  This article by Zachary and Iyengar proposes similarity measures and an indexing algorithm based on information theory that permit an image to be represented as a single number.  When used in conjunction with feature vectors, the method offers improved efficiency when querying large databases.
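One way such a single-number index can speed up queries is as a prefilter: rank the database by entropy difference first, and compute full feature-vector distances only for the few nearest candidates. The sketch below illustrates that idea with invented data; it is not the authors' indexing algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

def entropy(hist):
    """Shannon entropy (bits) of a normalized histogram."""
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Each database image is summarized by (entropy, full histogram).
db = {}
for name in ("a", "b", "c", "d"):
    h = rng.random(16)
    db[name] = (entropy(h), h)

def search(query_hist, k=2):
    """Entropy prefilter: keep the k images whose scalar entropy is
    closest to the query's, then compare only those by full
    histogram distance."""
    hq = entropy(query_hist)
    candidates = sorted(db, key=lambda n: abs(db[n][0] - hq))[:k]
    return min(candidates,
               key=lambda n: np.linalg.norm(db[n][1] - query_hist))

best = search(db["c"][1])  # querying with image c's own histogram
```

The scalar comparison is cheap, so most of the database is eliminated before any vector distance is computed; only the surviving candidates pay the full cost.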








Web Image Retrieval Using Self-Organizing Feature Map
    Qishi Wu, S. Sitharama Iyengar, and Mengxia Zhu
    Published online 18 June 2001

Wu et al. present a neural network approach to an effective retrieval process using a Self-Organizing Feature Map (SOFM) neural network algorithm.








International Summer School on the Digital Library

In this issue we begin with the Frohlich and Resler report on the evaluation of research productivity at the University of Texas Institute for Geophysics by bibliometric indicators, using a four-part categorization of journals into mainstream, archival, proceedings, and other, and five different journal half-life computations. The half-life computations are drawn from work on earthquakes, which, like citations, do not decay strictly exponentially, requiring formulations that avoid assuming such behavior. Weighting to emphasize points with more numerous and reliable data seems called for in both instances, and in both the discrete reporting period can bias the results. This implies ignoring very high and very low magnitude data, such as are found in the first year or two of citation and in data older than some cutoff; weighting data with more observations more strongly when fitting; and perhaps correcting for the interval problem. A sequence of rates by year will yield a median age of citation. Dividing 10 log 2 by the difference in the logs of two rates, commonly the second- and twelfth-year rates, gives a two-point half-life. If one determines the slope of the same 11 points, weighting each point by the number of articles used to determine the rate, then negative log 2 divided by this slope provides a weighted least-squares half-life. If the sum of the products of the rates and their times in years is divided by the sum of the rates, we get a rate-averaged time, and the maximum likelihood half-life is the natural log of 2 times the difference between the rate-averaged time and the minimum rate time. This relatively easily computed method is in common use in earthquake analysis. A ratio method is also possible, in which the sum of the rates for a given period of years is divided by the sum of the rates for an equal following period.
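The formulas above can be checked on toy citation rates with a known half-life. The data below are an invented exact exponential (half-life 4 years), so the median-age method, which matters for non-exponential data, is omitted. Note that the maximum-likelihood estimate comes out lowest on truncated data, consistent with the ordering the report describes.

```python
import numpy as np

# Toy citation rates per year after publication, years 1..12,
# decaying exponentially with a true half-life of 4 years.
years = np.arange(1, 13)
rates = 100 * 0.5 ** (years / 4.0)

# Two-point half-life: 10 log 2 over the difference of the logs of
# the year-2 and year-12 rates.
r2, r12 = rates[1], rates[11]
two_point = 10 * np.log(2) / (np.log(r2) - np.log(r12))

# Least-squares half-life: fit log-rate vs. year over years 2-12,
# then -log 2 / slope.  (Equal weights here; in practice each point
# is weighted by the number of articles behind its rate.)
slope = np.polyfit(years[1:], np.log(rates[1:]), 1)[0]
lsq = -np.log(2) / slope

# Maximum-likelihood half-life: ln 2 times (rate-averaged time minus
# the minimum rate time).  Truncating the data at 12 years biases
# this estimate low.
t_avg = (rates * years).sum() / rates.sum()
ml = np.log(2) * (t_avg - years.min())

# Ratio method: sum of rates for years 1-6 over the sum for years
# 7-12; for a 6-year offset, half-life = 6 log 2 / log(ratio).
ratio = rates[:6].sum() / rates[6:].sum()
ratio_h = 6 * np.log(2) / np.log(ratio)
```

On exact exponential data the two-point, least-squares, and ratio methods all recover the true 4-year half-life, while the maximum-likelihood value lands below it, showing why half-lives are comparable only when computed the same way.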

In all categories the maximum likelihood method gave the lowest half-lives and the two-point method the highest. All methods found the shortest half-life for the other category. The range for mainstream journals was from 5.25 to 8.15 years. Forty-three percent of the publications appeared in mainstream journals, but they drew 71% of all citations. Differences in half-lives are meaningful only if calculated in the same manner and fairly large, say a factor of two. The ratio method is sound where the data fit a decay model; the median-life method is useful where they do not, if sufficient data over time are available.


















Association for Information Science and Technology
8555 16th Street, Suite 850, Silver Spring, Maryland 20910, USA
Tel. 301-495-0900, Fax: 301-495-0810 | E-mail:

Copyright 2001, Association for Information Science and Technology