Journal of the Association for Information Science and Technology

Table of Contents

Volume 54, Issue 4

SPECIAL TOPIC ISSUE

287

MATHEMATICAL METHODS IN INFORMATION RETRIEVAL

Mathematical, Logical and Formal Methods in Information Retrieval: An Introduction to the Special Issue
Fabio Crestani, Sandor Dominich, Mounia Lalmas, and Cornelis Joost van Rijsbergen

Research on the use of mathematical, logical, and formal methods has been central to Information Retrieval research for a long time. Research in this area is important not only because it helps enhance retrieval effectiveness, but also because it helps clarify the underlying concepts of Information Retrieval. In this article we outline some of the major aspects of the subject, and summarize the papers of this special issue with respect to how they relate to these aspects. We conclude by highlighting some directions of future research that are needed to better understand the formal characteristics of Information Retrieval.

291

Embedding Term Similarity and Inverse Document Frequency into a Logical Model of Information Retrieval  
David E. Losada and Alvaro Barreiro

This article is a clear expression of the authors' belief that the statistical techniques used in classical IR should be merged with Knowledge Representation (KR) methods. They propose and discuss an extension of their earlier logical model of IR to deal with similarity relationships between terms and inverse document frequency weights, with the aim of enhancing retrieval effectiveness. Given a set of index terms and a collection of documents, each document is represented as a DNF Boolean expression of index terms, which can consist of a single clause (i.e., the document is represented as one conjunction of terms) or several clauses (e.g., separate clauses for title and abstract). The query is also represented as a DNF Boolean expression of its terms. A measure of similarity between a query and a document is derived from the number of terms that they share. This model may be viewed as a binary weighted vector model using logical formulas rather than logical truth values. Its main advantage is that it allows users to specify not only what they want to appear in a document but also what they do not want to occur (via negated terms in the query). Two extensions are elaborated: one incorporating similarity relationships between terms, and another making use of inverse document frequency weights. The retrieval algorithms are presented in great detail, and it is shown that they are tractable, as opposed to an earlier algorithm, which proved to run in exponential time. Experiments were carried out for both cases: incorporating term similarities and applying inverse document frequencies. In the first case, the EMIM (Expected Mutual Information Measure) was used to express term similarities; results show an improvement of 6-9% in average precision over the model without term similarity. In the second extension, results show an improvement of 33-38% in average precision. A third extension using both term similarity and inverse document frequency is also evaluated; the improvement in average precision varies between 28% and 36%. The results obtained with the extended models are compared with those of SMART. When the model is extended with inverse document frequencies, the recall-precision values of SMART and the extended model are very close to each other.
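
As a rough illustration of the kind of matching the model formalizes, the sketch below scores a document by the idf-weighted number of terms it shares with the query. The function names and the set-based document representation are illustrative assumptions; the article's actual model operates on DNF Boolean formulas rather than plain term sets.

    import math

    def idf(term, collection):
        # Inverse document frequency over a toy collection of term sets.
        df = sum(1 for doc in collection if term in doc)
        return math.log(len(collection) / df) if df else 0.0

    def similarity(query_terms, doc_terms, collection):
        # Score a document by the idf-weighted count of terms it shares
        # with the query; a stand-in for the article's logical matching.
        shared = query_terms & doc_terms
        return sum(idf(t, collection) for t in shared)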

307

Implementing Relevance Feedback in the Bayesian Network Retrieval Model
Luis M. de Campos, Juan M. Fernandez-Luna, and Juan F. Huete  

This article proposes and discusses relevance feedback techniques for the Bayesian Network IR (BNIR) model. After a formal description of the BNIR model, a relevance feedback methodology is defined. Based on an initial set of documents retrieved in response to the query, the user feeds new evidence back into the network, a new query is built, and new posterior probabilities are calculated. The user's feedback is incorporated into the network as partial knowledge using Pearl's algorithm for polytrees. Terms can be reweighted, depending on whether they occur in relevant or nonrelevant documents, and the influence of term reweighting is discussed. The expansion of queries with new terms is also examined in detail. Experiments observing the behaviour of the suggested relevance feedback techniques are reported. The influence of the feedback method used (term reweighting, query expansion, and both) is evaluated with the Residual Collection method. The results obtained are comparable with those obtained with other models, but perhaps the most interesting observation made by the authors is that effectiveness is very sensitive to the quality of the initially retrieved set of documents.
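
The sketch below shows feedback-driven term reweighting in its simplest form: boost terms seen in judged-relevant documents and penalise terms seen only in nonrelevant ones. It is a generic, Rocchio-style stand-in, not the article's Bayesian propagation; the names and the boost/penalty constants are invented for illustration.

    def reweight(query, relevant, nonrelevant, boost=0.5, penalty=0.25):
        # query maps terms to weights; relevant/nonrelevant are lists of
        # term sets for the documents the user has judged.
        updated = dict(query)
        rel_terms = {t for doc in relevant for t in doc}
        nonrel_terms = {t for doc in nonrelevant for t in doc}
        for term in updated:
            if term in rel_terms:
                updated[term] += boost
            elif term in nonrel_terms:
                updated[term] = max(0.0, updated[term] - penalty)
        return updated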

319

Unitary Operators on the Document Space
Eduard Hoenkamp

This article is a "sympathetic example of an artistic metaphor fertilising scientific research": the search space is viewed as a gray-scale image that can be characterized by a few major brush strokes. The author's starting point is the classical vector space model of IR, in which documents are conceived as vectors in the linear space of terms, and relevance is estimated by the traditional cosine measure. Using the LSI (Latent Semantic Indexing) technique, the dimension of this space, and thus computational cost, can be considerably reduced, because LSI uses SVD (Singular Value Decomposition) to map documents onto the much lower-dimensional space spanned by the leading singular vectors of the term-by-document matrix. The article explores and proposes computationally more economical alternatives to SVD. The cosine measure defines a Hilbert space on documents; hence, operators that preserve clusters are of interest. More exactly, such operators should obey a series of properties: they should be unitary (to preserve the cohesion of documents), and they should reduce dimension and computational complexity (cost). It is argued that several such operators are well known in Signal Processing, for example, the Fourier transform, JPEG, and wavelets. The term-by-document matrix is viewed as a 2D gray-scale image, with weights expressing degrees of grayness. The author examines the application of Hilbert spaces and of a particular wavelet, the Haar transform. Based on the surprisingly simple observation that any two numbers can be represented by their average incremented/decremented by the same amount (e.g., the numbers 1 and 5 are represented by their average 3 plus/minus 2), the author shows how the Haar transform can be obtained for any document vector. The computational complexity of this technique is linear, whereas that of SVD is quadratic.
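
A minimal sketch of the pairwise average/difference recursion just described, assuming an input vector whose length is a power of two; this is the standard unnormalised Haar transform, not code from the article.

    import numpy as np

    def haar_transform(vector):
        # Unnormalised Haar transform of a vector whose length is a power
        # of two. Each pass replaces pairs (a, b) with their average
        # (a + b) / 2 and half-difference (a - b) / 2, e.g. (1, 5) -> (3, -2);
        # the total work over all passes is linear in the input length.
        v = np.asarray(vector, dtype=float).copy()
        n = len(v)
        while n > 1:
            half = n // 2
            avg = (v[0:n:2] + v[1:n:2]) / 2.0
            diff = (v[0:n:2] - v[1:n:2]) / 2.0
            v[:half], v[half:n] = avg, diff
            n = half
        return v

Applied to the columns of the term-by-document matrix, small difference coefficients can then be discarded to reduce dimension, analogously to truncating an SVD.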

326

Towards a Theory of Context Sensitive Information Inference
D. Song and P.D. Bruza

This article is a promising example of enhancing the vector space model with indexing based on rules and logical inference. Information is represented by symbolic tokens, and inference is a sequential process based on rules. Information can be represented on several layers: symbolic (propositions), conceptual (geometric), and associationist (connectionist). To implement the conceptual level, a Hyperspace Analogue to Language (HAL) model can be used, in which each word is conceived as a vector of weights corresponding to other words. Using HAL vectors, a computational model of conceptual space is defined. A concept is defined as a vector of weights corresponding to its properties, whereas the space consists of concepts. Each concept is assumed to have some dominant properties that obtain higher weights. Informational inference is formalized using Barwise and Seligman's definition of information flow. Concepts can be combined on their dominant properties to obtain a new concept. Experiments using several TREC topics are reported, in which query terms are ranked according to their tf/idf weights, and the dominance of one term over another is expressed by the difference of the corresponding weights. These terms are combined starting from the most dominant term and proceeding downwards. The resulting query vector is matched against documents, also represented as tf/idf vectors. The results show that this model outperforms the Markov chain-based query model, and it is promising in that its effectiveness is comparable to that of probabilistic relevance models.
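
To make the HAL representation concrete, here is a toy construction of HAL-style vectors from a token stream. The distance weighting (closer neighbours count more) follows the usual HAL recipe, but the function name, window size, and exact weighting are illustrative assumptions.

    from collections import defaultdict

    def hal_vectors(tokens, window=5):
        # Each word accumulates a weight for every word that precedes it
        # within the sliding window; a neighbour at distance d contributes
        # window - d + 1, so closer words weigh more.
        vectors = defaultdict(lambda: defaultdict(float))
        for i, word in enumerate(tokens):
            for d in range(1, window + 1):
                if i - d < 0:
                    break
                vectors[word][tokens[i - d]] += window - d + 1
        return vectors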

340

Query Expansion and Query Translation as Logical Inference
Jian-Yun Nie

This article proposes a unified formal framework for two IR problems that are usually seen as distinct: Cross-Language IR (CLIR) and Query Expansion (QE). The main problem of CLIR is query translation, whereas that of QE is to appropriately expand the query with new terms. The article proposes a model, referred to as inferential IR, and shows that query translation is a special case of QE. Inferential IR is based on the idea that a document implies a query in a given context and with some degree of (un)certainty. QE can be viewed as an inference in that we are looking for terms that imply the original query terms. The case in which both documents and queries are represented as Boolean expressions of terms is discussed in detail. The problem of estimating the degree of uncertainty of an inference is treated using fuzzy techniques and probability calculus.
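
A toy rendering of this view of expansion: if relations[(u, t)] estimates the certainty that term u implies query term t, expansion keeps the implications that are certain enough. The data structure, threshold, and names are assumptions made for illustration, not the article's estimation machinery.

    def expand_query(query_terms, relations, threshold=0.3):
        # Start from the original terms at full certainty, then add every
        # term whose implication strength toward some query term clears
        # the threshold, keeping that strength as the new term's weight.
        expanded = {t: 1.0 for t in query_terms}
        for (u, t), strength in relations.items():
            if t in query_terms and u not in expanded and strength >= threshold:
                expanded[u] = strength
        return expanded

For query translation the same machinery would apply unchanged, with relations holding cross-language translation certainties instead of monolingual term associations, which is the sense in which translation is a special case of expansion.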

352

Enhancing Retrieval with Hyperlinks: A General Model Based on Propositional Argumentation Systems
Justin Picard and Jacques Savoy

One of the main characteristics of the World Wide Web is that its pages can be hyperlinked, a feature that can, and should, influence retrieval effectiveness. This article proposes a formal model for this based on a logical technique referred to as Probabilistic Argumentation Systems (PAS). In PAS, the inability of classical propositional logic to handle uncertainty is removed by introducing special new propositions, called assumptions, which express the conditions on which the propositions (rules) depend. A formalism for PAS is described that allows for improving the original ranking of documents provided by a search engine. A discussion of Web page popularity measures is presented, and it is shown how applying PAS techniques can enhance these. The "visibility equals popularity" philosophy used by Google's PageRank algorithm has next to nothing to do with quality, which would only be captured if user preferences were taken into account; this could be formally modeled in PAS. Drawbacks of the Kleinberg algorithm for finding hubs and authorities are presented, together with possible solutions based on PAS techniques.
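
For reference, the popularity measure being critiqued can be sketched in a few lines; this is the standard PageRank power iteration with the usual 0.85 damping factor, not the article's PAS formalism.

    import numpy as np

    def pagerank(adjacency, damping=0.85, iterations=50):
        # adjacency[i][j] = 1 when page i links to page j. Each page splits
        # its rank evenly over its out-links; damping mixes in a uniform
        # jump so the iteration converges to a stationary distribution.
        adj = np.asarray(adjacency, dtype=float)
        n = adj.shape[0]
        out = adj.sum(axis=1)
        out[out == 0] = 1.0  # avoid division by zero for dangling pages
        transition = adj / out[:, None]
        rank = np.full(n, 1.0 / n)
        for _ in range(iterations):
            rank = (1.0 - damping) / n + damping * (transition.T @ rank)
        return rank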

Book Reviews

361

The Creation and Persistence of Misinformation in Shared Library Catalogs: Language and Subject Knowledge in a Technical Era.  David Bade. Urbana, IL: Graduate School of Library and Information Science, University of Illinois; April 2002; 33 pp. Price: $8.00 (ISBN: 0-87845-120-X).
Shirley Lincicum 

363

Digital Creativity: Techniques for Digital Media and the Internet.  Bruce Wands. New York: John Wiley & Sons, Inc.; 2002; 326 pp. Price: $54.95 (ISBN: 0-471-39057-7).
James L. Van Roekel

364

Current Theory in Library and Information Science.  Library Trends, Winter 2002, 50(3), 309-574. Edited by William E. McGrath. Urbana, IL: Graduate School of Library and Information Science, University of Illinois; Price: $25.00.
Donald O. Case

365

Models for Library Management, Decision-Making, and Planning.  Robert M. Hayes. San Diego, CA: Academic Press; 2001; 278 pp. Price: $99.95 (ISBN: 0-12-334151-5).
Scott Nicholson

Association for Information Science and Technology
8555 16th Street, Suite 850, Silver Spring, Maryland 20910, USA
Tel. 301-495-0900, Fax: 301-495-0810 | E-mail: asis@asis.org

Copyright © 2003, Association for Information Science and Technology