

SPECIAL TOPIC ISSUE: MATHEMATICAL METHODS IN INFORMATION RETRIEVAL

287

Mathematical, Logical and Formal Methods in Information Retrieval: An Introduction to the Special Issue
Fabio Crestani, Sandor Dominich, Mounia Lalmas, and Cornelis Joost van Rijsbergen

Research on the use of mathematical, logical, and formal methods has been central to Information Retrieval research for a long time. Research in this area is important not only because it helps enhance retrieval effectiveness, but also because it helps clarify the underlying concepts of Information Retrieval. In this article we outline some of the major aspects of the subject and summarize the papers of this special issue with respect to how they relate to these aspects. We conclude by highlighting some directions of future research that are needed to better understand the formal characteristics of Information Retrieval.
291

Embedding Term Similarity and Inverse Document Frequency into a Logical Model of Information Retrieval
David E. Losada and Alvaro Barreiro

This article is a clear expression of the authors' belief that the statistical techniques used in classical IR should be merged with Knowledge Representation (KR) methods. They propose and discuss an extension of their earlier logical model of IR to deal with similarity relationships between terms and with inverse document frequency weights, with the aim of enhancing retrieval effectiveness. Given a set of index terms and a collection of documents, each document is represented as a DNF Boolean expression of index terms, which can consist of a single clause (i.e., the document is represented as one conjunction of terms) or of several clauses (e.g., separate clauses for title and abstract). The query is also represented as a DNF Boolean expression of its terms. A measure of similarity between a query and a document is derived from the number of terms that they share. This model may be viewed as a binary weighted vector model using logical formulas rather than logical truth values. Its main advantage is that it allows users to specify not only what they want to appear in a document but also what they do not want to occur (via negated terms in the query). Two extensions are elaborated: one incorporating similarity relationships between terms, and another making use of inverse document frequency weights. The retrieval algorithms are presented in great detail, and it is shown that they are tractable, as opposed to an earlier one, which proved to run in exponential time. Experiments were carried out for both cases: incorporating term similarities and applying inverse document frequencies. In the first case, the EMIM (Expected Mutual Information Measure) was used to express term similarities. Results show an increase of 6-9% in average precision compared to the average precision of the model without term similarity. In the second extension of the model, results show an enhancement of 33-38% in average precision. A third extension using both term similarity and inverse document frequency is also evaluated; the increase in average precision varies between 28% and 36%. The results obtained using the extended models are compared with those of SMART. When the model is extended using inverse document frequencies, the recall-precision values of both SMART and the extended model are very close to each other.
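The clause-overlap idea behind such a model can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' exact similarity measure: documents and queries are DNF expressions (lists of clauses, each a set of literals, with "~" marking negation), and similarity counts shared literals while penalizing contradicted ones.

```python
# Hypothetical sketch of a DNF term-overlap similarity (not the paper's exact measure).
# A document or query is a DNF expression: a list of clauses, each clause a set of
# literals; a leading "~" marks a negated term.

def clause_overlap(doc_clause, query_clause):
    """Shared literals minus contradictions between two conjunctive clauses."""
    shared = len(doc_clause & query_clause)
    # A query literal whose negation appears in the document clause contradicts it.
    contradictions = sum(
        1 for lit in query_clause
        if (lit[1:] if lit.startswith("~") else "~" + lit) in doc_clause
    )
    return shared - contradictions

def similarity(document, query):
    """Best overlap over all pairs of document and query clauses."""
    return max(clause_overlap(d, q) for d in document for q in query)

doc = [{"retrieval", "logic", "model"},     # clause for the title
       {"retrieval", "boolean", "~image"}]  # clause for the abstract
qry = [{"retrieval", "logic"}]              # one-clause query
print(similarity(doc, qry))                 # -> 2: both query terms occur in the first clause
```

Negated query terms would lower the score of documents asserting those terms, which is the "what the user does not want" behaviour described above.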
307

Implementing Relevance Feedback in the Bayesian Network Retrieval Model
Luis M. de Campos, Juan M. Fernandez-Luna, and Juan F. Huete

This article proposes and discusses relevance feedback techniques for the Bayesian Network IR (BNIR) model. After a formal description of the BNIR model, a relevance feedback methodology is defined. Based on an initial set of documents retrieved for his or her query, the user feeds new evidence back into the network, a new query is built, and new posterior probabilities are calculated. The user's feedback is incorporated into the network as partial knowledge using Pearl's algorithm for polytrees. Terms can be reweighted, depending on whether they occur in relevant or nonrelevant documents, and the influence of term reweighting is discussed. The expansion of queries with new terms is also examined in detail. Experiments are reported that observe the behaviour of the suggested relevance feedback techniques. The influence of the feedback method used (term reweighting, query expansion, and both) is evaluated with the Residual Collection method. The results obtained are comparable with those obtained with other models, but perhaps the most interesting observation made by the authors is that effectiveness is very sensitive to the quality of the initially retrieved set of documents.
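As a point of comparison for the term-reweighting step, the classical Rocchio formulation (a standard technique, plainly not the article's Bayesian network propagation) boosts terms from relevant documents and penalizes terms from nonrelevant ones:

```python
# Generic Rocchio-style term reweighting, shown only as a familiar baseline for
# relevance feedback; the article itself propagates feedback through a Bayesian
# network with Pearl's algorithm, which is not reproduced here.
from collections import Counter

def reweight(query, relevant_docs, nonrelevant_docs,
             alpha=1.0, beta=0.75, gamma=0.15):
    """Documents and queries are bags (Counters) of term weights."""
    new = Counter({t: alpha * w for t, w in query.items()})
    for doc in relevant_docs:
        for t, w in doc.items():
            new[t] += beta * w / len(relevant_docs)
    for doc in nonrelevant_docs:
        for t, w in doc.items():
            new[t] -= gamma * w / len(nonrelevant_docs)
    # Drop terms pushed to non-positive weight by nonrelevant evidence.
    return Counter({t: w for t, w in new.items() if w > 0})

q = Counter({"retrieval": 1.0})
rel = [Counter({"retrieval": 1, "feedback": 1})]
nonrel = [Counter({"image": 1})]
new_q = reweight(q, rel, nonrel)  # "feedback" is added, "image" is suppressed
```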

319 
Unitary Operators on the Document Space
Eduard Hoenkamp

This article is a "sympathetic example of an artistic metaphor fertilising scientific research": the search space is viewed as a grayscale image that can be characterized by a few major brush strokes. The author's starting point is the classical vector space model of IR, in which documents are conceived as vectors in the linear space of terms, and relevance is estimated by the traditional cosine measure. Using the LSI (Latent Semantic Indexing) technique, the dimension of this space, and thus the computational costs, can be considerably reduced, because LSI makes use of SVD (Singular Value Decomposition), which maps documents onto the much lower dimensional space of the eigenvectors of the term-by-document matrix. The article explores and proposes computationally more economical alternatives to SVD. The cosine measure defines a Hilbert space on documents; hence, operators that preserve clusters are of interest. More exactly, such operators should obey a series of properties: they should be unitary (to preserve the cohesion of documents), and they should reduce dimension and computational costs. It is argued that there are several such operators well known in Signal Processing, for example the Fourier transform, JPEG, and wavelets. The term-by-document matrix is viewed as a 2D grayscale image, with weights expressing degrees of grayness. The author examines the application of Hilbert spaces and of a special wavelet called the Haar transform. Based on the surprisingly simple observation that any two numbers can be represented by their average incremented/decremented by the same amount (e.g., the numbers 1 and 5 are represented by their average 3 plus/minus 2, respectively), the author shows how the Haar transform can be obtained for any document vector. The computational complexity of this technique is linear, whereas that of SVD is quadratic.
326 
Towards a Theory of Context Sensitive Information Inference
D. Song and P.D. Bruza

This article is a promising example of enhancing the vector space model with indexing based on rules and logical inference. Information is represented by symbolic tokens, and inference is a sequential process based on rules. Information can be represented on several layers: symbolic (propositions), conceptual (geometric), and associationist (connectionism). To implement the conceptual level, a Hyperspace Analogue to Language (HAL) model can be used, where each word is conceived as a vector of weights corresponding to other words. Using HAL vectors, a computational model of conceptual space is defined. A concept is defined as a vector of weights corresponding to its properties, whereas the space consists of concepts. Each concept is assumed to have some dominant properties, which obtain higher weights. Informational inference is formalized using Barwise and Seligman's definition of information flow. Concepts can be combined on their dominant properties to obtain a new concept. Experiments using several TREC topics are reported, where query terms are ranked according to their tf/idf weights, and the dominance of a term over another is expressed by the difference of the corresponding weights. These terms are combined starting from the most dominant term and proceeding downwards. The resulting query vector is matched against documents, also represented as tf/idf vectors. The results show that this model outperforms the Markov chain-based query model, and it is promising in that its effectiveness is comparable to that of probabilistic relevance models.
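The HAL construction mentioned above can be sketched as a sliding-window co-occurrence count. This is a simplified, symmetric variant for illustration only (HAL proper keeps separate matrices for preceding and following context), with nearer neighbours receiving higher weights:

```python
# Simplified HAL-style word vectors: within a window of the given size, each
# word accumulates co-occurrence weight inversely related to distance.
# Symmetric variant for illustration; HAL proper is directional.
from collections import defaultdict

def hal_vectors(tokens, window=5):
    vectors = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                weight = window - d + 1  # closer words get higher weight
                vectors[word][tokens[i + d]] += weight
                vectors[tokens[i + d]][word] += weight
    return vectors

tokens = "information retrieval uses logical inference and information flow".split()
v = hal_vectors(tokens, window=2)
# v["information"] is that word's HAL-style vector over its neighbouring words
```

A concept's "dominant properties" then correspond to the highest-weighted entries of its vector.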
340

Query Expansion and Query Translation as Logical Inference
Jian-Yun Nie

This article proposes a unified formal framework for two IR problems that are usually seen as distinct: Cross-Language IR (CLIR) and Query Expansion (QE). The main problem of CLIR is query translation, whereas that of QE is to appropriately expand the query with new terms. The article proposes a model, referred to as inferential IR, and shows that query translation is a special case of QE. Inferential IR is based on the idea that a document implies a query in a given context and with some degree of (un)certainty. QE can be viewed as an inference in that we are looking for terms that imply the original query terms. The case in which both documents and queries are represented as Boolean expressions of terms is discussed in detail. The problem of estimating the degree of uncertainty of inference is treated using fuzzy techniques and probability calculus.
352 
Enhancing Retrieval with Hyperlinks: A General Model Based on Propositional Argumentation Systems
Justin Picard and Jacques Savoy

One of the main characteristics of the World Wide Web is that its pages can be hyperlinked, a feature that can, and should, influence retrieval effectiveness. This article proposes a formal model for this that uses a logical technique referred to as Probabilistic Argumentation Systems (PAS). In PAS, the inability of classical propositional logic to handle uncertainty is removed by introducing special new propositions, called assumptions, which express the conditions that the propositions (rules) depend on. A formalism for PAS is described that allows for improving the original ranking of documents provided by a search engine. A discussion of Web page popularity measures is presented, and it is shown how applying PAS techniques can enhance them. The "visibility equals popularity" philosophy used by the PageRank algorithm of Google has next to nothing to do with quality, which would only be incorporated if user preferences were taken into account; this could be formally modeled in PAS. Drawbacks of the Kleinberg algorithm for finding hubs and authorities are presented, together with possible solutions based on PAS techniques.
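For reference, the "visibility equals popularity" measure under discussion is the standard PageRank power iteration (shown here in its usual textbook form with a damping factor, not the article's PAS-enhanced variant):

```python
# Standard PageRank power iteration, as background for the discussion above.
# Not the article's PAS extension; link structure alone determines the ranking.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:  # dangling page: spread its rank over all pages
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

r = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
# Page "a", being the most linked-to, ends up with the highest rank
```

Note that nothing in the iteration inspects page content or user preferences, which is exactly the shortcoming the article argues PAS can address.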

Book Reviews 
361 
The Creation and Persistence of Misinformation in Shared Library Catalogs: Language and Subject Knowledge in a Technical Era. David Bade.
Urbana, IL: Graduate School of Library and Information Science, University of Illinois; April 2002: 33 pp. Price: $8.00. (ISBN: 087845120X.)
Shirley Lincicum 
363

Digital Creativity: Techniques for Digital Media and the Internet. Bruce Wands. New York: John Wiley & Sons, Inc., 2002. 326 pp. Price: $54.95 (ISBN: 0471390577).
James L. Van Roekel
364

Current Theory in Library and Information Science. Library Trends, Winter, 2002, 50(3), 309-574. Edited by William E. McGrath. Urbana, IL: Graduate School of Library and Information Science, University of Illinois; Price: $25.00.
Donald O. Case
365 
Models for Library Management, Decision-Making, and Planning. Robert M. Hayes. San Diego, CA: Academic Press; 2001; 278 pp. Price: $99.95 (ISBN: 0123341515.)
Scott Nicholson
