EDITORIAL

In This Issue
Bert R. Boyce
197

RESEARCH

An Algorithm for Term Conflation Based on Tree Structures
Irene Diaz, Jorge Morato, and Juan Llorens
Published online 28 December 2001
In this issue, Diaz et al. describe Normalizer, a conflation (stemming) algorithm that stores its prefixes and suffixes in tree structures to reduce the space required and the complexity. A comparison run against the Porter and Krovetz stemmers on five English text documents shows a reduction in error percentages. Thirty-six Spanish documents were also conflated, and the results were evaluated for the effectiveness of the different rule sets in use. Regular verb transformations were the most used and had the highest error rate. Irregular verb rules were the next most used, followed by regular noun rules.
199

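The idea of keeping suffix rules in a tree can be illustrated with a minimal sketch. This is not the Normalizer algorithm itself: the class name `SuffixTrie`, the `min_stem` parameter, and the toy rules in the usage note are assumptions made for illustration only. Suffixes are stored reversed, so a word is matched by walking the tree from its last character, and the longest stored suffix that still leaves a long enough stem is replaced.

```python
_END = ""  # marker key for "a stored suffix ends here"; never equal to a single character


class SuffixTrie:
    """Toy suffix-rule store (hypothetical; not the published Normalizer rules)."""

    def __init__(self):
        self.root = {}

    def add(self, suffix, replacement=""):
        # Insert the suffix back to front so lookups can start at the word's end.
        node = self.root
        for ch in reversed(suffix):
            node = node.setdefault(ch, {})
        node[_END] = replacement

    def conflate(self, word, min_stem=3):
        # Walk the tree along the reversed word, remembering the longest
        # suffix whose removal leaves at least `min_stem` characters.
        node = self.root
        best = None
        for depth, ch in enumerate(reversed(word), start=1):
            node = node.get(ch)
            if node is None:
                break
            if _END in node and len(word) - depth >= min_stem:
                best = (depth, node[_END])
        if best is None:
            return word
        depth, replacement = best
        return word[:-depth] + replacement
```

With the illustrative rules `ies -> y`, `es -> (nothing)`, `s -> (nothing)`, and `ing -> (nothing)`, the word "ponies" conflates to "pony", while "sing" is left alone because stripping "ing" would leave too short a stem.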
A New Model that Generates Lotka's Law
John C. Huber
Published online 28 December 2001
Huber assumes in his new bibliometric model that the author productivity rate follows the Generalized Pareto Distribution, that the author's career duration follows an exponential distribution, that productivity and career duration are related, and that publications are distributed across the career duration in a Poisson manner. By simulating authors' contributions over time, he is able to use the model to generate Lotka's law and to obtain good fits with empirical distributions.
209

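A simulation of this general kind can be sketched in a few lines. The sketch below is a simplification, not Huber's model: it draws the productivity rate and the career duration independently (the summary says they are related, but does not say how), and the function name, parameter names, and default values are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def simulate_author_counts(n_authors, xi=0.3, sigma=0.5, mean_career=5.0, seed=0):
    """Return a Counter mapping 'papers per author' to 'number of authors'.

    Assumed parameters: xi and sigma are the shape and scale of a
    Generalized Pareto Distribution with location 0; mean_career is the
    mean of the exponential career duration.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_authors):
        # Inverse-transform sample of the Generalized Pareto Distribution.
        u = rng.random()
        rate = sigma * ((1.0 - u) ** (-xi) - 1.0) / xi
        # Exponentially distributed career duration.
        duration = rng.expovariate(1.0 / mean_career)
        # Poisson process: count arrivals with exponential gaps until the career ends.
        papers = 0
        if rate > 0:
            t = rng.expovariate(rate)
            while t <= duration:
                papers += 1
                t += rng.expovariate(rate)
        counts[papers] += 1
    return counts
```

Tabulating `counts[n]` for n = 1, 2, 3, ... gives the kind of skewed distribution, with many one-paper authors and few prolific ones, that can then be compared against a Lotka-type curve.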
Collaborative Relevance Judgment: A Group Consensus Method for Evaluating User Search Performance
Xiangmin Zhang
Published online 28 December 2001
Zhang believes that, since research is often a collaborative process, the evaluation of retrieved search results in such cases should not be the subjective evaluation of a single searcher; rather, relevance judgment should be a collaborative evaluation. This is currently practiced in TREC's use of pooled assessments. He suggests that a group can be formed by any set of people with common interests represented by the searched questions. This group then pools all retrieved documents, or ideally all retrieved documents judged relevant by each searcher. This may mean simply the best retrieved set, not those in that set deemed relevant. Documents are weighted by the number of users who retrieved them, and the relevant set is considered to be those above some chosen threshold. A user's relevance score is then the sum of the weights of the documents retrieved divided by the number of items in the consensus set minus the number of documents the user retrieved, plus the number of not relevant documents the user retrieved. Zhang then conducts an experiment with 56 student volunteers in which educational level, academic background, computer experience, and native language were tested for their effect on search performance, as measured by the collaborative relevance score on the four questions each participant searched. The collaborative relevance scores appear to vary with standard recall and precision measures. Educational level had a significant effect, but being a native English speaker did not. Science/engineering students had higher scores than those in the social sciences and humanities. Differences in computer experience were not statistically significant.
220

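The pooling and thresholding step can be sketched as follows. This covers only the document weights and the consensus set, not the per-user score. The function name, the input layout (a mapping from user to retrieved document identifiers), and the choice to treat the threshold as inclusive are assumptions for illustration.

```python
from collections import Counter


def consensus_set(retrieved_by_user, threshold):
    """Weight each document by how many users retrieved it, then keep
    the documents whose weight reaches the threshold (assumed inclusive)."""
    weights = Counter(
        doc
        for docs in retrieved_by_user.values()
        for doc in set(docs)  # count each user at most once per document
    )
    consensus = {doc for doc, weight in weights.items() if weight >= threshold}
    return weights, consensus
```

For example, if three users retrieve {a, b, c}, {a, b}, and {a, d}, document a has weight 3 and b has weight 2, so a threshold of 2 yields the consensus set {a, b}.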
Will This Paper Ever Be Cited?
Quentin L. Burrell
Published online 28 December 2001
For a homogeneous set of papers, given the average rate at which a paper attracts citations, Burrell calculates the probability that a paper will ever be cited, assuming it has not been cited in a given time. The longer the elapsed time without citation, the greater the likelihood that it will never be cited.
232

A Context Vector Model for Information Retrieval
Holger Billhardt, Daniel Borrajo, and Victor Maojo
Published online 28 December 2001
Billhardt, Borrajo, and Maojo create a matrix of term context vectors whose values are the normalized co-occurrence frequencies of term pairs in a document across the whole collection. The normal Vector Space Model document vectors are then transformed into document context vectors by summing the product of each term's weight and its context vector, divided by the length of that vector, so that the value is the average of the influences of the term on all terms in the document. Queries are then handled in the normal manner using either term frequency vectors, binary vectors, or query context vectors obtained in the same way as the document vectors above. In tests run on the MED, CRANFIELD, CISI, and CACM collections, the terms were first run against a stop list and then stemmed, and single-occurrence stems were eliminated. Comparisons were made to the Vector Space Model using IDF weights. In general, small improvements are noted in all collections, with different variants of the context approach having the better effect in different collections. The procedure is expensive in terms of time and memory, particularly if query context vectors are computed while a response is awaited.
236

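The transformation from term vectors to context vectors can be illustrated with a small sketch. It makes several assumptions that the summary does not confirm: co-occurrence is counted once per document, each term context vector is scaled to unit Euclidean length, term weights are raw term frequencies, and a document context vector is the weight-weighted average of its terms' context vectors. All function names are hypothetical.

```python
import math
from collections import Counter


def term_context_vectors(docs):
    """Build one unit-length co-occurrence vector per term (assumed normalization)."""
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: i for i, term in enumerate(vocab)}
    counts = {term: [0.0] * len(vocab) for term in vocab}
    for doc in docs:
        present = set(doc)
        for t in present:
            for u in present:  # a term also co-occurs with itself
                counts[t][index[u]] += 1.0
    vectors = {}
    for term, row in counts.items():
        norm = math.sqrt(sum(x * x for x in row))
        vectors[term] = [x / norm for x in row]
    return vocab, vectors


def document_context_vector(doc, vectors, size):
    """Weighted average of the context vectors of the terms in `doc`."""
    tf = Counter(doc)
    total = sum(tf.values())
    out = [0.0] * size
    for term, weight in tf.items():
        for j, x in enumerate(vectors[term]):
            out[j] += weight * x
    return [x / total for x in out]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```

Under these assumptions, a query and a document that share no terms can still receive a nonzero similarity when their terms co-occur with a common third term, which is the effect the context approach is after. Building the full term-by-term matrix is also where the time and memory cost arises.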
BOOK REVIEWS

The Map Library in the New Millennium, edited by R.B. Parry and C.R. Perkins
Lisa A. Ennis
Published online 27 December 2001
250

Data Privacy in the Information Age
Alan T. Schroeder, Jr.
Published online 14 December 2001
251

CALLS FOR PAPERS
254