Claudio Carpineto and Giovanni Romano
First we have Carpineto and Romano, who make use of a clustered document file based upon set inclusion relations among terms, merge queries into the clustered document space, and consider the shortest path between a query and a document
as the basis of a retrieval status value. Typical hierarchical clustering methods do not produce all likely clusters due to arbitrary tie breaking, and fail to discriminate between documents with significantly different
degrees of similarity to a query. In their concept lattice ranking (CLR), a lattice is built on the basis of term co-occurrence in documents and supplemented rather than totally re-computed with the addition of each new
document or query.
Using the CACM and CISI collections and queries, weighted term vectors were computed for use in best match retrieval, and a hierarchical single link clustering using cosine ranking was produced, for
comparison with CLR. Lattice construction took 15 minutes for CACM and 2 hours for CISI. Both best match and CLR return better precision and recall measures than hierarchical clustering, but little difference appears
between the two. A comparison of CLR and hierarchical clustering on unmatched documents was then carried out using expected search length as a measure. CLR outperforms hierarchical clustering on this measure and may be useful in discovering non-matching
relevant documents.
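The idea of ranking by shortest path in a lattice can be illustrated with a minimal sketch. It is not the authors' implementation: the three toy documents, the query, and the helper names (`all_intents`, `hasse_edges`, `path_lengths`) are assumptions made for illustration, the lattice is rebuilt from scratch rather than updated incrementally, and plain edge count stands in for whatever retrieval status value the paper derives from the path.

```python
from collections import deque
from itertools import combinations


def all_intents(rows):
    """Return every concept intent: the full term set closed under
    intersection with each row's term set."""
    intents = {frozenset().union(*rows.values())}
    for terms in rows.values():
        intents |= {terms & intent for intent in intents}
    return intents


def hasse_edges(intents):
    """Link two intents when one properly contains the other
    and no third intent lies strictly between them."""
    edges = {intent: set() for intent in intents}
    for a, b in combinations(intents, 2):
        if not (a < b or b < a):
            continue  # incomparable term sets are not linked
        low, high = (a, b) if a < b else (b, a)
        if not any(low < c < high for c in intents):
            edges[low].add(high)
            edges[high].add(low)
    return edges


def path_lengths(edges, start):
    """Breadth-first search: number of lattice edges from start to each node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


# Hypothetical toy collection (not from CACM or CISI).
docs = {
    "d1": frozenset({"lattice", "ranking"}),
    "d2": frozenset({"lattice", "clustering"}),
    "d3": frozenset({"clustering", "retrieval"}),
}
query = frozenset({"lattice"})

# The query is merged into the same space as the documents.
rows = dict(docs, q=query)
edges = hasse_edges(all_intents(rows))
dist = path_lengths(edges, query)
distances = {name: dist[terms] for name, terms in docs.items()}
ranking = sorted(distances, key=distances.get)
```

In this toy case d3 shares no term with the query yet still receives a finite distance through the lattice, which is the kind of non-matching document the comparison on unmatched documents is concerned with.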
A Linear Algebra Measure of Cluster Quality
Laura A. Mather
Mather proposes a new measure of cluster effectiveness independent of knowledge of retrieval measures computed for queries
on the clustered file, and based on the theory that the clustering quality of a term document matrix is determined by the disjointedness of the terms across the clusters. The ideal clustering case is that where terms
which occur in one cluster occur only in that cluster, or, that is to say, are mutually exclusive across clusters. Such clusters occur if and only if the matrix is ``block diagonal,'' that is to say, has rows and
columns that can be permuted to produce a matrix whose nonzero elements all lie in a set of blocks on the diagonal, while the remaining elements are zero. When terms are disjoint across clusters, the singular values of the
individual cluster blocks, taken together, are the same as the singular values of the whole block diagonal matrix; as term intersection increases and the structure diverges from block diagonal, the two sets of singular
values diverge. A measure of the distance between the singular values of the term document matrix and those of the cluster matrices therefore indicates cluster value, but is difficult to interpret on its own. By taking
random permutations of the matrix and creating clusters from them, one can approximate the mean and standard deviation of the measure under random clustering; subtracting this mean from the value observed for the actual
clustering and dividing by the standard deviation of the samples gives the number of standard deviations separating the observation from a random clustering. These values can be compared to indicate the best clustering.
The computation of the singular values of many large matrices is
required and would be expensive. Experimentally the metric correlates significantly with Shaw's F and with the precision measure, increasing as these measures increase.
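The singular value argument can be made concrete with a small NumPy sketch. It is only an illustration under stated assumptions, not Mather's metric as published: the function names `singular_value_gap` and `z_score`, the use of a Euclidean distance between sorted spectra, the zero padding, and the 4 x 4 toy matrix are all choices made here, and random clusterings are simulated by shuffling the cluster labels of the documents.

```python
import numpy as np


def singular_value_gap(matrix, labels):
    """Euclidean distance between the singular values of the whole
    term-document matrix and the pooled singular values of its clusters.
    Rows are terms, columns are documents; labels assign documents to clusters."""
    whole = np.linalg.svd(matrix, compute_uv=False)
    pooled = np.concatenate([
        np.linalg.svd(matrix[:, labels == k], compute_uv=False)
        for k in np.unique(labels)
    ])
    size = max(len(whole), len(pooled))
    # Pad with zeros so both spectra have equal length, then sort descending.
    whole = np.sort(np.pad(whole, (0, size - len(whole))))[::-1]
    pooled = np.sort(np.pad(pooled, (0, size - len(pooled))))[::-1]
    return float(np.linalg.norm(whole - pooled))


def z_score(matrix, labels, samples=200, seed=0):
    """Standard deviations between the observed gap and the mean gap
    of randomly shuffled cluster assignments."""
    rng = np.random.default_rng(seed)
    observed = singular_value_gap(matrix, labels)
    random_gaps = np.array([
        singular_value_gap(matrix, rng.permutation(labels))
        for _ in range(samples)
    ])
    return (observed - random_gaps.mean()) / random_gaps.std()


# Hypothetical block diagonal matrix: terms 0-1 occur only in documents 0-1,
# terms 2-3 occur only in documents 2-3.
A = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])
block_labels = np.array([0, 0, 1, 1])   # clustering that matches the blocks
mixed_labels = np.array([0, 1, 0, 1])   # clustering that cuts across them

block_gap = singular_value_gap(A, block_labels)
mixed_gap = singular_value_gap(A, mixed_labels)
z = z_score(A, block_labels)
```

With this sign convention a smaller gap means a clustering closer to the block diagonal ideal, so a good clustering yields a strongly negative z value. Every random sample needs its own set of singular value decompositions, which is where the expense noted above comes from.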
A Unified Mathematical Definition of Classical Information Retrieval
Sandor Dominich
Dominich reviews the basic retrieval models concentrating upon the vector space and probabilistic
representations. He shows that these retrieval models define systems of vicinities of documents around queries which can both be represented by a similarity space and thus have a unified mathematical definition.
Validating a Geographical Image Retrieval System
Bin Zhu and Hsinchun Chen
Zhu and Chen compare the performance of their Geographical Knowledge Representation System with image retrieval by human subjects.
Gabor filters are used to extract low level features from 128 x 128 pixel tiles cut from aerial photograph images. A 60-feature vector describes each tile, and a Euclidean distance similarity measure is used to sort the tile
images by least distance. Adjacent similar tiles are grouped to create regions, which in turn are represented with derived vectors. A Kohonen Self-Organizing Map (SOM) is created, showing tiles that represent the textures
to be found in the data. Clicking on one of these displays the tiles in the same category.
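Sorting tiles by least Euclidean distance from a reference tile can be sketched as follows. The function name `rank_tiles` and the two-dimensional toy vectors are assumptions for illustration; the system described uses 60 Gabor-derived features per tile, and the feature extraction itself is not reproduced here.

```python
import numpy as np


def rank_tiles(reference, tiles):
    """Return tile indices ordered by increasing Euclidean distance
    from the reference feature vector, together with the distances."""
    distances = np.linalg.norm(tiles - reference, axis=1)
    return np.argsort(distances, kind="stable"), distances


# Hypothetical feature vectors, one row per tile.
tiles = np.array([
    [0.0, 0.0],
    [3.0, 4.0],
    [1.0, 0.0],
])
order, distances = rank_tiles(np.array([0.0, 0.0]), tiles)
```

Here `order` lists the most similar tile first, so taking its first five entries would correspond to a system's top 5 tiles for a given reference.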
Thirty human subjects were assigned an image and six randomly selected reference tiles to score for similarity to each of the 192
tiles in the image. A second group of ten subjects were asked to draw lines around areas they found similar to the reference tiles. A third group of ten subjects were given the SOM selected reference tiles and asked to
categorize each tile in the whole image into categories represented by these reference tiles. The system exhibited no significant difference in precision from the human subjects but performed less well on recall. Humans
selected more tiles viewed as similar, and the top 5 system and subject tiles were consistently different. Both had difficulty with tiles where texture alone did not distinguish one from another. In grouping tiles into
regions, humans outperformed the system on both measures, but in image categorization no significant difference existed. Adding features other than texture may help performance, which is close to inexpert human
performance.
How Can We Investigate Citation Behavior? A Study of Reasons for Citing
Literature in Communication
Donald O. Case and Georgeann M. Higgins
Case and Higgins review the
previous studies providing lists of reasons for authors' citing behavior, and studies using these categories where investigators classify citation behavior on the basis of content analysis. They also reexamine the
smaller set of studies involving surveys of authors as to the reasons for their behavior. For the two most highly cited authors appearing in both of two recent studies of the Communication literature, all citations
to their work in the years 1995 and 1996 were collected. A total of 133 unique citers were identified and sent 32-item questionnaires with the questions from a recent study in the Psychology literature. Returns from 56 were
received, 31 for author A and 25 for author B, and responses for the two authors were not significantly different. No new reasons for citation were identified. The top reasons were that the cited work reviewed past work,
represented a genre of studies, or was the source of a method. Negative citation is quite rare. Twenty-five non-redundant items with some indication of importance were subjected to a factor analysis. Seven factors
explain 69% of the variance: classic citation, social reasons, negative citation, creative citation, contrasting citation, similarity citation, and cite of a review. Factors predicting citation are: perception of
novelty and representation of a genre, perception that citation will promote cognitive authority of the citing work, and perception that the cited item deserves criticism.
Children's Use of the Yahooligans! Web Search Engine: I. Cognitive,
Physical, and Affective Behaviors on Fact-Based Search Tasks
Dania Bilal
In the Bilal study, twenty-two middle school students
were assigned a question to search in Yahooligans! as part of their Science class. The teacher provided ratings of the children's topic knowledge, general science knowledge, and reading ability. A quiz administered to
the students indicated knowledge of the Internet and of Yahooligans! in particular. Lotus ScreenCam was used to record 18 of the student-system interactions. Students' transcribed moves were classified and counted, with
a score of one (relevant) for selection of a link that appears appropriate and leads to the desired information; .05 for the selection of a link that appears appropriate but is not successful; and 0 for the selection of
links that give no indication of information leading to success. Weighted effectiveness and efficiency scores are then computed.
Thirty-six percent initially browsed subject categories while the rest entered
single or multi-word concepts. Keywords and in some cases natural language were used in subsequent moves despite the fact that Yahooligans! does not support natural language search. Subsequent activity mixed browsing
with term search. Looping and backtracking were very common, but the go button using the search history links was unused. Most children scrolled, but seldom through the complete page. Half were successful but all were
inefficient.
Ethnomethodologically Informed Ethnography and Information System Design
Andy Crabtree, David M. Nichols, Jon O'Brien, Mark Rouncefield, and Michael
B. Twidale
Crabtree et al. object to traditional ethnographic analysis as applied to information problems on the basis that the application of pre-defined rules and procedures yields an organization of the activity observed from
the point of view of the analyst rather than that of the participants. Such a ``constructive analysis'' approach does not describe the actual activities, but in the name of objectivity imposes a structure which obscures
the real world practices through which subjects make sense of their surroundings, and produce information.
Ethnomethodology emphasizes rigorous thick description of local practices by assembling concrete cases of
performed activity as the direct units of analysis. EM analysis attempts to generate a description in great detail of how the described activity could be reproduced in and through the same practices. Such description
provides a sense of the real world aspects of a socially organized setting to systems designers and thus provides the exceptions, contradictions, and contingencies of the activities that otherwise might not be evident.
Practitioners of ethnography and computer system design have quite different cultures, but communication can lead to far better design practices.