In this issue
Bert Boyce
Published Online: 3 Dec 2002


Matchsimile: a Flexible Approximate Matching Tool for Searching Proper Names
Gonzalo Navarro, Ricardo Baeza-Yates, João Marcelo Azevedo Arcoverde
Published Online: 27 Nov 2002

In this issue we begin with a description by Navarro et al. of Matchsimile, a software application designed to search simultaneously for thousands of distinct personal and corporate names in text where authority control cannot be assumed and duplication, abbreviation, omission, insertion, and transposition are present. It was designed for the Portuguese language and uses Brazilian cultural rules for name formation. The user may set weights for the various operations needed to transform a text word into a pattern word, as well as threshold values for acceptance.
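
The idea of user-set weights and an acceptance threshold can be sketched as a weighted edit distance. This is a minimal illustration only, not Matchsimile's actual algorithm: the `Weights` class, the function names, the default weights, and the choice of the optimal-string-alignment recurrence are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """Hypothetical user-set costs for each edit operation."""
    insert: float = 1.0
    delete: float = 1.0
    substitute: float = 1.0
    transpose: float = 1.0


def weighted_distance(text_word: str, pattern_word: str,
                      w: Weights = Weights()) -> float:
    """Cost of transforming text_word into pattern_word.

    Uses the optimal-string-alignment recurrence (an assumption):
    insertions, deletions, substitutions, and adjacent transpositions.
    """
    rows, cols = len(text_word) + 1, len(pattern_word) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * w.delete          # delete every text character
    for j in range(1, cols):
        d[0][j] = j * w.insert          # insert every pattern character
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = text_word[i - 1], pattern_word[j - 1]
            best = min(
                d[i - 1][j] + w.delete,
                d[i][j - 1] + w.insert,
                d[i - 1][j - 1] + (0.0 if a == b else w.substitute),
            )
            # Adjacent characters swapped between the two words.
            if (i > 1 and j > 1 and a == pattern_word[j - 2]
                    and text_word[i - 2] == b):
                best = min(best, d[i - 2][j - 2] + w.transpose)
            d[i][j] = best
    return d[-1][-1]


def accepts(text_word: str, pattern_word: str, threshold: float,
            w: Weights = Weights()) -> bool:
    """A text word matches when its cost does not exceed the threshold."""
    return weighted_distance(text_word, pattern_word, w) <= threshold
```

With these defaults, a transposed spelling such as "sliva" for "silva" costs one transposition; lowering `transpose` makes such swaps cheaper than other errors.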

Matchsimile normalizes text words, recognizes their similarity with target words stored in a trie structure, links them to the patterns containing those words, and finally recognizes phrase patterns. The cost of processing M patterns on N megabytes of text was N × (2.05 + 5.96 M^0.6 + 0.01 M) on a Sun UltraSparc-1 with Solaris 2.5.1.
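
As a rough illustration of how the reported cost model behaves, the sketch below evaluates it for a given text size and pattern count. It assumes the formula is read as N × (2.05 + 5.96 M^0.6 + 0.01 M); the function name is hypothetical, and the unit of the result is not stated in this summary.

```python
def estimated_cost(n_megabytes: float, m_patterns: float) -> float:
    """Evaluate N * (2.05 + 5.96 * M**0.6 + 0.01 * M).

    The unit of the returned value is not given in the summary,
    so it is left unspecified here.
    """
    return n_megabytes * (2.05 + 5.96 * m_patterns ** 0.6 + 0.01 * m_patterns)


if __name__ == "__main__":
    # Cost per megabyte of text for a few pattern-set sizes.
    for m in (100, 1_000, 10_000):
        print(m, round(estimated_cost(1.0, m), 2))
```

Under this reading the cost is linear in the amount of text, while the dominant term grows sublinearly in the number of patterns.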


Browsing-based Conceptual Information Retrieval Incorporating Dictionary Term Relations, Keyword Association, and a User's Interest
Makoto Nakashima, Keizo Sato, Yanhua Qu, Tetsuro Ito
Published Online: 27 Nov 2002

Nakashima et al. consider that an initial search engine query on the World Wide Web creates a personal digital library containing the desired documents, which must then be browsed by the user for a precision search. The traditional top-down browse may not be the most effective strategy, given that the query terms on which the engine's ranking algorithm is based may not adequately reflect the user's desires. They use standard controlled vocabularies to replace and expand query terms, but also create a local dictionary as the personal digital library is browsed, storing documents judged relevant and not relevant in separate sets and using their terminology to suggest the most likely remaining documents to view. The standard thesauri terms replace any word they include, and closeness for document ranking is measured by finding the most specific generic term for both document and query terms. The initial display organization is maintained, but once documents have been judged relevant or not, those yet-to-be-judged documents with several terms in common with documents judged relevant are considered worth examining and are suggested to the user.
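
The local-dictionary suggestion step might be sketched as follows. This is an illustrative reading, not the authors' implementation: the function name, the `min_shared` cutoff standing in for "several terms," and the decision to discard terms that also occur in non-relevant documents are assumptions.

```python
def suggest(unjudged: dict[str, set[str]],
            relevant: list[set[str]],
            nonrelevant: list[set[str]],
            min_shared: int = 2) -> list[str]:
    """Return ids of unjudged documents worth examining next.

    A document is suggested when it shares at least `min_shared` terms
    with the local dictionary built from documents judged relevant.
    Terms also seen in non-relevant documents are dropped (an assumption).
    """
    relevant_terms = set().union(*relevant)
    nonrelevant_terms = set().union(*nonrelevant)
    local_terms = relevant_terms - nonrelevant_terms

    scored = [(len(terms & local_terms), doc_id)
              for doc_id, terms in unjudged.items()]
    # Most shared terms first; ties broken by document id for stability.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score >= min_shared]
```

The two judged sets are kept separate, as in the description above, so that the suggestion list can be recomputed after each new relevance judgement.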

Experiments on the CACM, MED, and CISI collections using the ACM Classification, MeSH, and the ASIS Thesaurus were performed using the provided relevance judgements, with the resulting ranking compared to a ranking by cosine function. Both the global dictionary expansion and the local dictionary feedback system were statistically superior in average precision for 10 recall points. The combination of the two techniques improved performance over cosine ranking by 66% in CACM, 34% in MED, and 33% in CISI.


Scholarly Use of the Web: What Are the Key Inducers of Links to Journal Web Sites?
Liwen Vaughan, Mike Thelwall
Published Online: 27 Nov 2002

Vaughan and Thelwall find that both the age of a journal's Web site and the extent of its provided content are positively correlated with the ratio of the site's link count to the journal's impact factor. Age was determined by use of the WayBack Machine, an interface to a database of The Internet Archive that reports the date of each time a site has been crawled by a search engine. Because sites can deny access to crawlers, URLs change, and many sites are not crawled immediately after their initiation, if at all, the method is imperfect but powerful. Alta Vista was used for link counts during low use periods. These counts vary with how busy the servers are, because they are based upon an estimate from a sample count whose size depends on available resources. Content extent was categorized as one of: basic journal description, access to table of contents, access to abstracts, or full-text access. Data was collected on 38 library and information science journals and 88 law journals. To control for the effect of journal impact factors on link counts, the link/impact factor ratio was used rather than simple link counts. The Kruskal-Wallis test gave significant differences among the content groups, and age correlated with linkage using the Spearman coefficient in both disciplines.
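
For readers unfamiliar with the measure, the link/impact factor ratio and a Spearman rank correlation can be sketched in a few lines. The helper names are hypothetical and the sample figures are invented for illustration; they are not data from the study.

```python
from math import sqrt


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1          # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman coefficient: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Invented example values, NOT taken from the study.
    site_age_months = [12, 30, 45, 60, 72]
    link_counts = [40, 95, 80, 210, 300]
    impact_factors = [0.8, 1.1, 0.5, 1.4, 1.2]
    ratios = [links / impact
              for links, impact in zip(link_counts, impact_factors)]
    print(spearman(site_age_months, ratios))
```

Dividing link counts by impact factor before ranking is what keeps a highly cited journal from appearing well linked simply because of its impact.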


Using the User's Mental Model to Guide the Integration of Information Space into Information Need
Charles Cole, John E. Leide
Published Online: 27 Nov 2002

Cole and Leide consider whether visualization will help undergraduate students search successfully for information for essay writing. To determine whether undergraduates could produce a mental representation of the relations among concepts related to their topics, 33 students were interviewed prior to presenting their topic to their instructor. Two groups were randomly chosen: the first, the precision group, was asked to supply four or five key words or phrases outlining their topic; the second, the recall-visualization group, was asked to take a further step and convert these terms into a visualization of circles, where size represents importance and closeness of the circles indicates closeness of concept association. Each student then joined the authors for an automated search of history databases, with the recall group finding a desired citation and then adding terms to expand to about 200 citations. The precision group did the same but expanded by staying close to the original search terms to generate about 10 citations.

The high recall group received an objective visualization, based on term counts and the authors' view of relationships, and was asked to compare it to their own. T-tests showed that the marks received by the students on their essays were not statistically different among the two groups and a third group that was graded but did not participate in the study. Thus, there is no evidence to support visualization as a technique for improving essay grades. However, the high recall nature of the visualization searches did not adversely affect grading.
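
A comparison of group marks of this kind can be sketched with a two-sample t statistic. The summary does not say which form of t-test was used, so Welch's unequal-variance version is an assumption here, as is the function name.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Return (t statistic, approximate degrees of freedom).

    Welch's unequal-variance form is an assumption; the summary only
    says that t-tests were used. Each sample needs at least two values.
    """
    va = variance(a) / len(a)           # squared standard error of mean(a)
    vb = variance(b) / len(b)           # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

A t value close to zero relative to its degrees of freedom corresponds to the "not statistically different" outcome reported above; turning t into a p-value requires the t distribution, which is outside this sketch.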


A Bit More to It: Scholarly Communication Forums as Socio-technical Interaction Networks
Rob Kling, Geoffrey McKim, Adam King
Published Online: 27 Nov 2002

Kling, McKim, and King focus on what they term Scholarly Communication Forums (SCF), entities that permit communication among scholars and that have been proliferating, particularly in the medium of computer-mediated exchange of information (e-SCF). They present a Socio-Technical Interaction Network (STIN) model of e-SCFs that fits better than what they term the standard model. The STIN model includes people, equipment, data, resources, documents, legal mechanisms, and resource flows with their social, economic, and political interactions. The standard model holds that an actor's behavior is motivated by the information processing features of an e-SCF, and that each actor individually chooses to use or not use the system. Thus, the focus is on the individual and the features of the technology and not the characteristics of the groups and organizations involved. A STIN modeler will identify a population of interactors, core interactor groups, incentives, undesired interactions, existing communication forums, points of architectural choice, and resource flows prior to determining a configuration. Both standard and STIN modeling techniques are applied to HEP, FlyBase, ISWORLD, and HEP to illustrate the method.


Rotation and Scale Invariant Wavelet Feature for Content-based Texture Image Retrieval
Moon-Chuen Lee, Chi-Man Pun
Published Online: 27 Nov 2002

Lee and Pun seek content-based image retrieval that is robust to changes in image scale and orientation. They convert the image into a log-polar image that does not vary under rotation and is nearly invariant under scale change. The transform is of O(n) complexity, where n is the number of pixels. The process causes an undesirable row shift effect, but a wavelet packet transform, which is row-shift invariant, will resolve this difficulty. However, the number of wavelet coefficients produced is large; it can be reduced by computing an energy signature for each subband of wavelet coefficients, sorting the signatures, and choosing only the most dominant. The sort complexity is O(n log n) and the computation of the signatures is O(n), so the overall complexity is O(n log n).
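
The reduction step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the energy signature is taken to be the mean absolute coefficient value, the subband names and function names are hypothetical, and the wavelet packet transform itself is assumed to have been computed elsewhere.

```python
from math import dist


def energy_signature(coeffs: list[float]) -> float:
    """Mean absolute value of one subband's coefficients (an assumed definition)."""
    return sum(abs(c) for c in coeffs) / len(coeffs)


def dominant_signatures(subbands: dict[str, list[float]],
                        keep: int) -> list[float]:
    """Compute one signature per subband, sort, and keep the largest `keep`."""
    energies = sorted((energy_signature(c) for c in subbands.values()),
                      reverse=True)
    return energies[:keep]


def feature_distance(query: list[float], candidate: list[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return dist(query, candidate)
```

Sorting the signatures before truncation means the feature vector no longer depends on which subband produced each value, only on the strongest energies.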

Using 25 natural textures from the Brodatz texture album, Euclidean distance is computed between the query image and the database image after the above process is completed. The system outperforms the traditional wavelet packet signature feature.


Information Science Research Agenda in Slovakia: History and Emerging Vision
Jela Steinerová
Published Online: 27 Nov 2002

Steinerova provides a summary of library and information science education and research in Slovakia since 1990. New curricula have been developed and the European Credit Transfer Scheme adopted. A terminologic and encyclopedic dictionary of library and information science has been produced. Research interests are moving beyond automation toward social and human contexts.


Empirical Evidence of Self-organization?
Peter van den Besselaar
Published Online: 27 Nov 2002

Finally, Van den Besselaar comments on Leydesdorff and Heimeriks' JASIST article, which found a relationship between title words and the region of origin of documents: discriminant analysis on title word sets predicted the region of production in 78% of the cases in a biotechnology literature. Van den Besselaar repeats the method with an information science literature and a science and technology literature, finding an even higher percentage of correct classification. He then points out that the data used do not meet the conditions for discriminant analysis, in that the independent variables are nominal rather than on an interval scale. Because the words are nearly unique, this makes it trivial to find a relationship with any classification. Indeed, random groupings of the biotechnology documents can be predicted equally well. Testing random splits of the database yields strong prediction for the first half but less than the a priori probabilities on the second half, implying that every test results in a different model and that the relation of region to word use needs to be rejected.


Book review
Lisa A. Ennis
Published Online: 3 Dec 2002

Historical Information Science: An Emerging Unidiscipline. Lawrence J. McCrank. Medford, NJ: Information Today, 2001; 1192 pp. Price $149.95 (ISBN: 1-57387-071-0)


Reaction to a book review
Ron Day
Published Online: 27 Nov 2002


