Bert R. Boyce





Recollections of Irving H. Sher 1924-1996: Polymath/Information Scientist Extraordinaire
Eugene Garfield
Published online 18 October 2001

In this issue we begin with Garfield's recollections of Irving H. Sher, in which he reviews Sher's contributions both to Garfield's own work and to the Institute for Scientific Information. Sher bore great responsibility for the development of the Permuterm Subject Index and the Automated Subject Citation Alert System. He was the lead author on the first report of the citation characteristics of Nobel Prize winners, and a key collaborator in the design of the impact factor metric.








Known-Item Online Searches Employed by Scholars Using Surname plus First, or Last, or First and Last Title Words
Frederick G. Kilgour
Published online 16 October 2001

Kilgour repeats an earlier experiment that supported the efficacy of two- or three-word searches for known items, now limiting his searches to the MARC 100 and 245 fields in the University of Michigan online catalog and testing the ability of a search on surname plus two title words to produce a single 20-line minicat. Every book citation with a personal author found in the bibliographies of 10 scholarly monographs, each chosen from one of Dewey's ten classes, was searched by surname plus first title word, surname plus last title word, and surname plus both words. The surname plus both words produced a 20-line minicat in 98.9% of the cases, and in 54.8% of searches only the desired record was displayed. Use of a third title word increased the percentage to 99.4%. In 13.1% of the cases the title was reported as not in the database. Of these, 216 were verified as not in the database, 114 were citation errors that were verified and researched, 40 could not be verified, and 81 were discarded after verification as having corporate authors. This argues for combining the found-item display with an availability display.
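The search keys described above can be illustrated with a minimal Python sketch. It is not Kilgour's implementation: the record layout (an "author" string standing in for MARC field 100 and a "title" string for field 245), the function names, and the sample catalog are all assumptions made for illustration, and no stop words are removed when picking the first and last title words.

```python
def title_words(title):
    """Lowercase a title and split it into words, dropping punctuation."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in title)
    return cleaned.split()


def known_item_search(records, surname, first_word=None, last_word=None):
    """Return records whose author surname and requested title words all match."""
    surname = surname.lower()
    hits = []
    for rec in records:
        # Assumed author form: "Surname, Forenames".
        if rec["author"].split(",")[0].strip().lower() != surname:
            continue
        words = title_words(rec["title"])
        if not words:
            continue
        if first_word is not None and words[0] != first_word.lower():
            continue
        if last_word is not None and words[-1] != last_word.lower():
            continue
        hits.append(rec)
    return hits


def fits_minicat(hits, limit=20):
    """True when the result set is non-empty and fits one display of `limit` lines."""
    return 0 < len(hits) <= limit


# Hypothetical three-record catalog, used only to exercise the functions.
catalog = [
    {"author": "Kuhn, Thomas S.", "title": "The Structure of Scientific Revolutions"},
    {"author": "Kuhn, Thomas S.", "title": "The Copernican Revolution"},
    {"author": "Popper, Karl", "title": "The Logic of Scientific Discovery"},
]

by_first = known_item_search(catalog, "Kuhn", first_word="the")
by_both = known_item_search(catalog, "Kuhn", first_word="the", last_word="revolutions")
```

In this toy catalog the surname plus first title word leaves two candidates, while adding the last title word isolates a single record, which is the kind of narrowing the study measures.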












An Interpretive and Situated Approach to an Evaluation of Perseus Digital Libraries
Shu Ching Yang
Published online 31 October 2001

Yang studies Perseus, a hypermedia digital library on classical studies, as a learning tool for classroom instruction. Data were collected on the experiences of five undergraduates taking a course in classical Greek studies who used the system exclusively and regularly in class and in their assignments. The class assignment and problem-solving techniques using the system were discussed. As the students worked, their verbalized thought processes, decisions, and body language were recorded. The multiplicity of available links led some students to cognitive overload, while others complained of fragmented instruction. Several complained about the unevenness of the material and the limitations of the system's path-making tool.










Ranked Retrieval with Semantic Networks and Vector Spaces
Vladimir A. Kulyukin and Amber Settle
Published online 26 October 2001

Kulyukin and Settle create a logical model that generalizes and rigorously formalizes both the semantic network spreading activation model and the dot product version of the vector space model of retrieval systems. By specifying algorithms to construct each model from the other, they demonstrate that the two models are equivalent under ranked retrieval. This suggests that tests comparing the two models may really be comparing differences in data and relevance judgments.
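For readers unfamiliar with the dot product version of the vector space model, the following Python sketch shows ranked retrieval by dot product score. It illustrates only that one model, not the authors' formalization or their construction algorithms; the sparse dictionary representation, the function names, and the toy documents are assumptions.

```python
def dot(query, doc):
    """Dot product of two sparse term-weight vectors stored as dicts."""
    return sum(weight * doc.get(term, 0.0) for term, weight in query.items())


def rank(query, docs):
    """Return ids of documents with a positive score, best first (ties by id)."""
    scored = [(dot(query, vec), doc_id) for doc_id, vec in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]


# Hypothetical term-weight vectors for three documents and one query.
docs = {
    "d1": {"semantic": 2.0, "network": 1.0},
    "d2": {"vector": 3.0, "space": 1.0},
    "d3": {"network": 1.0, "vector": 1.0},
}
query = {"vector": 1.0, "network": 1.0}

ranking = rank(query, docs)  # scores: d1 = 1.0, d2 = 3.0, d3 = 2.0
```

Only the relative order of the scores matters here, which is why equivalence is stated in terms of ranked retrieval rather than of the raw score values.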








Reduction of the Dimension of a Document Space Using the Fuzzified Output of a Kohonen Network
Vicente P. Guerrero and Felix de Moya Anegon
Published online 31 October 2001

Guerrero and de Moya Anegon extracted 7758 unique words from the abstracts of the last 954 records in the Summer 1996 issue of Library and Information Science Abstracts (LISA), reduced the set to 7577 with a stop list and to 5052 using the Porter stemmer, then chose the 1200 with the greatest discrimination values. The terms were then weighted by the product of their occurrence in the document and the log of the total number of terms in the database over the number of terms in the document. The set was further reduced to 400 dimensions by using only vectors for documents that had been assigned one of seven subject terms. The documents and the terms were then clustered using a fuzzy-value Kohonen self-organizing map technique, and the clusters were evaluated as to the grouping of LISA-assigned terms. Degrees of membership for each item in each cluster are thus available.
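Taking the wording of the summary literally, the term weighting can be sketched in Python as follows. This is an assumption about what the sentence means, and the formula in the paper itself may differ; the function names, the natural logarithm, and the sample figures are illustrative only.

```python
import math


def term_weight(occurrences, total_terms_in_database, terms_in_document):
    """Occurrence count times log(total terms in database / terms in document)."""
    return occurrences * math.log(total_terms_in_database / terms_in_document)


def weight_document(term_counts, total_terms_in_database):
    """Weight every term of one document given its raw occurrence counts."""
    terms_in_document = len(term_counts)
    return {
        term: term_weight(count, total_terms_in_database, terms_in_document)
        for term, count in term_counts.items()
    }


# Hypothetical document with three distinct terms, in a 1200-term database.
weights = weight_document({"catalog": 2, "index": 1, "thesaurus": 1}, 1200)
```

Under this reading a term that occurs twice receives exactly twice the weight of a term that occurs once in the same document, because the logarithmic factor depends only on the document and the database.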










The Scatter of Documents Over Databases in Different Subject Domains: How Many Databases Are Needed?
William W. Hood and Concepcion S. Wilson
Published online 26 October 2001

To investigate cross-database subject scatter, Hood and Wilson created 14 queries of 240 characters or less in different topical areas that would retrieve fewer than 5000 records a year from DIALOG databases. The terms were searched in title and abstract fields from 1994 to 1998. Files in the DIALINDEX category ALL, excluding newspapers and ONTAP files as well as Current Contents Search, were used, providing 373 databases. Fourteen DIALINDEX searches were then run, and databases not supporting duplicate detection were removed from the results before the remainder were sorted in decreasing frequency order. The top database was run against the query separately for each of the five years and the "remove duplicates" command issued. Then the top and second databases were searched together, duplicates removed, and the process repeated until a cumulative frequency distribution of the non-duplicated records was created.

Considerable scatter occurs and in the worst case the most productive database provides only 19% of the citations. The distributions are hyperbolic with identifiable cores. However, the degree of scatter appears to be subject dependent with high concentration in one database and 80% coverage in five to eight files for four of the searches; moderate concentration in one database and 80% coverage in seven to ten files for five of the searches; and low concentration in one database and 16 to 19 files to get 80% coverage in the other four.
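The cumulative procedure summarized above can be sketched in Python. This is a simplified illustration, not the authors' DIALOG workflow: it assumes each database's hits are already available as a set of record identifiers in which duplicates share an identifier, it ignores the year-by-year runs, and the function names and sample data are hypothetical.

```python
def cumulative_coverage(hits_by_database):
    """Add databases in decreasing hit order, counting non-duplicated records."""
    order = sorted(hits_by_database, key=lambda db: (-len(hits_by_database[db]), db))
    seen = set()
    curve = []
    for db in order:
        seen |= hits_by_database[db]  # the set union drops duplicate records
        curve.append((db, len(seen)))
    return curve


def databases_needed(curve, share=0.8):
    """Number of top databases required to reach `share` of all unique records."""
    if not curve:
        return 0
    total = curve[-1][1]
    for count, (_db, unique) in enumerate(curve, start=1):
        if unique >= share * total:
            return count
    return len(curve)


# Hypothetical result sets for one query across four databases.
hits = {
    "A": {1, 2, 3, 4, 5},
    "B": {4, 5, 6, 7},
    "C": {7, 8},
    "D": {9, 10},
}

curve = cumulative_coverage(hits)
needed = databases_needed(curve)
```

With these toy sets the top database supplies half of the ten unique records and three databases are needed for 80% coverage; the study reports the analogous figures for each of its 14 queries.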















Effects of Link Annotations on Search Performance in Layered and Unlayered Hierarchically Organized Information Spaces
Landon Fraser and Craig Locatis
Published online 26 October 2001

Fraser and Locatis investigate the effect of adding annotations to hyperlinks in both shallow and deep link structures, and the effect of annotations on searches with low word correspondence with the words in links, which they term difficult searches. Accuracy was measured as reaching the correct document section; efficiency as the number of clicks necessary to do so, as well as elapsed search time. A 25-page document with obscure subject matter was supplemented with three additional documents to provide roughly equal treatment of four internal topics in 48 separate sub-sections. A three-layered version was created, with a fourth set of links to primary content. An unlayered version with 48 links from a table of contents with the same nested structure was also prepared, then both versions were supplemented with annotations to create two additional versions. Easy questions contained a word appearing in a link; medium questions contained such a word or a synonym but appearing in multiple links; difficult questions had no direct word correspondence. High school students were given three consecutive questions and assigned twenty to each of the four treatments. Link annotations had minimal effect on performance. Shallow spaces were easier and quicker to navigate. Attention seems to be paid to link wording rather than annotation wording.













The Self-Organization of the European Information Society: The Case of "Biotechnology"
Loet Leydesdorff and Gaston Heimeriks
Published online 26 October 2001

Leydesdorff and Heimeriks select 711 biotechnology papers supplying 787 institutional addresses in Europe, Japan, or the United States in order to determine if this literature can be used to illuminate the interaction between the reorganization of the European science community through transnational collaboration and the transition to so-called Mode-2 research, with its emphasis on government, university, and industry collaboration. The top 10% of title words yields 245 unique types, and nine documents that included none of these words were excluded, yielding a 778 by 245 document-term matrix. Factor analysis of the words provides 99 factors with an eigenvalue above unity and thus reveals no inherent structure and leaves the choice of number of word groupings arbitrary. Discriminant analysis will successfully classify 77.6% of the papers geographically, but this level is not statistically significant. The word frequency lists for these geographical sets can be correlated with the forced factor loadings on ten or fewer dimensions. A three-factor solution, which suggests a geographically separate literature, makes the American set more correlated. A four-factor solution, which would consider an international factor as well as the three geographical factors, shows American and European correlations at almost equal strength to different factors. The European set is correctly classified by discriminant analysis in 86.5% of cases. Its disaggregation shows differences and thus seems to be a part of a global phenomenon, not national interactions. The European vocabulary factor disappears after disaggregation.
















Natural Language Processing: Word Recognition without Segmentation
Khalid Saeed and Agnieszka Dardzinska
Published online 25 October 2001

To recognize Arabic script, Saeed and Dardzinska bitmap a word without the dot components, find the cusps of letters, and calculate the lengths of the vectors from the origin to these points and the differences in length between successive vectors as values in a matrix. They use the lowest eigenvalues for a first feature vector and then calculate the angles of the original vectors and use the difference in their successive tangents to create a second matrix whose lowest eigenvalues will provide a second feature vector. These shapes will provide an initial classification. The width of the linear segments between the cusps and the number and position of dots are also calculated, and this information is used if the initial shapes have not defined an existing word. The process was successfully tested on several script fonts.













