Journal of the Association for Information Science


Bert R. Boyce




The Web as a Classroom Resource: Reactions from the Users
Andrew Large and Jamshid Beheshti    
Published online 3 August 2000
   We begin with Large and Beheshti's examination of the reactions to Web searching of 53 grade-six students, almost half of whom had previously searched the Web at home. The students were divided into 20 groups, and 78 search sessions of 30 minutes each were conducted with the goal of producing a poster on a winter Olympic sport. Of the 50 posters submitted, 47 included at least one downloaded image. The 50 students who completed the project were interviewed on tape, and from their responses three groups emerged: technophiles, who were sold on the Web as a source; traditionalists, who preferred traditional information sources; and pragmatists, who saw the Web as a supplement to other sources.

   Few students reported vocabulary problems, and students seemed to find material from the Web easier to rewrite than material from other sources. Only a few questioned the authority of the material they found. Pictures were used to promote viewer interest and to take up space; when space was at a premium, however, text received priority. Color was considered important. Browsing was extensive, but comments supporting serendipitous results were rare. The Web's accessibility was seen as an advantage over printed sources, but some students found the Web harder to use. The class split on which method was faster.

Maps of Information Spaces: Assessments from Astronomy
Philippe Poincot, Soizick Lesteven, and Fionn Murtagh    
Published online 3 August 2000

Poincot et al. use Kohonen's self-organizing map (SOM) to create a visual user interface for retrieval of astronomy literature. Using papers from the journal Astronomy and Astrophysics from 1994-98, each of 6500 documents is described by a vector of 269 keywords and located in a random grid position. Objects are then moved to the position where another object is described by a similar vector, until a clustering is achieved. Large clusters of documents trigger the production of local maps. A search is conducted by entering keywords, which creates a map of the documents containing those words; the map displays clusters by showing the number of documents at each position together with a theme label. A comparison with the Astrophysics Data System (ADS) was carried out using two ADS searches: one that generated documents similar to a starting document using assigned keywords, and a second that used the full text of the starting document. Both results were compared to the SOM results, where the node containing the starting document and its 8 surrounding nodes produced an initial hit list, and the documents on this list, reprocessed through the SOM, produced a second, smaller list. Eight subject experts evaluated the relevance of the results. Using assigned keywords, both systems retrieved the same 11 relevant documents, but SOM retrieved an additional 3 that ADS did not and missed 5 that ADS did retrieve. Using full-text queries, both systems retrieved the same 15 relevant documents, but SOM missed an additional 10 that ADS did retrieve. The systems appear complementary.
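
The map-then-neighbourhood retrieval described above can be illustrated with a minimal sketch. It is a textbook Kohonen update on a small grid, not the authors' implementation: the function names, the learning-rate and radius schedules, and the toy keyword vectors are all assumptions made for illustration.

```python
import math
import random


def best_matching_unit(weights, vector):
    """Grid node whose weight vector is closest (squared Euclidean) to `vector`."""
    return min(
        weights,
        key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], vector)),
    )


def train_som(vectors, rows, cols, epochs=30, lr0=0.5, seed=0):
    """Train a small self-organizing map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    radius0 = max(rows, cols) / 2.0
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # assumed linear decay
        radius = max(radius0 * (1.0 - frac), 0.5)  # assumed shrinking neighbourhood
        for vector in rng.sample(vectors, len(vectors)):
            bmu = best_matching_unit(weights, vector)
            for node, w in weights.items():
                dist2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                pull = lr * math.exp(-dist2 / (2.0 * radius * radius))
                for i in range(dim):
                    w[i] += pull * (vector[i] - w[i])
    return weights


def neighbourhood(node, rows, cols):
    """The node itself plus its (up to) 8 surrounding grid nodes."""
    r, c = node
    return [
        (r + dr, c + dc)
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if 0 <= r + dr < rows and 0 <= c + dc < cols
    ]


def hit_list(weights, docs, start_id, rows, cols):
    """Documents mapped to the starting document's node or a surrounding node."""
    near = set(neighbourhood(best_matching_unit(weights, docs[start_id]), rows, cols))
    return [
        doc_id
        for doc_id, vector in docs.items()
        if doc_id != start_id and best_matching_unit(weights, vector) in near
    ]


# Toy binary keyword vectors (hypothetical data, 4 keywords instead of 269).
docs = {
    "a": [1, 1, 0, 0],
    "b": [1, 1, 0, 0],
    "c": [0, 0, 1, 1],
}
som = train_som(list(docs.values()), rows=3, cols=3)
hits = hit_list(som, docs, "a", rows=3, cols=3)
```

Documents with identical keyword vectors always land on the same node, so "b" appears in the hit list for "a"; whether "c" does depends on how the small map happens to organize.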

Aboutness from a Commonsense Perspective
P.D. Bruza, D.W. Song, and K.F. Wong
Published online 11 August 2000
   Bruza et al. present an axiomatic structure characterizing aboutness in a set of information carriers. They demonstrate that overlap aboutness, a condition where two information carriers share and contain a third carrier, is characterized by reflexivity and by right and left monotonicity. But monotonicity means that once a document is determined to be about a query, no modification of the query can remove that document, so precision must suffer. In practice this may be overcome by using threshold values, but that is theoretically unsound. Commonsense aboutness is similar but uses restricted monotonicity. Such "containment" models are exact-match models and would include the Boolean approach. Probabilistic models would be fully non-monotonic. Potential for precision is linked to non-monotonicity.
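
For readers who want the three properties in symbols, one common way to write them is sketched below. The notation is an assumption made for illustration and may differ from the paper's own: A, B, and C stand for information carriers, A ⊨ B reads "A is about B", and ⊕ composes two carriers.

```latex
% Assumed notation: A, B, C are information carriers;
% A \models B reads "A is about B"; \oplus composes two carriers.
\begin{align*}
\text{Reflexivity:}        \quad & A \models A \\
\text{Left monotonicity:}  \quad & A \models B \;\Longrightarrow\; A \oplus C \models B \\
\text{Right monotonicity:} \quad & A \models B \;\Longrightarrow\; A \models B \oplus C
\end{align*}
```

Reading A as a document and B as a query, right monotonicity says that extending the query with further material C can never retract a document already judged to be about it, which is the source of the precision problem noted above.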

What is a Collection?
Hur-Li Lee
Published online 3 August 2000
   Traditionally, library collections are tangible, owned, and available to a specified community. The use of selected remote virtual sources casts doubt on tangibility as a criterion, and leased books, interlibrary loan, and remote databases likewise call ownership into question. Lee sees the intended user community as the key defining element of the modern collection, with ownership, collocation, format, and tangibility no longer its defining characteristics.

A Comparison of Techniques to Find Mirrored Hosts on the WWW
Krishna Bharat, Andrei Broder, Jeffrey Dean, and Monika Henzinger
Published online 11 August 2000
   To detect mirrored hosts, Bharat et al. test four classes of algorithms that use only URL information against a slow validation technique that compares fetched documents. Detecting mirrored hosts is useful for suppressing them in index displays and in the computation of link-based authority scores. Host A mirrors host B if and only if for every document on A there is a highly similar document on B with the same path, and vice versa. Highly similar, in a large-file Web environment, means that the text has been broken into a given number of strings of tokens of a given length; the count of the intersection of these strings divided by the count of their union gives a result between 0 and 1, where a value approaching 1 indicates high resemblance.
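
The resemblance measure just described is the ratio of shared token strings to all token strings. A minimal sketch follows, assuming whitespace tokenization and using every contiguous token string rather than a fixed-size sample of them; the function names and the default width of 4 are illustrative choices, not taken from the paper.

```python
def shingles(text, width=4):
    """Set of contiguous token strings of the given width (assumed tokenization)."""
    tokens = text.lower().split()
    if len(tokens) < width:
        # A short text yields a single shingle holding whatever tokens it has.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + width]) for i in range(len(tokens) - width + 1)}


def resemblance(text_a, text_b, width=4):
    """Size of the shingle intersection divided by the size of the union."""
    a, b = shingles(text_a, width), shingles(text_b, width)
    union = a | b
    if not union:
        return 1.0  # two empty texts are treated as identical (an assumption)
    return len(a & b) / len(union)


same = resemblance("a b c d e", "a b c d e")
disjoint = resemblance("a b c d e", "v w x y z")
partial = resemblance("a b c d e f", "a b c d e g")
```

In the last comparison each text yields three shingles, two of which are shared, so the intersection has 2 members and the union 4.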
   One approach was to use highly similar IP addresses, where the clusters produced are small; a second used highly similar URL strings with four vector match methods; a third added hyperlinks to the document URLs; and a fourth considered all documents on a host as one large document and analyzed their similarity. The database used was an AltaVista crawl from which URLs from hosts that provided fewer than 100 URLs were removed, resulting in 140.6 million URLs from 233,035 hosts. Each algorithm produced a ranked list of probable mirrored hosts, and the top 30,000 host pairs from each list were tested. Precision at rank k is the fraction of correct host pairs in the first k pairs; recall at rank k is the number of correct host pairs in the first k pairs over the number of distinct correct host pairs; and relative recall at k uses as its denominator the number of correct host pairs found in the rankings of all algorithms up to k. Single-algorithm approaches are limited in terms of recall, but a combination of five algorithms gave a precision of .57 at a recall of .86.
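
The three evaluation measures defined above can be written out directly. The sketch below assumes each ranking is a list of host pairs and that the set of correct pairs is known; the function names and the toy host names are hypothetical.

```python
def correct_in_top_k(ranking, correct, k):
    """Distinct correct host pairs among the first k pairs of a ranking."""
    return {pair for pair in ranking[:k] if pair in correct}


def precision_at_k(ranking, correct, k):
    """Fraction of the first k pairs that are correct."""
    top = ranking[:k]
    return sum(1 for pair in top if pair in correct) / len(top) if top else 0.0


def recall_at_k(ranking, correct, k):
    """Correct pairs in the first k over all distinct correct pairs."""
    return len(correct_in_top_k(ranking, correct, k)) / len(correct) if correct else 0.0


def relative_recall_at_k(ranking, all_rankings, correct, k):
    """Correct pairs in the first k over those found by any algorithm up to k."""
    pool = set()
    for other in all_rankings:
        pool |= correct_in_top_k(other, correct, k)
    return len(correct_in_top_k(ranking, correct, k)) / len(pool) if pool else 0.0


# Hypothetical host pairs and two hypothetical algorithm rankings.
correct = {("a", "b"), ("c", "d"), ("e", "f")}
rank_1 = [("a", "b"), ("x", "y"), ("c", "d")]
rank_2 = [("x", "z"), ("a", "b"), ("q", "r")]

p = precision_at_k(rank_1, correct, 2)
r = recall_at_k(rank_1, correct, 3)
rr = relative_recall_at_k(rank_2, [rank_1, rank_2], correct, 3)
```

Relative recall is the only one of the three that needs no complete list of correct pairs, since its denominator is built from what the algorithms themselves found.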

Partial Orders and Measures for Language Preferences
Leo Egghe and Ronald Rousseau
Published online 21 July 2000

   Egghe and Rousseau investigate the rate at which references from papers in a language are in that language, i.e., the (linguistic) self-citation rate. For a language with a small share of publications, a high self-citation rate indicates parochialism, but for a language with a high publication share it is to be expected. ROLP measures provide indications of relative own-language preference, normalizing for publication share. This measure can be represented as a curve running from (0,0) through (publication share, 0), (publication share, self-citation rate), and (1, self-citation rate) to (1,1); two points giving the same curve would yield an equivalent measure. For the set of all ROLP curves Ri, if R2 is at no point situated under R1, then R1 -< R2, a partial order relationship.
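
The partial order on ROLP curves can be checked mechanically. The sketch below is one reading of the construction, not the authors' formulation: each curve is treated as a step function that is 0 up to the publication share, equal to the self-citation rate from there up to 1, and 1 at x = 1, with the vertical segments resolved by taking the upper value at each jump. That convention and the function names are assumptions.

```python
def rolp_curve_value(x, share, self_cite):
    """Height of the polygonal ROLP curve at x in [0, 1] (upper value at jumps)."""
    if x >= 1.0:
        return 1.0
    return self_cite if x >= share else 0.0


def precedes(r1, r2):
    """True if curve r2 is at no point situated under curve r1.

    Each curve is given as (publication share, self-citation rate). Both
    curves are constant between breakpoints, so comparing them at 0 and at
    every breakpoint covers the whole interval.
    """
    (share_1, cite_1), (share_2, cite_2) = r1, r2
    checkpoints = (0.0, share_1, share_2, 1.0)
    return all(
        rolp_curve_value(x, share_2, cite_2) >= rolp_curve_value(x, share_1, cite_1)
        for x in checkpoints
    )


# Hypothetical languages: (publication share, self-citation rate).
lang_a = (0.5, 0.6)
lang_b = (0.3, 0.8)   # smaller share, higher self-citation rate than lang_a
lang_c = (0.3, 0.5)
lang_d = (0.5, 0.8)

a_before_b = precedes(lang_a, lang_b)
b_before_a = precedes(lang_b, lang_a)
c_d_comparable = precedes(lang_c, lang_d) or precedes(lang_d, lang_c)
```

The pair lang_c and lang_d shows why the order is only partial: each curve lies above the other somewhere, so neither precedes the other.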
   The openness of a language is the degree to which it cites another language. Here the relative citing rate is only one parameter; the relative sizes of the two languages are also important. The larger the citing language, the larger openness should be, while the larger the cited language, the smaller openness should be. Thus openness can be represented by a three-vector model, and a partial order can be defined on the resulting openness solids: for the set of all openness solids Si, if S2 is at no point situated under S1, then S1 -< S2. A minimum requirement for any openness or ROLP measure is that it respect these partial orders.

Protein Annotators' Assistant: A Novel Application of Information Retrieval Techniques
Michael J. Wise
Published online 8 August 2000
   According to Wise, the Protein Annotators' Assistant (PAA) takes protein names and, from databases in which proteins are represented as alphabetic strings with sequence and related information on function, returns lists of keywords and phrases, each with a list of the proteins associated with it. A new protein's function can be hypothesized by looking at the known functions of similar proteins. PAA extracts keywords, descriptors, and words from comment fields and runs them, using stemming, against a 24,611-word stop list to remove common English words. A 176-word accept list is also used; it overrides the forms created by the stop list's stemming rules, which sometimes have unfortunate results, and allows conditional words, usable only as phrase components, to remain. Comparison is done first using a 256-bit string representing the combined hash values of the keywords, then, for nonzero results, by a comparison of hash values, and finally by comparing the words themselves, a process that greatly reduces character string comparisons. The list of protein pairs and probability values is passed to the server, which clusters the data. The user then receives a list of proteins not in the database, a list of those with no terms in common, and a list of the user-supplied proteins, plus the keywords shared by at least two input-list proteins and, for each of these, the proteins mentioning it.
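
The three-stage comparison (bit string, then hash values, then the words themselves) might look like the sketch below. How the combined 256-bit string is actually built is not stated here, so setting one bit per keyword hash is an assumption, as are the use of CRC-32 as the hash and all function names.

```python
import zlib

SIGNATURE_BITS = 256


def word_hash(word):
    """Deterministic integer hash of a keyword (CRC-32, chosen for illustration)."""
    return zlib.crc32(word.encode("utf-8"))


def signature(words):
    """256-bit signature: one bit set per keyword hash (assumed construction)."""
    bits = 0
    for word in words:
        bits |= 1 << (word_hash(word) % SIGNATURE_BITS)
    return bits


def shared_keywords(words_a, words_b):
    """Keywords common to two proteins, found with progressively costlier tests."""
    # Stage 1: a zero AND of the signatures proves there is nothing in common.
    if (signature(words_a) & signature(words_b)) == 0:
        return set()
    # Stage 2: compare integer hash values to narrow the candidates.
    hashes_a = {word_hash(word) for word in words_a}
    candidates = [word for word in words_b if word_hash(word) in hashes_a]
    # Stage 3: compare the character strings only for the surviving candidates.
    exact = set(words_a)
    return {word for word in candidates if word in exact}


# Hypothetical keyword sets for two proteins.
common = shared_keywords({"kinase", "membrane"}, {"kinase", "binding"})
```

Most protein pairs share nothing, so the single integer AND in stage 1 rejects them without touching a string.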

Drexel's Information Science M.S. Degree Program, 1963-1971: An Insider's Recollections
Barbara Flood
Published online 11 August 2000

   We conclude with Flood's description of the eight years of the Information Science master's degree program at Drexel between 1963 and 1971, based upon her personal experience in the program as a student and faculty member from 1965 to 1975, Dean Harvey's 1967 report, and the thesis of the first student in the program. Students were required to have taken a statistics course for admission, generally had science backgrounds, and typically were mature working professionals. Extensive use was made of part-time faculty drawn from industry in the area. There was an introductory course, along with courses in publication, resources in science and technology, information center administration, abstracting and indexing, computer programming, instrumentation (which actually meant variable-length string programming for text processing), system design, management, search strategy, research methods, and thesis. The program ended as a separate faculty and degree when the faculties were merged and emphasis moved toward a Ph.D. program.

New Computer-Based Library Information Systems Designing Techniques, by Madan Mohan Kashyap
Donald R. Smith




Data on the Web: From Relations to Semistructured Data and XML, by Serge Abiteboul, Peter Buneman, and Dan Suciu
Randy Raphael




Online Retrieval: A Dialogue of Theory and Practice, by Geraldine Walker and Joseph Janes
Lynn D. Lampert






Albert Henderson



Differences between Novice and Experienced Users in Searching Information on the World Wide Web
Charles T. Meadow




Rejoinder: Differences between Novice and Experienced Users in Searching Information on the World Wide Web
Ard W. Lazonder




2000, Association for Information Science