Journal of the Association for Information Science and Technology


 Bert R. Boyce

In this issue our first and last papers are concerned with document representation by statistical analysis of text and the second, third and fourth with citation analysis. The fifth paper describes a method of visualization of WWW log data.  






A Corpus-Based Approach to Comparative Evaluation of Statistical Term Association Measures
Young Mee Chung and Jae Yun Lee
Published online 17 January 2001

       Chung and Lee examine six term association measures (cosine, Jaccard, mutual information, Yule's coefficient of colligation, χ², and likelihood ratio) to evaluate their similarity. The six measures are applied to 301,046 term pairs from the Korea Integrated News Database system. The percentage of overlap among the six ranked lists is computed for the first ten pairs, the first 20 pairs, and each further increment of ten up to 1,000 pairs, and the resulting agreement ratios are plotted. The top 10,000 term pairs from each list were clustered by the complete linkage method, and the clusters compared using Dice's coefficient. The lists were also compared using Pearson's correlation coefficient and subjected to multidimensional scaling. Yule and mutual information are most similar overall and seem to overestimate low-frequency terms, while the others all demonstrate very similar behavior for terms of high frequency. χ² is least affected by term frequency, and cosine and Jaccard tend to stress high-frequency terms.
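For readers unfamiliar with the six measures, the following is a minimal sketch of how each can be computed from a 2x2 co-occurrence table, together with the top-k overlap used to compare two ranked lists. The formulas are common textbook versions and are an assumption here; the exact variants Chung and Lee used may differ. The function names and the toy counts are illustrative only.

```python
from math import log, log2, sqrt


def association_measures(a, b, c, d):
    """Six association scores for one term pair.

    a = documents containing both terms, b = first term only,
    c = second term only, d = neither term.
    Assumes a > 0 and that no row or column total is zero.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    # Log-likelihood ratio (G^2); empty cells contribute nothing.
    g2 = 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)
    return {
        "cosine": a / sqrt(row1 * col1),
        "jaccard": a / (a + b + c),
        "mutual_information": log2(a * n / (row1 * col1)),
        "yule_y": (sqrt(a * d) - sqrt(b * c)) / (sqrt(a * d) + sqrt(b * c)),
        "chi_square": n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2),
        "likelihood_ratio": g2,
    }


def top_k_overlap(ranked_a, ranked_b, k):
    """Share of the first k items that two ranked lists have in common."""
    return len(set(ranked_a[:k]) & set(ranked_b[:k])) / k


# Toy counts, not taken from the paper.
scores = association_measures(a=10, b=10, c=10, d=70)
```

Ranking all term pairs by each score and calling `top_k_overlap` for k = 10, 20, and so on up to 1,000 would give agreement ratios of the kind the paper plots.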











Contrasting Views of Software Engineering Journals: Author Cocitation Choices and Indexer Vocabulary Assignments
Linda S. Marion and Katherine W. McCain
Published online 8 January 2001

    Using the 1996 Journal Citation Reports and the 1992 to 1998 issues of six journals suggested by a software engineer, supplemented with a journal focused on the object-oriented approach, Marion and McCain identified parent journals providing 25% of the cites by the seed journal authors, and child journals that generated 50% of cites to the seed group. Other candidate core lists came from journals frequently indexed by a set of INSPEC Thesaurus terms and codes. These were derived by ranking the terms and codes used to describe the papers in the seed journals and collecting two groups; the disjunction of each group was searched on the INSPEC database, and the resulting journal lists were ranked by frequency. The associated codes, intersected with the term "software engineering," were also searched, and the titles ranked, to produce two more candidate lists. The union of all the JCR journals and the top twenty on each INSPEC list yielded 32 journals. Cocitation counts were collected, and journals with a mean cocitation rate of seven or more were retained. Three additional journals were eliminated because of outlier status, leaving 23 core journals. A matrix of Pearson correlations based upon cocitation counts was then created for use in cluster analysis, multidimensional scaling (MDS), and pathfinder analysis. Descriptor profiles were created by searching each journal's articles and ranking the assigned descriptors, which were then aggregated for the journal, resulting in a matrix of 23 journals by 1,716 terms. Correlations between descriptor count columns indicated journal similarity. Cluster analysis of the descriptor data shows two isolate journals and four clusters: object orientation, systems management, data management, and a general class. In MDS the horizontal axis appears to represent the continuum from programming-in-the-small to programming-in-the-large, or perhaps development from design through maintenance. The vertical axis is anchored by the outliers representing computer education and knowledge-based systems. Cluster analysis of the citation data produces similar clusters, but IEEE Transactions on Software Engineering moves from the General cluster to a peripheral position with computer education. The MDS map is more complex, seemingly calling for a shift to three dimensions. The x axis remains concerned with the basic programming to system development continuum, the z axis is likely representative of the distinction between system programming and more general systems development, and the y axis is likely the range from formal methods to applications. For both kinds of data the pathfinder analysis graphically shows the levels of connectedness of the journals and the variations that occur.
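The step from raw cocitation counts to a matrix of Pearson correlations can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure. In particular, leaving out the two diagonal cells when correlating a pair of profiles is only one of several conventions in cocitation analysis, and the five-journal count matrix is invented for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def profile_correlations(counts):
    """Correlate every pair of rows in a symmetric cocitation count matrix.

    Assumption: for the pair (i, j), cells i and j are skipped so that the
    undefined self-cocitation values do not influence the result.
    """
    n = len(counts)
    corr = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            keep = [k for k in range(n) if k not in (i, j)]
            r = pearson([counts[i][k] for k in keep],
                        [counts[j][k] for k in keep])
            corr[i][j] = corr[j][i] = r
    return corr


# Invented counts for five hypothetical journals.
counts = [
    [0, 12, 9, 1, 2],
    [12, 0, 8, 2, 1],
    [9, 8, 0, 1, 3],
    [1, 2, 1, 0, 10],
    [2, 1, 3, 10, 0],
]
corr = profile_correlations(counts)
```

A matrix like `corr` is the kind of input that cluster analysis, MDS, and pathfinder analysis would then work from.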






















The Effect of the Web on Undergraduate Citation Behavior 1996-1999
Philip M. Davis and Suzanne A. Cohen
Published online 8 January 2001

       Davis and Cohen collected 68 undergraduate student microeconomic term papers from 1996 and 69 from 1999 and extracted the bibliographies. These were coded as book, journal, magazine, newspaper, Web, other, or unidentifiable. Web source citations were verified online to see if they still existed and were classed as: found directly, not found directly but found elsewhere, found after correcting a typographical error, and not found (after a site and Google search).

   The average number of citations increased from 11.6 in 1996 to 11.9 in 1999. The mean number of journals and magazines did not change significantly. Overall median citations increased from 10 to 12. Book citations dropped from 30% to 19%, Web citations went from 9% to 21%, and newspapers increased from 7% to 16%. There was a significant decline in the use of books and journals in favor of the use of newspapers and magazines, interpreted as a decline in the use of scholarly materials. For 1999 URLs, 55% went directly to a cited document, 19% were found elsewhere, 10% contained errors, and 16% were not found. Of the 1996 citations, only 18% of the URLs still led directly to the cited document, 26% were found elsewhere, 3% had errors, and 53% could not be found. The authors believe stricter guidelines for acceptable citations are called for, as are the creation of scholarly portals and increased instruction on resource evaluation.













 Fitting the Jigsaw of Citation: Information Visualization in Domain Analysis
 Chaomei Chen, Ray J. Paul, and Bob O'Keefe
 Published online 22 January 2001

   Chen, Paul, and O'Keefe create a visualization of the discipline of computer graphics from the author cocitations of contributors above a five-citation threshold in 18 years of IEEE Computer Graphics and Applications, using pathfinder network scaling. Raw cocitation counts were transformed into Pearson's correlation coefficients between author cocitation profiles, and a principal component analysis was performed. The identified dimensions were associated with subfields of the domain by examining the work of the leading authors in each of five factors. The three most significant factors were used to color the appropriate areas of the spatial model produced, and the other two dimensions were shown in a series of animated frames.









Using Interactive Visualizations of WWW Log Data to Characterize Access Patterns and Inform Site Design
 Harry Hochheiser and Ben Shneiderman
Published online 5 January 2001

   Hochheiser and Shneiderman describe Spotfire, which generates a variety of interactive visualizations of the usage of World Wide Web servers based upon activity logs. Preprocessing includes removal of requests for non-HTML objects and of local domain requests. The resulting file includes client host name, timestamp, URL, category, HTTP status, bytes delivered, HTTP referer, and user agent. Displays of various pairs of these fields on Cartesian coordinates provide information on links being used, time of site visits, and the locations of visitors. With the addition of color and zooming and filtering capabilities, it is possible to explore data across several dimensions to reflect the current interests of the provider.
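The preprocessing step can be illustrated with a short sketch. It assumes the log has already been parsed into records with `host` and `url` fields. The list of extensions treated as HTML and the local domain `example.edu` are hypothetical placeholders, not details from the paper.

```python
import posixpath
from urllib.parse import urlparse

# Assumption: paths with no extension (such as "/" or "/papers/") count as HTML pages.
HTML_EXTENSIONS = {"", ".html", ".htm"}
# Hypothetical local domain whose requests should be dropped.
LOCAL_DOMAIN = "example.edu"


def is_html_request(url):
    """True when the requested path looks like an HTML page rather than an image or script."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in HTML_EXTENSIONS


def is_local(host):
    """True when the client host belongs to the local domain."""
    host = host.lower()
    return host == LOCAL_DOMAIN or host.endswith("." + LOCAL_DOMAIN)


def preprocess(records):
    """Keep only HTML page requests that come from outside the local domain."""
    return [r for r in records
            if is_html_request(r["url"]) and not is_local(r["host"])]


# Invented sample records.
records = [
    {"host": "visitor.example.org", "url": "/index.html"},
    {"host": "visitor.example.org", "url": "/img/logo.gif"},
    {"host": "lab1.example.edu", "url": "/index.html"},
    {"host": "other.example.net", "url": "/papers/?sort=date"},
]
kept = preprocess(records)
```

The records that survive would then carry the remaining fields (timestamp, category, status, and so on) into the visualization tool.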









Effective Ranking with Arbitrary Passages
Marcin Kaszkiel and Justin Zobel
Published online 22 January 2001

   Using cosine and a pivoted-cosine similarity measure with a slope set to 0.2, and thus a normalization factor favoring longer documents, Kaszkiel and Zobel compare various passage types as document representatives. They utilize five TREC test collections with both short (words from title fields) and long (full TREC topics) queries, and evaluate with average eleven-point precision and precision at 5, 10, 20, 30, and 200 document cutoffs. Some passages are constructed using Hearst's TextTiling algorithm, where token sequences are conjoined into blocks that overlap with one another, and similarities between adjacent blocks are used to determine topic shifts, and thus the borders between sets of blocks called tiles. These are supplemented by passages in the form of fixed-length windows of 150 and 350 tokens, and by discourse passages, which use the document's paragraph or section structure. Retrieval based on passages is up to 50% better than that based on whole-document ranking. However, using the Wilcoxon signed rank test, only a small number of the improvements are significant. Passages based on pages most often show the strongest improvement. Pivoted cosine similarity improves performance in collections where document length varies.

   The possibility of arbitrary fixed-length passages was investigated by taking 12 window sets from 50 to 600 words, starting at 25-word intervals. Using only those passages containing the rarest query terms, similarities are computed for the passages and the query. In all test collections and for both short and long queries, fixed-length arbitrary passages were found significantly superior to document rankings, but not significantly superior to predefined passages like pages. No particular length of passage appeared to have any advantage. Variable-length arbitrary passages can be created by choosing from the 12 lengths based upon their success as fixed-length passages with long and short queries. This variation provides significant improvement over full-document ranking for all but one collection.
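As a rough illustration of arbitrary fixed-length passages, the sketch below cuts a token list into overlapping windows that start every 25 tokens and scores a document by its best passage. The twelve lengths are read here as 50, 100, and so on up to 600, which is an assumption. The handling of documents shorter than the window is also assumed, and the term-count score is only a stand-in for the cosine and pivoted-cosine measures used in the paper.

```python
# Assumption: the 12 window lengths are spaced 50 words apart.
PASSAGE_LENGTHS = list(range(50, 601, 50))


def arbitrary_passages(tokens, length, step=25):
    """Return (start, window) pairs for windows of `length` tokens every `step` tokens.

    A document shorter than `length` yields a single passage holding all of it.
    """
    last_start = max(len(tokens) - length, 0)
    return [(start, tokens[start:start + length])
            for start in range(0, last_start + 1, step)]


def term_count_score(passage, query_terms):
    """Stand-in similarity: number of passage tokens that are query terms."""
    return sum(1 for token in passage if token in query_terms)


def best_passage_score(tokens, query_terms, length, step=25):
    """Score a document by its highest-scoring passage of the given length."""
    return max(term_count_score(window, query_terms)
               for _, window in arbitrary_passages(tokens, length, step))


# Invented document: 200 filler tokens with two query terms near the end.
tokens = ["w%d" % i for i in range(200)]
tokens[180] = "passage"
tokens[185] = "retrieval"
score = best_passage_score(tokens, {"passage", "retrieval"}, length=50)
```

Ranking documents by `best_passage_score` instead of a whole-document score is the basic idea behind passage-based retrieval as summarized above.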


















Information Appliances and Beyond: Interaction Design for Consumer Products, Eric Bergman, Editor
Charles Hanson  













2001, Association for Information Science and Technology