|
| |
Bert R. Boyce
In this issue, our first and last papers are concerned with document representation by statistical analysis of text, and the second, third, and fourth with citation analysis. The fifth paper describes a method for visualizing WWW log data. |
281 |
RESEARCH |
| |
A Corpus-Based Approach to Comparative Evaluation of Statistical Term Association Measures
Young Mee Chung and Jae Yun Lee
Published online 17 January 2001
Chung and Lee examine six term association measures (cosine, Jaccard, mutual information, Yule's coefficient of colligation, χ², and likelihood ratio) to evaluate their similarity. The six measures are applied to 301,046 term pairs from the Korea Integrated News Database system. The percentage of overlap among the six ranked lists is computed for the first ten pairs, the first 20 pairs, and each further increment of ten up to 1,000 pairs, and the resulting agreement ratios are plotted. The top 10,000 term pairs from each list were clustered by the complete linkage method, and the clusters compared using Dice's coefficient. The lists were also compared using Pearson's correlation coefficient and subjected to multidimensional scaling. Yule and mutual information are most similar overall and seem to overestimate low-frequency terms, while the others all demonstrate very similar behavior for terms of high frequency. χ² is least affected by term frequency, and cosine and Jaccard tend to stress high-frequency terms.
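As a rough illustration of the six measures named above, the sketch below computes them from a 2×2 co-occurrence table for one term pair. The formulas are common textbook forms and are an assumption here (for instance, "mutual information" is taken to be the pointwise variant); the summary does not state the exact definitions Chung and Lee used, and the function and key names are hypothetical.

```python
import math

def association_measures(a, b, c, d):
    """Association scores for one term pair from a 2x2 contingency table.

    a = units containing both terms, b = first term only,
    c = second term only, d = neither term.
    Assumes a > 0 and that every row and column total is positive.
    """
    n = a + b + c + d
    r1, r2 = a + b, c + d          # row totals (first term present / absent)
    c1, c2 = a + c, b + d          # column totals (second term present / absent)

    cosine = a / math.sqrt(r1 * c1)
    jaccard = a / (a + b + c)
    # Pointwise mutual information (assumed variant).
    mutual_information = math.log2(a * n / (r1 * c1))
    # Yule's coefficient of colligation, usually written Y.
    root_ad, root_bc = math.sqrt(a * d), math.sqrt(b * c)
    yule_y = (root_ad - root_bc) / (root_ad + root_bc)
    chi_square = n * (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)
    # Log-likelihood ratio statistic: 2 * sum(O * ln(O / E)) over non-empty cells.
    observed = [a, b, c, d]
    expected = [r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n]
    likelihood_ratio = 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )
    return {
        "cosine": cosine,
        "jaccard": jaccard,
        "mutual_information": mutual_information,
        "yule_y": yule_y,
        "chi_square": chi_square,
        "likelihood_ratio": likelihood_ratio,
    }

# Made-up counts for illustration only.
scores = association_measures(10, 10, 10, 70)
```

Ranking all term pairs by each score in turn gives six lists whose top segments can then be compared for overlap, which is the kind of comparison the summary describes.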
|
283 |
| |
Contrasting Views of Software Engineering Journals: Author Cocitation Choices and Indexer Vocabulary Assignments
Linda S. Marion and Katherine W. McCain
Published online 8 January 2001
Using the 1996 Journal Citation Reports and the 1992 to 1998 issues of six journals suggested by a software engineer, supplemented with a journal focused on the object-oriented approach, Marion and McCain identified parent journals providing 25% of the cites by the seed journal authors, and child journals that generated 50% of the cites to the seed group. Other candidate core lists came from journals frequently indexed by a set of INSPEC Thesaurus terms and codes. The terms and codes used to describe the papers in the seed journals were ranked and collected into two groups; the disjunction of each group was searched on the INSPEC database, and the resulting journal lists were ranked by frequency. The associated codes, intersected with the term "software engineering," were also searched, and the titles ranked, to produce two more candidate lists. The union of all the JCR journals and the top twenty on each INSPEC list yielded 32 journals. Cocitation counts were collected, and journals with a mean cocitation rate of seven or more were retained. Three additional journals were eliminated because of outlier status, leaving 23 core journals. A matrix of Pearson correlations based upon cocitation counts was then created for use in cluster analysis, multidimensional scaling (MDS), and pathfinder analysis. Descriptor profiles were created by searching each journal's articles and ranking the assigned descriptors, which were then aggregated for the journal, resulting in a 23-journal by 1,716-term matrix. Correlations between descriptor count columns indicated journal similarity. Cluster analysis of the descriptor data shows two isolate journals and four clusters: object orientation, systems management, data management, and a general class. In MDS the horizontal axis appears to represent the continuum from programming-in-the-small to programming-in-the-large, or perhaps development from design through maintenance. The vertical axis is anchored by the outliers representing computer education and knowledge-based systems. Cluster analysis of the citation data produces similar clusters, but IEEE Transactions on Software Engineering moves from the General cluster to a peripheral position with computer education. The MDS map is more complex, seemingly calling for a shift to three dimensions. The x axis remains concerned with the basic programming to system development continuum, the z axis is likely representative of the distinction between system programming and more general systems development, and the y axis is likely the range from formal methods to applications. For both kinds of data, the pathfinder analysis graphically shows the levels of connectedness of the journals and the variations that occur. |
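The step of turning raw cocitation counts into a Pearson correlation matrix can be sketched as follows. This is a minimal illustration with made-up counts for five hypothetical journals; skipping the two cells that involve the pair itself is one common way to handle the undefined diagonal and is an assumption here, not a detail taken from the summary.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else 0.0  # 0.0 when a profile is constant

def correlation_matrix(counts):
    """Correlate every pair of rows of a symmetric cocitation count matrix.

    For journals i and j, the profiles are compared over all other journals;
    cells i and j are skipped (assumed diagonal treatment).
    """
    size = len(counts)
    result = [[1.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            others = [k for k in range(size) if k not in (i, j)]
            profile_i = [counts[i][k] for k in others]
            profile_j = [counts[j][k] for k in others]
            result[i][j] = result[j][i] = pearson(profile_i, profile_j)
    return result

# Hypothetical cocitation counts, for illustration only.
cocitations = [
    [0, 12, 3, 8, 1],
    [12, 0, 4, 9, 2],
    [3, 4, 0, 2, 10],
    [8, 9, 2, 0, 1],
    [1, 2, 10, 1, 0],
]
matrix = correlation_matrix(cocitations)
```

A matrix of this shape is the usual input to cluster analysis and multidimensional scaling, since it compares whole cocitation profiles instead of single raw counts.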
297
|
| |
The Effect of the Web on Undergraduate Citation Behavior 1996-1999
Philip M. Davis and Suzanne A. Cohen
Published online 8 January 2001
Davis and Cohen collected 68 undergraduate student microeconomic term papers from 1996 and 69 from 1999 and extracted the bibliographies. Citations were coded as book, journal, magazine, newspaper, Web, other, or unidentifiable. Web source citations were checked online to see if they still existed and were classed as: found directly, not found directly but found elsewhere, found after correcting a typographical error, and not found (after a site and Google search). The average number of citations increased from 11.6 in 1996 to 11.9 in 1999. The mean number of journals and magazines did not change significantly. Overall median citations increased from 10 to 12. Book citations dropped from 30% to 19%, Web citations went from 9% to 21%, and newspapers increased from 7% to 16%. There was a significant decline in the use of books and journals in favor of the use of newspapers and magazines, interpreted as a decline in the use of scholarly materials. For 1999 URLs, 55% went directly to a cited document, 19% were found elsewhere, 10% contained errors, and 16% were not found. Of the 1996 citations, only 18% of the URLs still led directly to the cited document, 26% were found elsewhere, 3% had errors, and 53% could not be found. The authors believe stricter guidelines for acceptable citations are called for, as are the creation of scholarly portals and increased instruction on resource evaluation.
|
309
|
| |
Fitting the Jigsaw of Citation: Information Visualization in Domain Analysis
Chaomei Chen, Ray J. Paul, and Bob O'Keefe
Published online 22 January 2001
Chen, Paul, and O'Keefe create a visualization of the discipline of computer graphics from the author cocitations of contributors above a five-citation threshold in 18 years of IEEE Computer Graphics and Applications, using pathfinder network scaling. Raw cocitation counts were transformed into Pearson's correlation coefficients between author cocitation profiles, and a principal component analysis was performed. The identified dimensions were associated with subfields of the domain by examining the work of the leading authors in each of five factors. The three most significant factors were used to color the appropriate areas of the spatial model produced, and the other two dimensions were shown in a series of animated frames. |
315
|
| |
Using Interactive Visualizations of WWW Log Data to Characterize Access Patterns and Inform Site Design
Harry Hochheiser and Ben Shneiderman
Published online 5 January 2001
Hochheiser and Shneiderman describe Spotfire, which generates a variety of interactive visualizations of the usage of World Wide Web servers based upon activity logs. Preprocessing includes removal of requests for non-HTML objects and of local domain requests. The resulting file includes client host name, timestamp, URL, category, HTTP status, bytes delivered, HTTP referer, and user agent. Displays of various pairs of these fields on Cartesian coordinates provide information on the links being used, the times of site visits, and the locations of visitors. With the addition of color and of zooming and filtering capabilities, it is possible to explore the data across several dimensions to reflect the current interests of the provider. |
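The preprocessing step described above (dropping requests for non-HTML objects and requests from the local domain) might look roughly like the following. This is only a sketch: the record field names, the extension list, and the local domain `example.edu` are hypothetical, and nothing here reflects how the authors or Spotfire actually implement it.

```python
from urllib.parse import urlparse

# Hypothetical settings: which extensions count as non-HTML objects,
# and which domain counts as "local".
NON_HTML_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")
LOCAL_DOMAIN = "example.edu"

def keep_request(record):
    """Return True if a parsed log record should stay in the analysis file."""
    path = urlparse(record["url"]).path.lower()
    if path.endswith(NON_HTML_EXTENSIONS):
        return False  # image, stylesheet, script, or similar object
    host = record["client_host"].lower()
    if host == LOCAL_DOMAIN or host.endswith("." + LOCAL_DOMAIN):
        return False  # request came from inside the local domain
    return True

def preprocess(records):
    """Filter a list of parsed log records (dicts) down to the kept requests."""
    return [record for record in records if keep_request(record)]

# Made-up records for illustration only.
sample = [
    {"client_host": "dialup42.isp.example.net", "url": "/index.html", "status": 200},
    {"client_host": "dialup42.isp.example.net", "url": "/images/logo.gif", "status": 200},
    {"client_host": "lab3.example.edu", "url": "/papers/list.html", "status": 200},
    {"client_host": "host.example.org", "url": "/papers/", "status": 304},
]
kept = preprocess(sample)
```

Each kept record can then be plotted as a point, for example with timestamp on one axis and URL on the other, which is the kind of paired display the summary mentions.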
331
|
| |
Effective Ranking with Arbitrary Passages
Marcin Kaszkiel and Justin Zobel
Published online 22 January 2001
Using cosine and a pivoted-cosine similarity measure with a slope set to 0.2, and thus a normalization factor favoring longer documents, Kaszkiel and Zobel utilize five TREC test collections with both short (words from title fields) and long (full TREC topics) queries, and measure average eleven-point precision and precision at 5, 10, 20, 30, and 200 document cutoffs, to compare various passage types as document representatives. Here, passages are constructed using Hearst's TextTiling algorithm, in which token sequences are joined into blocks that overlap with one another, and similarities between adjacent blocks are used to determine topic shifts, and thus the borders of the sets of blocks called tiles. These are supplemented by passages in the form of fixed-length windows of 150 and 350 tokens, and by discourse passages, which use the document's paragraph or section structure. Retrieval based on passages is up to 50% better than that based on whole-document ranking. However, using the Wilcoxon signed rank test, only a small number of the improvements are significant. Passages based on pages most often show the strongest improvement. Pivoted cosine similarity improves performance in collections where document length varies. The possibility of arbitrary fixed-length passages was investigated by taking 12 window sets from 50 to 600 words, starting at 25-word intervals. Using only those passages containing the rarest query terms, similarities are computed between the passages and the query. In all test collections and for both short and long queries, fixed-length arbitrary passages were found significantly superior to document rankings, but not significantly superior to predefined passages like pages. No particular length of passage appeared to have any advantage. Variable-length arbitrary passages can be created by choosing from the 12 lengths based upon their success as fixed-length passages with long and short queries. This variation provides significant improvement over full-document ranking for all but one collection. |
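The idea of fixed-length arbitrary passages can be sketched as below: windows of a given length start every 25 words, each window is scored against the query, and the document takes the score of its best window. This is a simplified illustration under stated assumptions. It uses plain term-frequency cosine, not the weighting or pivoted normalization the authors used; it assumes the 12 lengths are evenly spaced at 50-word steps; and it omits the restriction to passages containing the rarest query terms. All names are hypothetical.

```python
import math
from collections import Counter

# Assumed reading of "12 window sets from 50 to 600 words": 50, 100, ..., 600.
PASSAGE_LENGTHS = list(range(50, 601, 50))

def cosine(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term frequencies."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def fixed_length_passages(tokens, length, step=25):
    """Windows of `length` tokens starting at every `step`-th position.

    A document shorter than `length` yields itself as its only passage.
    A tail shorter than one step may not begin a window of its own.
    """
    if len(tokens) <= length:
        return [tokens]
    return [tokens[s:s + length] for s in range(0, len(tokens) - length + 1, step)]

def best_passage_score(doc_tokens, query_tokens, length):
    """Score a document by its most similar fixed-length passage."""
    return max(
        cosine(passage, query_tokens)
        for passage in fixed_length_passages(doc_tokens, length)
    )

# Made-up document: the query terms sit in the middle of unrelated text.
doc = ["x"] * 100 + ["apple", "banana"] + ["y"] * 100
query = ["apple", "banana"]
passage_score = best_passage_score(doc, query, 50)
document_score = cosine(doc, query)
```

In this toy case the best 50-word window scores higher than the whole document, because the surrounding unrelated text no longer dilutes the match.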
344
|
| |
BOOK REVIEWS |
| |
Information Appliances and Beyond: Interaction Design for Consumer Products, Eric Bergman, Editor Charles Hanson |
365 |
| |
CALL FOR PAPERS |
367 |