Volume 54  Issue 8


In this issue
Bert Boyce


Graph Structure in Three National Academic Webs: Power Laws with Anomalies
Mike Thelwall and David Wilkinson
Published online 16 April 2003

Thelwall and Wilkinson use crawls of university web sites in the UK, Australia, and New Zealand to collect all links targeted at university web sites in the same country, which they then use to create a graph structure for study. Using Broder's study as a model, they identify a strongly connected component (SCC), in which one can start at any page and reach every other page in the set, and an Out component, whose pages can be reached from all strongly connected pages but provide no link back to that set. The other components in the Broder model are not accessible except with access to a major search engine database. Inlink and outlink counts for all three university systems, in both the Out and SCC components, appear linear when graphed logarithmically, which would indicate that power laws, and a success-breeds-success phenomenon, are generally in effect. However, automatically generated pages, non-HTML web pages, and large resource-driven sites were all associated with anomalies in this observation.
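
The SCC and Out components described above can be illustrated with a small sketch. This is not the authors' crawler or analysis code; the function names `reachable` and `scc_and_out`, the `seed` page, and the toy link list are all assumptions made for illustration. The idea is that the SCC containing a seed page is the set of pages that are both reachable from the seed and able to reach it, and Out is whatever else is reachable from the seed.

```python
from collections import defaultdict


def reachable(start, adjacency):
    """Return every node reachable from start (including start) via iterative DFS."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def scc_and_out(links, seed):
    """Split pages into the SCC containing `seed` and the Out component.

    links: iterable of (source_page, target_page) pairs (hypothetical input format).
    seed:  a page assumed to lie inside the strongly connected component.
    """
    forward = defaultdict(set)
    backward = defaultdict(set)
    for source, target in links:
        forward[source].add(target)
        backward[target].add(source)

    descendants = reachable(seed, forward)   # pages the seed can reach
    ancestors = reachable(seed, backward)    # pages that can reach the seed
    scc = descendants & ancestors            # mutually reachable pages
    out = descendants - scc                  # reachable, but with no path back
    return scc, out


if __name__ == "__main__":
    toy_links = [("a", "b"), ("b", "c"), ("c", "a"),  # a cycle: the SCC
                 ("c", "d"), ("d", "e"),              # pages with no link back
                 ("x", "a")]                          # links in, but is never linked to
    print(scc_and_out(toy_links, "a"))
```

Page `x` in the toy data falls in neither set, which mirrors the point that the remaining Broder components cannot be found by following links outward from the SCC.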







Efficient Single-Pass Index Construction for Text Databases
Steffen Heinz and Justin Zobel
Published online 16 April 2003

Heinz and Zobel review file inversion processes for the creation of text indices and suggest an efficient single-pass approach. Complete in-memory indexing remains impractical for very large files, and current rapid algorithms require that the entire vocabulary of the collection be kept in memory. The proposed approach creates inverted files in memory for sequences of documents until memory resources are exhausted, then transfers the lexicon and inverted file to disk in lexicographical order for subsequent merger. Each term is assigned a dynamic in-memory bit vector that accumulates postings in a compressed d-gap format. The lexicon is maintained in a burst trie structure whose leaves are containers of strings with common prefixes. Performance on five-gigabyte to twenty-gigabyte files is fifteen to twenty percent faster than a sort-based approach.
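
A minimal sketch of the in-memory part of such a scheme follows, under several stated assumptions: a plain Python dict stands in for the burst trie, variable-byte coding is used as one possible way to compress d-gaps (the summary does not say which code the authors use), document identifiers start at 1 and only increase, and the final merge of runs on disk is omitted. The names `RunBuilder`, `vbyte_encode`, and `decode_postings` are hypothetical.

```python
def vbyte_encode(n):
    """Encode a non-negative integer in 7-bit groups, low group first.

    The high bit is set only on the final byte of each number.
    """
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data):
    """Decode a byte string produced by repeated vbyte_encode calls."""
    numbers, value, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:                      # terminating byte of a number
            numbers.append(value | ((byte & 0x7F) << shift))
            value, shift = 0, 0
        else:
            value |= byte << shift
            shift += 7
    return numbers


class RunBuilder:
    """Accumulates compressed postings in memory for one run of documents."""

    def __init__(self):
        self.postings = {}   # term -> bytearray of encoded d-gaps (dict replaces the burst trie)
        self.last_doc = {}   # term -> last document id recorded for that term

    def add_document(self, doc_id, text):
        for term in set(text.lower().split()):
            gap = doc_id - self.last_doc.get(term, 0)   # d-gap: difference from previous id
            self.postings.setdefault(term, bytearray()).extend(vbyte_encode(gap))
            self.last_doc[term] = doc_id

    def flush(self):
        """Return (term, compressed postings) in lexicographic order, then reset."""
        run = [(term, bytes(self.postings[term])) for term in sorted(self.postings)]
        self.postings.clear()
        self.last_doc.clear()
        return run


def decode_postings(data):
    """Turn compressed d-gaps back into absolute document ids."""
    doc_ids, current = [], 0
    for gap in vbyte_decode(data):
        current += gap
        doc_ids.append(current)
    return doc_ids
```

In a full system, `flush` would be called whenever a memory budget is exceeded and each returned run written to disk; because every run is already sorted by term, the runs can later be combined with a straightforward multi-way merge.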









Automatic Construction of English/Chinese Parallel Corpora
Christopher C. Yang and Kar Wing Li
Published online 16 April 2003

Yang and Li describe the automatic matching of English and Chinese document titles by character and word matching, based upon the study of web pages within a site where some pages exist separately in each language. Word and character alignment is followed by redundancy resolution, and then title alignment takes place. English words are translated into Chinese character strings by dictionary lookup, and the various possibilities are matched with the Chinese titles using the longest common subsequence of characters. Using Hong Kong Special Administrative Region government press releases and releases from the Hong Kong and Shanghai Banking Corporation, they find 31,567 releases in Chinese and 30,810 in English, but only 23,701 released in parallel. There are no links between the versions. With Recall defined as the number of correct system matches over the actual matches in the file, and Precision as the number of correct system matches over the number of system matches, a test yields Precision in the range of .998 to 1.00 and Recall from .806 to .948. Thus links to parallel documents in the other language could quite likely be generated automatically.
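
The character matching and the two evaluation measures can be sketched as follows. This is an illustrative reading of the summary, not the authors' alignment procedure: dictionary lookup and redundancy resolution are left out, and the function names and sample titles are assumptions. The longest common subsequence here allows gaps between matched characters.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def best_title_match(candidate, chinese_titles):
    """Pick the Chinese title sharing the longest character subsequence with the candidate."""
    return max(chinese_titles, key=lambda title: lcs_length(candidate, title))


def precision_recall(system_pairs, true_pairs):
    """Precision = correct / system matches; Recall = correct / actual matches."""
    system_pairs, true_pairs = set(system_pairs), set(true_pairs)
    correct = len(system_pairs & true_pairs)
    precision = correct / len(system_pairs) if system_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical translated candidate and Chinese titles, for illustration only.
    titles = ["上海银行公告", "香港特区政府新闻公报"]
    print(best_title_match("香港政府新闻", titles))
    print(precision_recall({(1, 1), (2, 2)}, {(1, 1), (2, 2), (3, 3)}))
```

A system that proposes few but reliable pairs, as in the second call, shows the same pattern as the reported results: precision near 1 with lower recall.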








Mining Longitudinal Web Queries: Trends and Patterns
Peiling Wang, Michael W. Berry, and Yiheng Yang
Published online 16 April 2003

Wang, Berry, and Yang log hit counts and date-stamped queries to the University of Tennessee website over a four-year period, as entered through the SWISH search engine as Boolean statements in which spaces were treated as AND operators. Queries were parsed into words and into word pairs made up of adjacent words or words separated by one other word (94% of queries contained three words or fewer). URLs were not parsed but were treated as unusual queries. Null outputs exceed 30%. Queries averaged 2 words or 13 characters. The number of queries and the vocabulary used grow over time, but the vocabulary is relatively small and includes a large number (26%) of misspelled words and personal names. Log plots of frequencies and ranks for all words and for words with unique frequencies overlap in the upper portion, which follows a quadratic polynomial, and diverge in the lower portion, where the all-word line becomes linear. Topics and search behavior vary little over the four-year period. Websites could be improved by including content identified from queries.
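
The word and word-pair parsing can be sketched in a few lines. The details are assumptions: whitespace tokenization, lowercasing, and the `looks_like_url` heuristic are illustrative choices, not the authors' documented rules, and the sample queries are invented.

```python
from collections import Counter


def looks_like_url(query):
    """Crude, assumed heuristic for spotting URL queries, which are left unparsed."""
    return query.strip().lower().startswith(("http://", "https://", "www."))


def words_and_pairs(query):
    """Split a query into words and into ordered pairs of words that are
    adjacent or separated by exactly one other word."""
    words = query.lower().split()
    pairs = []
    for i, first in enumerate(words):
        for distance in (1, 2):              # 1 = adjacent, 2 = one word in between
            if i + distance < len(words):
                pairs.append((first, words[i + distance]))
    return words, pairs


def tally(queries):
    """Count word and word-pair frequencies over a query log, skipping URLs."""
    word_counts, pair_counts = Counter(), Counter()
    for query in queries:
        if looks_like_url(query):
            continue
        words, pairs = words_and_pairs(query)
        word_counts.update(words)
        pair_counts.update(pairs)
    return word_counts, pair_counts


if __name__ == "__main__":
    log = ["football ticket office", "ticket office", "www.example.edu"]
    words, pairs = tally(log)
    print(words.most_common(2))
    print(pairs.most_common(1))
```

Ranking the resulting counts and plotting frequency against rank on log axes would give the kind of curve the summary describes.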










Students' Conceptual Structure, Search Process, and Outcome While Preparing a Research Proposal: A Longitudinal Case Study
Mikko Pennanen and Pertti Vakkari
Published online 16 April 2003

Pennanen and Vakkari use 22 undergraduate psychology students doing Boolean searches on PsycINFO, a system with which they were unfamiliar, to investigate the relationship between the students' conceptual structure of their topic and their search process, and whether these relations vary depending upon their stage in the Kuhlthau model. Students searched at both the beginning and the end of their construction of a proposal, and each search was preceded and followed by an interview. The thought process during search was vocalized and recorded, and transaction logs were also retained. The authors recorded the number of concepts used by a student, the proportion of sub-concepts included, and the proportion of concepts expressed in query terms. Retrieved useful references were also recorded. The two main tactics used by the subjects were the adding of a conjoined term and the replacement of an existing term with another. The students were able to translate into query terms only slightly more than half the concepts they identified. The subjects advanced significantly in terms of the Kuhlthau model between their search sessions. Their conceptual structure became richer and the number of search terms used increased, but the number of references accepted as useful decreased. The proportion of concepts articulated in the query correlated significantly with the number of useful references found.








Information Science Abstracts: Tracking the Literature of Information Science. Part 2: A New Taxonomy for Information Science
Donald T. Hawkins, Signe E. Larson and Bari Q. Caton
Published online 16 April 2003

Using 3000 abstracts from Information Science Abstracts (ISA), Hawkins, Larson, and Caton test the validity of a new ISA classification structure for information science, leading to the revision and fine-tuning of the structure. The structure was produced by collecting terms from available vocabularies and grouping them into 13 main headings. Each of the researchers gave each abstract only one classification number, representing a main heading and a single sub-heading. A review of the distribution of abstracts over sections suggested combining some closely related categories, and the presence of unclassifiable abstracts pointed to gaps in coverage. Only in 19% of the cases did all three disagree on the assignment of a main heading. A second test with 1265 abstracts showed that the abstracts were well distributed over what were now 11 main sections. Low-posted sub-headings were examined but retained as growing areas. The taxonomy is included as an appendix.








Improving the Search Environment: Informed Decision Making in the Search for Statistical Information
Stephanie W. Haas
Published online 16 April 2003

Studying the Bureau of Labor Statistics' LABSTAT database, Haas looks for decision points in searching at which assistance for the searcher may be of value. A searcher has some measure of both search and domain knowledge and will need some knowledge of the way information is provided to make effective decisions. Transition points are identified between user vocabulary and Bureau of Labor Statistics (BLS) concepts, between BLS concepts and BLS data and information products, and between these products and the actual query. This suggests the need for help in concept definition, ambiguity resolution, and synonym usage, which is not uncommon in retrieval systems, but also for assistance in the choice of products through a matrix of specifiable categories with available surveys and series. The need to express a query for a chosen survey or series suggests a need for variable displays, information on the interaction of variable-value choices, and warnings of unusual situations.









Using the Mann-Whitney Test on Informetric Data
John C. Huber and Roland Wagner-Döbler
Published online 16 April 2003

Huber and Wagner-Döbler demonstrate a relatively simple procedure for implementing, in a spreadsheet, a Mann-Whitney test of the difference between two bibliometric samples that takes into account the large number of ties normally present in such data. Sources with the same count of publications are assigned the same rank, where the value is the median of the number of such sources in both samples. The lower the p-level, the stronger the evidence that the samples are from different distributions. It is thus possible to determine whether a change in productivity is due to factors beyond the change in number of sources. However, small samples with small differences will appear to be from the same distribution, and larger samples are necessary to overcome the effect of multiple ties.
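
For readers who prefer code to a spreadsheet, the following is a minimal sketch of a Mann-Whitney test with the standard textbook treatment of ties. It is not the authors' spreadsheet procedure: it assumes mid-ranks for tied values, a normal approximation with the usual tie-corrected variance, a two-sided p-value, and no continuity correction. The function name and the sample data are hypothetical.

```python
import math
from collections import Counter


def mann_whitney_ties(sample_a, sample_b):
    """Return (U for sample_a, z score, two-sided p) using mid-ranks for ties."""
    n1, n2 = len(sample_a), len(sample_b)
    n = n1 + n2

    # How many observations share each value across both samples.
    counts = Counter(sample_a)
    counts.update(sample_b)

    rank_of = {}
    tie_term = 0        # sum of (t^3 - t) over groups of tied values
    next_rank = 1
    for value in sorted(counts):
        t = counts[value]
        rank_of[value] = next_rank + (t - 1) / 2   # middle of the ranks the group occupies
        tie_term += t ** 3 - t
        next_rank += t

    rank_sum_a = sum(rank_of[v] for v in sample_a)
    u_a = rank_sum_a - n1 * (n1 + 1) / 2

    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:      # every observation tied: no evidence of a difference
        return u_a, 0.0, 1.0

    z = (u_a - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))           # two-sided normal tail probability
    return u_a, z, p


if __name__ == "__main__":
    # Hypothetical publication counts per source in two periods.
    period_1 = [1, 1, 2, 3]
    period_2 = [1, 2, 2, 5]
    print(mann_whitney_ties(period_1, period_2))
```

With samples this small and this heavily tied, the p-value stays large, which is consistent with the caution that small samples with small differences will appear to come from the same distribution.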


Patents, Citations & Innovations: A Window on the Knowledge Economy, by Adam B. Jaffe and Manuel Trajtenberg
Reviewed by Chaomei Chen
Published online 16 April 2003


"Empirical Evidence of Self-Organization": a rejoinder
Loet Leydesdorff


Arguments for Epistemology in Information Science
Birger Hjørland

